The Next-Token Prediction Pipeline: From Raw Text to Transformer Output
Demystifying the token-to-token pipeline replaces the illusion of 'understanding' with a concrete mental model of matrix operations and probability. That clarity helps developers debug generation, choose sampling parameters, and reason about costs, latency, and failure modes.
An LLM's core loop is deceptively simple: given a sequence of tokens, predict the next token. Under the hood, raw text is tokenized into sub-word units and converted to numeric IDs. Those IDs are mapped to high-dimensional embedding vectors that capture semantic meaning, and positional encodings are added so the model knows token order. The resulting vectors flow through many Transformer blocks, where self-attention layers let each token weigh the relevance of the other tokens in the context, and feed-forward layers refine the representation. The hidden state at the last position is projected onto the entire vocabulary to produce one logit per token, and a softmax turns those logits into a probability distribution. A sampling strategy picks the next token from that distribution, the token is appended to the sequence, and the cycle repeats until a stop condition is met; the generated IDs are then decoded back to text. Everything up to the logits is deterministic matrix math, and randomness enters only at the sampling step. None of it is magic.
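The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the vocabulary `VOCAB` and the lookup table `TOY_LOGITS` are hypothetical stand-ins for the tokenizer and the Transformer, and the table looks only at the previous token, whereas a real model computes its logits from the whole context.

```python
import math
import random

# Hypothetical toy vocabulary; real models use tens of thousands of sub-word tokens.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]

# Stand-in for the Transformer: made-up logits keyed by the previous token ID.
TOY_LOGITS = {
    1: [-5.0, -5.0, 4.0, 0.0, -1.0],   # after "the"  -> mostly "cat"
    2: [-5.0, -5.0, -5.0, 4.0, 0.0],   # after "cat"  -> mostly "sat"
    3: [0.0, -5.0, -5.0, -5.0, 4.0],   # after "sat"  -> mostly "down"
    4: [4.0, -5.0, -5.0, -5.0, -5.0],  # after "down" -> mostly "<eos>"
}


def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10, greedy=True, seed=0):
    """Repeat: logits -> probabilities -> pick a token -> append, until <eos>."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[ids[-1]])
        if greedy:
            # Greedy decoding: always take the most probable token.
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Sampling: draw a token in proportion to its probability.
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if next_id == 0:  # stop condition: end-of-sequence token
            break
        ids.append(next_id)
    return ids


def decode(ids):
    """Map token IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)


print(decode(generate([1])))  # prints "the cat sat down"
```

With `greedy=True` the result is fully determined by the logits. Passing `greedy=False` draws from the distribution instead, which is where run-to-run variation comes from.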
Calling it 'next-word prediction' is a misnomer that hides the sub-word tokenization step; the model predicts tokens, which can be whole words, word fragments, punctuation, or pieces of code.
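To make the word-versus-token distinction concrete, here is a small sketch of sub-word splitting. The vocabulary `TOY_VOCAB` is invented for illustration, and the greedy longest-match rule is a simplification; production tokenizers learn their vocabularies from data and use more elaborate algorithms such as byte-pair encoding.

```python
# Hypothetical vocabulary of sub-word pieces; real tokenizers learn theirs from data.
TOY_VOCAB = {"token": 0, "ization": 1, "un": 2, "believ": 3, "able": 4, "!": 5, " ": 6}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split into vocabulary pieces (a simplification)."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


pieces = tokenize("unbelievable tokenization!")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['un', 'believ', 'able', ' ', 'token', 'ization', '!']
print(ids)     # [2, 3, 4, 6, 0, 1, 5]
```

Two words become seven tokens here, which is why token counts, not word counts, are the unit the model actually works in.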
Embedding vectors form a semantic space where analogies like 'king - man + woman ≈ queen' emerge purely from statistical training, not explicit rules.
Self-attention is not human-like focus but a vector-similarity mechanism: Query and Key dot products, scaled and passed through a softmax, determine relevance weights, and Value vectors carry the information that is aggregated using those weights.
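The Query, Key, and Value roles can be shown with a short sketch of single-head scaled dot-product attention for one query. The vectors below are hand-picked toy values (an assumption for illustration); in a real model they come from learned projections of the token representations, and a causal mask, omitted here, hides future positions.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Relevance of each position: Query-Key similarity, scaled by sqrt(dimension).
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    # Output: the weighted average of the Value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hand-picked toy vectors, one Key and one Value per context position.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

The query points in the same direction as the first key, so the first and third positions (whose keys overlap with it) receive more weight than the second, and the output leans toward their Value vectors.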
The same token can have a completely different internal representation depending on its surrounding context, which is why polysemy is handled without separate dictionary entries.
Temperature is not just a creativity slider; it divides the logits before the softmax, so low values sharpen the distribution and can make output repetitive, while high values flatten it and can make output incoherent.
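The effect of temperature is easy to see numerically. With logits z_i and temperature T, the probability of token i is exp(z_i / T) divided by the sum of exp(z_j / T) over all tokens j. The sketch below applies that formula to three made-up logits; the function name `softmax_with_temperature` is a hypothetical helper, not a library call.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up logits for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.5 the top token takes most of the probability mass, at T = 2.0 the three candidates are much closer together, and T = 1.0 leaves the model's distribution unchanged.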
Every API call to an LLM runs the model once per generated token, and each run pushes the context through the same stack of matrix multiplications before scoring a fixed vocabulary, which explains why costs scale with both input length and output length.
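A rough way to see the scaling is to count how many token positions pass through the model during one call. The sketch below assumes the simplest possible loop, in which every step re-reads the full context; `positions_processed` is a hypothetical helper. Inference servers typically cache intermediate results to avoid much of this repeated work, so treat the numbers as an illustration of the trend, not a pricing formula.

```python
def positions_processed(n_input, n_output):
    """Token positions fed through the model when every step re-reads the full context."""
    # Step k sees the n_input prompt tokens plus the k tokens generated so far.
    return sum(n_input + step for step in range(n_output))


print(positions_processed(100, 10))  # 1045
print(positions_processed(200, 10))  # 2045: a longer prompt costs more at every step
print(positions_processed(100, 20))  # 2190: more output means more steps, each longer
```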