
The Next-Token Prediction Pipeline: From Raw Text to Transformer Output

By 东风破_ · Jul 2, 2026

Read original on juejin.cn

Demystifying the token-to-token pipeline replaces the illusion of 'understanding' with a concrete mental model of matrix operations and probability. That clarity helps developers debug generation, choose sampling parameters, and reason about costs, latency, and failure modes.

Summary

An LLM's core loop is deceptively simple: given a sequence of tokens, predict a probability distribution over the next token and pick one. Under the hood, raw text is tokenized into sub-word units and converted to numeric IDs. Those IDs are mapped to high-dimensional embedding vectors that capture semantic meaning, and positional encodings are added so the model knows word order. The resulting vectors flow through many Transformer blocks, where self-attention layers let each token weigh the relevance of the other tokens in its context (during generation, the tokens that precede it), and feed-forward layers refine the representation. The hidden state at the last position is projected across the entire vocabulary to produce logits, which a softmax turns into a probability distribution. A sampling strategy picks the next token, the output is decoded back to text, and the cycle repeats until a stop condition is met. Every step up to the softmax is deterministic matrix math; randomness enters only through sampling. None of it is magic.

Takeaways

— An LLM does not write a full answer at once; it repeatedly predicts the next token based on the preceding context.

— Tokenization splits text into sub-word units, not full words, to keep vocabulary sizes manageable and handle rare terms.

— Token IDs are just indices; semantic meaning comes from embedding vectors learned during training.

— Positional encoding is added to embeddings so the model can distinguish word order, not just word identity.

— Self-attention computes how much each token should attend to every other token by comparing Query and Key vectors, then aggregates Value vectors accordingly.

— Multi-head attention runs several attention operations in parallel, allowing the model to capture different relationship types simultaneously.

— Feed-forward networks process each token's representation independently after attention, deepening the semantic features.

— After the final Transformer layer, a linear projection maps the hidden state to logits for every token in the vocabulary.

— Softmax converts raw logits into a probability distribution, and sampling strategies like temperature, top-k, and top-p control randomness.

— The selected token ID is decoded back into text, appended to the input, and the whole process repeats autoregressively.
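
The top-k and top-p strategies named in the takeaways above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function names `top_k_filter` and `top_p_filter` and the example logits are made up for this sketch.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

# Hypothetical logits for a four-token vocabulary.
probs = softmax([2.0, 1.0, 0.5, -1.0])
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

After filtering, a token ID can be drawn from the remaining distribution, for example with `random.choices(range(len(filtered)), weights=filtered)`.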

Conclusions

Calling it 'next-word prediction' is a misnomer that hides the sub-word tokenization step; the model predicts tokens, which can be punctuation, word fragments, or code snippets.

Embedding vectors form a semantic space where analogies like 'king - man + woman ≈ queen' emerge purely from statistical training, not explicit rules.

Self-attention is not human-like focus but a vector-similarity mechanism: scaled dot products between Query and Key vectors, normalized by a softmax, determine relevance weights, and each token's output is the correspondingly weighted sum of the Value vectors.

The same token can have a completely different internal representation depending on its surrounding context, which is why polysemy is handled without separate dictionary entries.

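Temperature is not just a creativity slider; the logits are divided by it before the softmax, so it directly reshapes the distribution. Values well below 1 sharpen the distribution and can make output repetitive, while values well above 1 flatten it and can make output incoherent.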
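
To make the effect of temperature concrete, here is a small sketch that divides the logits by the temperature before applying softmax. The function name and the example logits are illustrative assumptions, not taken from the original article.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by a small temperature widens the gaps between logits;
    # dividing by a large one shrinks them.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running it shows the top token's probability rising as the temperature drops and the distribution flattening as it rises.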

Every API call to an LLM runs this loop of matrix multiplications once per generated token, and each pass has to account for the whole context so far, which explains why costs scale with both input length and output length.
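
A rough way to see the scaling is to count how many token positions pass through the model during one call. The sketch below assumes the simplest possible loop, in which every step re-reads the full context; production systems typically cache intermediate results, so real costs differ. The function name is hypothetical.

```python
def positions_processed(prompt_len, output_len):
    """Token positions fed through the model when every step re-reads the full context."""
    total = 0
    for step in range(output_len):
        total += prompt_len + step  # context grows by one token per generated token
    return total

print(positions_processed(prompt_len=100, output_len=10))  # 1045
print(positions_processed(prompt_len=200, output_len=10))  # 2045
print(positions_processed(prompt_len=100, output_len=20))  # 2190
```

Doubling either the prompt or the output roughly doubles the work in this naive model, which matches the intuition that both lengths matter.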

Concepts & terms

Tokenization

The process of splitting raw text into smaller units called tokens, which can be words, sub-words, characters, or punctuation. Sub-word schemes keep the vocabulary size bounded and can still represent out-of-vocabulary terms as sequences of smaller pieces.
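
As an illustration, the sketch below splits text with a greedy longest-match rule against a tiny hand-written vocabulary. Both the vocabulary and the matching rule are assumptions chosen for brevity; production tokenizers such as BPE learn their vocabulary from data and differ in detail.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No entry matches: fall back to a single character as an unknown piece.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary mapping each piece to its token ID.
vocab = {"token": 0, "ization": 1, "un": 2, "believ": 3, "able": 4, " ": 5, "is": 6}
pieces = tokenize("tokenization is unbelievable", vocab)
ids = [vocab.get(p, -1) for p in pieces]  # -1 marks an unknown piece
print(pieces)
print(ids)
```

Neither "tokenization" nor "unbelievable" is in the vocabulary, yet both are represented by combining smaller pieces.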

Embedding

A learned mapping from a discrete token ID to a dense, high-dimensional vector that captures semantic meaning. Similar tokens end up close together in this vector space.
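
The lookup itself is just row indexing into a table. The sketch below uses a hand-made 3-dimensional table to show the idea; the vectors are invented for illustration, whereas real tables are learned during training and are far wider.

```python
import math

# Hypothetical embedding table: row index = token ID.
embedding_table = [
    [0.9, 0.1, 0.0],  # ID 0: "cat"
    [0.8, 0.2, 0.1],  # ID 1: "dog"
    [0.0, 0.1, 0.9],  # ID 2: "car"
]

def embed(token_ids):
    # An embedding lookup is plain indexing: one row per token ID.
    return [embedding_table[i] for i in token_ids]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, car = embed([0, 1, 2])
print(cosine_similarity(cat, dog))  # close to 1: similar direction
print(cosine_similarity(cat, car))  # close to 0: nearly orthogonal
```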

Positional Encoding

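A technique that adds information about a token's position in a sequence to its embedding, allowing the model to distinguish word order. Without it, self-attention is permutation-equivariant: shuffling the input tokens would only shuffle the outputs in the same way, so order would carry no signal.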
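
One well-known scheme is the fixed sinusoidal encoding from the original Transformer design, sketched below. It is only one option (many models use learned or rotary position schemes instead), and the function names here are illustrative.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sine on even dimensions, cosine on odd ones, with geometrically spaced wavelengths."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings):
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]

# The same token embedding placed at positions 0 and 1.
same_token = [0.5, 0.5, 0.5, 0.5]
out = add_positions([same_token, same_token])
print(out[0])
print(out[1])
```

The two rows differ even though the token is identical, which is exactly the signal the model needs to tell positions apart.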

Self-Attention

A mechanism where each token computes Query, Key, and Value vectors. It determines how much to focus on other tokens by comparing Queries and Keys, then aggregates their Values.
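
The computation can be sketched as scaled dot-product attention. The version below assumes the Query, Key, and Value vectors have already been computed (the learned projections are omitted) and applies a causal mask, as is usual when generating text, so each position only sees itself and earlier positions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t only sees positions <= t."""
    d_k = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Compare this token's Query with the Key of each visible token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible Value vectors.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Hypothetical 2-dimensional vectors for a three-token sequence.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = causal_self_attention(q, k, v)
for row in result:
    print([round(x, 3) for x in row])
```

The first token can only attend to itself, so its output equals its own Value vector; later tokens blend the Values of everything before them.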

Multi-Head Attention

Running multiple self-attention operations in parallel with different learned projections, enabling the model to capture various types of relationships simultaneously.
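
A common way to realize this is to slice each projected vector into equal chunks, run attention on each chunk separately, and concatenate the results. The sketch below shows only that split-attend-concatenate step; it assumes the learned input and output projections have been applied elsewhere, omits masking, and uses hypothetical function names.

```python
import math

def attention(q, k, v):
    """Plain (unmasked) scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out

def split_heads(x, num_heads):
    """Slice each vector into num_heads equal chunks; returns one sequence per head."""
    head_dim = len(x[0]) // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x] for h in range(num_heads)]

def multi_head_attention(q, k, v, num_heads):
    per_head = [
        attention(qh, kh, vh)
        for qh, kh, vh in zip(
            split_heads(q, num_heads), split_heads(k, num_heads), split_heads(v, num_heads)
        )
    ]
    # Concatenate the head outputs back into one vector per token.
    return [sum((head[t] for head in per_head), []) for t in range(len(q))]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(x, x, x, num_heads=2))
```

Each head computes its own attention weights from its own slice, so different heads can weight the same tokens differently.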

Feed-Forward Network

A neural network, typically two linear layers with a nonlinearity between them, applied independently to each token's representation after attention to further process the contextualized features.
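
A minimal sketch of such a position-wise network follows. It assumes two linear layers with a ReLU between them and uses made-up weights; real models often use other activations and much wider hidden layers.

```python
def linear(x, weight, bias):
    """y = x @ weight + bias for one vector x; weight is [in_dim][out_dim]."""
    return [
        sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(tokens, w1, b1, w2, b2):
    # The same weights are applied to every token; no information crosses positions here.
    return [linear(relu(linear(t, w1, b1)), w2, b2) for t in tokens]

# Hypothetical weights: expand 2 -> 4 dimensions, then project back 4 -> 2.
w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
out = feed_forward(tokens, w1, b1, w2, b2)
print(out)
```

The first and third tokens have identical inputs and therefore identical outputs, regardless of what sits between them.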

Logits

The raw, unnormalized scores output by the model for each token in the vocabulary before being converted to probabilities.
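
Producing logits amounts to one dot product per vocabulary entry. The sketch below uses an invented three-word vocabulary and a made-up projection matrix (called `unembedding` here) to illustrate the shape of the operation.

```python
def project_to_logits(hidden_state, unembedding):
    """One raw score per vocabulary entry: dot product of the hidden state with each row."""
    return [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]

# Hypothetical vocabulary and projection weights (one row per vocabulary entry).
vocab = ["the", "cat", "sat"]
unembedding = [[0.2, 0.1], [1.0, -0.5], [-0.3, 0.8]]
hidden_state = [1.0, 2.0]  # final hidden state at the last position

logits = project_to_logits(hidden_state, unembedding)
best = vocab[max(range(len(logits)), key=lambda i: logits[i])]
print(logits)
print(best)
```

The scores can be any real numbers, positive or negative; they only become probabilities after a softmax.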

Softmax

A function that converts a vector of raw scores into a probability distribution where every value is positive and all values sum to 1, so that a next token can be sampled from it.
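
A short sketch of softmax follows. It subtracts the maximum logit before exponentiating, a standard trick that leaves the result unchanged but avoids overflow on large scores; the example inputs are arbitrary.

```python
import math

def softmax(logits):
    # Shifting every logit by the same constant does not change the output,
    # so subtracting the max keeps math.exp from overflowing.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([0.0, 1.0, 2.0])
large = softmax([1000.0, 1001.0, 1002.0])  # math.exp(1000.0) alone would overflow
print([round(p, 4) for p in small])
print([round(p, 4) for p in large])
```

Both calls print the same distribution, because only the differences between logits matter.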

Autoregressive Generation

A generation mode where the model predicts one token, appends it to the input, and feeds the extended sequence back in to predict the next token, repeating until completion.
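
The loop can be sketched with a stand-in for the model. Here `toy_model` is a hypothetical lookup table keyed by the last token only, and the next token is chosen greedily; a real LLM conditions on the whole context and usually samples rather than always taking the top score.

```python
# Hypothetical vocabulary and next-token logits, standing in for a trained model.
VOCAB = ["<eos>", "the", "cat", "sat"]
NEXT_LOGITS = {
    1: [0.0, 0.0, 3.0, 1.0],  # after "the": prefer "cat"
    2: [0.0, 0.0, 0.0, 3.0],  # after "cat": prefer "sat"
    3: [3.0, 0.0, 0.0, 0.0],  # after "sat": prefer "<eos>"
}
EOS_ID = 0

def toy_model(token_ids):
    # Returns one logit per vocabulary entry for the next position.
    return NEXT_LOGITS[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        if next_id == EOS_ID:
            break  # stop condition: end-of-sequence token
        ids.append(next_id)  # feed the extended sequence back in
    return ids

output_ids = generate([1])
print(" ".join(VOCAB[i] for i in output_ids))  # the cat sat
```

The two stop conditions shown, an end-of-sequence token and a cap on new tokens, are the usual ones; the cap guards against a model that never emits the stop token.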
