The Next-Token Prediction Pipeline: From Raw Text to Transformer Output
Many people, on first hearing about large language models, feel that they seem to 'understand human language.' Ask one a question and it answers; ask it to write code and it completes the code; ask it to write an article and it keeps writing.
But from the model's internal perspective, what an LLM does can first be simplified into one sentence:
Based on the existing context, predict the next most likely Token.
Note that the model predicts the next Token, not simply the next word. A Token can be a word, a sub-word, a character, a punctuation mark, or even part of a word.
For example, if you input:
The capital of China is
The model internally calculates:
Beijing: 92%
Peiping: 4%
Chang'an: 2%
Shanghai: 0.5%
……
Then it selects a Token based on probability, such as 'Beijing.' Next, the model appends 'Beijing' to the original input and continues to predict the next Token.
This is what is called:
Autoregressive generation: generating Token by Token, one after another.
1. User Input Does Not Directly Enter the Model
User input is in natural language, for example:
I love artificial intelligence, natural language processing is very interesting
But the model cannot directly process Chinese, English, or punctuation. It first needs to split the text into Tokens.
This process is called:
Tokenization
The term is sometimes rendered as 'word segmentation,' but 'tokenization' is more accurate, because the resulting pieces are not always words.
The smallest unit a large model processes is usually not a 'word' in the linguistic sense, but a Token.
For example, in English:
unhappiness
Might not be treated as a complete word, but split into:
un + happi + ness
Another example:
tokenization
Might be split into:
token + ization
Chinese is similar.
For example, the Chinese sentence that translates to:
I love artificial intelligence, natural language processing is very interesting
In some tokenizers, it might be split into (pieces shown here in translation):
["I", "love", "artificial intelligence", ",", "natural language processing", "very", "interesting"]
Different models have different tokenizers, and the splitting method can also differ. For instance, OpenAI has its own tokenizer, and models like Qwen, LLaMA, and Claude also have their own.
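The splitting idea can be sketched with a toy greedy longest-match tokenizer. This is only an illustration under stated assumptions: the vocabulary `VOCAB` and the function `tokenize` are hypothetical, and real tokenizers (for example BPE-based ones) learn their vocabularies from data and use different algorithms. Note that the literal characters in 'unhappiness' give the piece 'happi', not 'happy'.

```python
# Toy greedy longest-match sub-word tokenizer (illustrative only).
# VOCAB is a hypothetical vocabulary, not taken from any real model.
VOCAB = {"un", "happi", "happy", "ness", "token", "ization"}

def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` left to right, always taking the longest known piece."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize("unhappiness", VOCAB))   # ['un', 'happi', 'ness']
print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
```

The single-character fallback is one simple way to make sure any input can be tokenized, even when it contains pieces the vocabulary has never seen.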
2. Why Not Process by Complete Words?
Because there are too many complete words.
If a model only recognized complete English words, it would need to memorize hundreds of thousands of words. If you add Chinese, Japanese, Korean, code, symbols, emojis, and specialized terminology, the vocabulary would become enormous.
This brings several problems:
First, a huge vocabulary means a larger embedding lookup table and a larger output layer.
Second, those extra parameters increase training and inference costs.
Third, the model struggles with new words, spelling variations, and specialized terms.
So a more reasonable approach is to break words into smaller 'building blocks.'
For example:
unhappiness = un + happi + ness
This way, the model doesn't need to memorize every complex word individually; it only needs to master a set of common sub-words to combine many expressions.
You can think of Tokens as the 'currency unit' in the LLM world.
We usually understand language by characters, words, and sentences, but models compute in Tokens, stream their output Token by Token, and are typically billed per Token.
3. Tokens Are Converted into Token IDs
After splitting into Tokens, the model converts each Token into a numerical ID, known as a Token ID.
For example:
"你" -> 57668
"好" -> 53901
These numbers themselves have no semantics.
57668 does not represent the meaning of 'you,' nor can you mathematically derive the ID for 'good' from it.
It is just an index.
You can think of a Token ID as a page number in a dictionary:
Token: 你
Token ID: 57668
Meaning: The Token '你' has the number 57668 in the vocabulary.
But this number alone does not allow the model to understand meaning. What truly gives the model semantics is the next step: Embedding.
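The Token-to-ID mapping is just a two-way table lookup, which can be sketched as follows. The vocabulary and its IDs are hypothetical and chosen only for illustration; the leading spaces inside the toy Tokens are an assumption that lets plain concatenation reproduce the original text.

```python
# Hypothetical toy vocabulary: the IDs are arbitrary indices, not real model IDs.
TOKEN_TO_ID = {"The": 0, " capital": 1, " of": 2, " China": 3, " is": 4, " Beijing": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}

def encode(tokens: list[str]) -> list[int]:
    """Look up the ID of each Token."""
    return [TOKEN_TO_ID[token] for token in tokens]

def decode(token_ids: list[int]) -> str:
    """Map IDs back to Tokens and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in token_ids)

ids = encode(["The", " capital", " of", " China", " is"])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # The capital of China is
```

Nothing about the number 3 says 'China'; swapping the IDs around would work equally well as long as the table is used consistently.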
4. Embedding: Turning Token IDs into Semantic Vectors
Token IDs are just numerical identifiers and cannot directly express semantics.
So the model uses a massive lookup table to convert a Token ID into a high-dimensional vector.
This lookup table is called:
Embedding Matrix
For example:
Token ID 57668 -> [0.12, -0.44, 0.87, ..., 0.03]
This vector might have 768, 1024, 4096 dimensions, or even higher. The specific dimension depends on the model architecture.
This process is called:
Embedding
Simply put:
Embedding is the process of turning a discrete Token ID into a continuous, high-dimensional semantic coordinate.
You can imagine a huge semantic space. Each Token is a point in this space.
Tokens with similar semantics usually have vectors that are closer together.
For example:
King, Queen, Man, Woman
There might be a directional relationship in the vector space:
King - Man + Woman ≈ Queen
This is not a rule hand-written into the model, but a geometric structure the model gradually learns during training on massive text data.
Another example:
Cat, Dog, Animal
They might be closer together.
While:
King, Apple
Might be farther apart.
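The lookup and the idea of 'closer together' can be sketched in a few lines. The 3-dimensional vectors below are invented for illustration (real models learn vectors with hundreds or thousands of dimensions), and cosine similarity is used here as one common way to measure closeness.

```python
import math

# Hypothetical embedding matrix: one row per Token ID.
# The values are made up; a real model learns them during training.
EMBEDDING_MATRIX = [
    [0.9, 0.8, 0.1],  # ID 0: "cat"
    [0.8, 0.9, 0.2],  # ID 1: "dog"
    [0.1, 0.2, 0.9],  # ID 2: "apple"
]

def embed(token_id: int) -> list[float]:
    """Embedding is a row lookup: Token ID in, vector out."""
    return EMBEDDING_MATRIX[token_id]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, dog, apple = embed(0), embed(1), embed(2)
print(cosine_similarity(cat, dog) > cosine_similarity(cat, apple))  # True
```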
5. Embedding Knows 'What' It Is, Not 'Where' It Is
With semantic vectors, the model knows roughly what each Token means.
But that's not enough.
Because the meaning of a sentence depends not only on the words but also on their order.
For example:
I bit the dog
The dog bit me
These two sentences use almost the same Tokens, but their meanings are completely different.
If the model only looked at the semantic vectors of the Tokens, it wouldn't know which word came first and which came later.
So we need to add positional information.
This is:
Position Encoding (also called positional encoding)
The model adds positional information to the Embedding of each Token, letting it know the position of this Token in the sequence.
So before entering the Transformer, the representation of each Token roughly contains two types of information:
Semantic information: What this Token is
Positional information: Where this Token is
It can be understood as:
Final input vector = Token Embedding + Position Encoding
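Of course, modern models implement positional information in many ways, such as absolute positional encoding, relative positional encoding, and RoPE (Rotary Position Embedding). Some of these, RoPE included, inject position inside the attention computation rather than adding a vector to the Embedding, so the formula above is a simplification. But for beginners, understanding that 'the model needs to know the order' is enough.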
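As one concrete instance of 'Token Embedding + Position Encoding', here is a minimal sketch of the fixed sinusoidal encoding from the original Transformer design. It is only one of the variants mentioned above, the function names are hypothetical, and the embedding dimension is assumed to be even.

```python
import math

def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Sinusoidal encoding for one position; `dim` is assumed to be even."""
    encoding = []
    for i in range(0, dim, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = pos / (10000 ** (i / dim))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding

def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the position vector to each Token embedding, element by element."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, sinusoidal_position(pos, dim))]
        for pos, vector in enumerate(embeddings)
    ]

# The same Token embedding at two positions ends up as two different vectors.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_positions = add_positions(same_token_twice)
print(with_positions[0])  # [0.5, 1.5, 0.5, 1.5]
print(with_positions[0] != with_positions[1])  # True
```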
6. The Transformer Truly Begins Processing Context
At this point, the input has been transformed from text into a sequence of vectors.
For example:
The capital of China is
Assuming it is split into five Tokens, it becomes:
[x1, x2, x3, x4, x5]
Each x is a high-dimensional vector containing both semantic and positional information.
Next, these vectors enter the Transformer Block.
A large language model usually does not have just one Transformer layer, but many stacked together.
For example:
Layer 1 Transformer Block
Layer 2 Transformer Block
Layer 3 Transformer Block
……
Layer N Transformer Block
Each layer further processes the Token representations.
The two core parts inside a Transformer Block are:
Self-Attention
Feed-Forward Network
You can initially understand them as:
Self-Attention: Lets each Token look at the context and decide which Tokens it should pay attention to.
Feed-Forward: Further processes the representation of each Token.
7. Self-Attention: How the Model Understands Contextual Relationships
Self-Attention is the soul of the Transformer.
The problem it solves is:
Which Tokens in the context should the current Token focus on?
For example, in this sentence:
The animal didn’t cross the street because it was too tired.
Who does it refer to here?
Is it animal, or street?
Humans can easily judge:
it = animal
Because 'tired' usually describes an animal, not a street.
But how does the model calculate this?
This relies on Self-Attention.
8. Q, K, V: The Three Roles of Self-Attention
In Self-Attention, the vector of each Token is transformed into three vectors:
Q = Query
K = Key
V = Value
You can first understand them in a relatable way:
Query: What am I looking for?
Key: What kind of Query should match me?
Value: If someone pays attention to me, what information can I provide?
For instance, for it, its Query might be searching for:
What object does this pronoun refer to?
For animal, its Key might contain:
I am an entity that can act and get tired.
For street, its Key might contain:
I am a location.
Then the model uses the Query of it to calculate similarity with the Key of every Token in the context.
This similarity is usually computed via a dot product:
score = Q · K
For example:
score(it, animal) = Q_it · K_animal
score(it, street) = Q_it · K_street
If:
score(it, animal) > score(it, street)
The model will pay more attention to animal.
Then the model normalizes these scores with Softmax into attention weights that sum to 1, and takes a weighted sum of the Values of all Tokens using those weights.
That is to say, the new representation of it will absorb more information from animal.
This is why Self-Attention allows the model to handle references, context, and long-distance dependencies.
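The 'it' example can be traced numerically with a minimal sketch of attention for a single Query. The 2-dimensional vectors are invented so that the Query of 'it' lines up with the Key of 'animal'; in a real model Q, K, and V come from learned projections. The scores are divided by the square root of the Key dimension, as in the formula given in section 19.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one Query over all Keys and Values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output

# Hypothetical vectors: index 0 is "animal", index 1 is "street".
q_it = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

weights, new_it = attend(q_it, keys, values)
print(weights[0] > weights[1])  # True: "it" attends more to "animal"
print(new_it[0] > new_it[1])    # True: its new vector leans towards "animal"
```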
9. A More Intuitive Example: Apple Phone and I Ate an Apple
The same word can have completely different meanings in different contexts.
For example:
The Apple phone is very easy to use
I ate an apple
The first 'Apple' is closer to a company or brand.
The second 'apple' is closer to a fruit.
Looking at the Token 'apple' alone, its initial Embedding might be fixed.
But after Self-Attention, it will change its representation based on the context.
In:
The Apple phone is very easy to use
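In a model that reads the whole sentence in both directions, 'Apple' will pay attention to words like 'phone' and 'easy to use,' so its representation leans more towards a brand or product. In a decoder-only LLM, each Token can attend only to itself and earlier Tokens, so this disambiguation happens at the later positions: 'phone' attends back to 'Apple,' and the representations from that point on reflect the brand reading.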
In:
I ate an apple
'Apple' will pay attention to 'ate,' so its representation leans more towards a fruit.
This is contextual modeling.
The model does not just look up a fixed dictionary; it dynamically updates the meaning of each Token within the context.
10. Multi-Head Attention: Not Just One Way to Look at a Sentence
In a real Transformer, there is usually not just one Attention mechanism, but multiple running in parallel.
This is called:
Multi-Head Attention
Why are multiple heads needed?
Because there are many kinds of relationships in a sentence.
Some attention heads might focus on grammatical relationships:
Subject and predicate
Some attention heads might focus on referential relationships:
'it' pointing to 'animal'
Some attention heads might focus on positional relationships:
The previous word, the next word
Some attention heads might focus on semantic categories:
Location, person, action, object
So Multi-Head Attention can be understood as:
The model observes contextual relationships from multiple perspectives simultaneously.
Finally, the results from these different attention heads are concatenated and fused to form a new Token representation.
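The 'split, attend separately, concatenate' pattern can be sketched as below. This is a simplification under stated assumptions: it slices the vectors into equal chunks, one per head, and leaves out the learned projections that a real Transformer applies before the split and after the concatenation. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one Query; returns the output vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one vector into `num_heads` equal slices."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        # Each head sees only its own slice, so it can focus on different Tokens.
        head_output = attend(q_heads[h], [k[h] for k in k_heads], [v[h] for v in v_heads])
        output.extend(head_output)  # concatenate the heads
    return output

query = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
out = multi_head_attend(query, keys, values, num_heads=2)
print(len(out))  # 4
```

In this toy input, the first head leans towards the first Token and the second head leans towards the second Token, which is the sense in which different heads can look at the same sentence from different perspectives.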
11. Feed-Forward: Deeper Processing for Each Token
After Self-Attention, each Token has absorbed contextual information.
Next, it enters the Feed-Forward Network.
It is usually a two-layer fully connected network with a non-linear activation, applied to the vector at each Token position independently.
You can roughly understand it as:
Self-Attention: Responsible for communicating with the context.
Feed-Forward: Responsible for deep processing of the current information.
For example, after Self-Attention, it has absorbed information from animal.
The Feed-Forward network will continue to process this representation, turning it into semantic features more suitable for the next layer.
The stacked Transformer Blocks repeat a similar process layer after layer:
Look at context
Process itself
Look at context again
Process itself again
……
The deeper the layers, the more abstract and richer the Token representations become.
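A minimal sketch of the Feed-Forward step, assuming the classic form of two linear layers with a ReLU in between. The weights are invented so the arithmetic is easy to follow, and the other components of a real Transformer Block are left out.

```python
def feed_forward(x, w1, b1, w2, b2):
    """Two-layer feed-forward network applied to one Token vector.

    w1 and w2 are lists of rows, one row of weights per output unit.
    """
    # First layer expands the vector, then ReLU zeroes out negative values.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b) for row, b in zip(w1, b1)]
    # Second layer projects back to the original size.
    return [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(w2, b2)]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B2 = [0.5, 0.5]

# The same network is applied to every position separately.
sequence = [[1.0, -2.0], [0.0, 3.0]]
processed = [feed_forward(x, W1, B1, W2, B2) for x in sequence]
print(processed[0])  # [1.5, 0.5]
```

Unlike Self-Attention, nothing in `feed_forward` looks at neighbouring positions, which is why the two steps are described as 'communicating' versus 'processing'.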
12. After Many Transformer Layers, the Model Begins Predicting the Next Token
Assume the input is:
The capital of China is
After Tokenizer, Embedding, Position Encoding, and many Transformer layers, the model gets the hidden vector of the last position.
This hidden vector contains the information from the preceding context.
It roughly expresses:
Based on the context 'The capital of China is,' the next position should be a Token representing the name of a capital.
But it is not text yet.
The model still needs to convert this hidden vector into scores for the entire vocabulary.
This step is usually done through a linear layer.
13. Logits: Scoring Every Token in the Vocabulary
Assume the model's vocabulary size is 50,000.
Then the model outputs a 50,000-dimensional vector.
Each dimension corresponds to the raw score of a Token.
This score is called:
logit
For example:
Beijing: 12.8
Peiping: 9.7
Chang'an: 9.0
Shanghai: 7.6
Apple: -1.4
……
Note, a logit is not yet a probability.
It is just the raw score the model gives to each candidate Token.
The higher the score, the more suitable the model thinks this Token is for the next position.
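The linear layer that produces the logits is a matrix multiplication between the hidden vector and one weight row per vocabulary entry. Here is a minimal sketch with a hypothetical 2-dimensional hidden vector and a 3-Token vocabulary; real models use thousands of dimensions and tens of thousands of Tokens, and the bias term is omitted here as an assumption.

```python
def project_to_vocab(hidden, output_weights):
    """Compute one raw score (logit) per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]

# Hypothetical numbers: one weight row per Token in a 3-Token vocabulary.
hidden_state = [1.0, 2.0]
OUTPUT_WEIGHTS = [
    [1.0, 1.0],   # Token 0
    [0.0, 1.0],   # Token 1
    [-1.0, 0.0],  # Token 2
]

logits = project_to_vocab(hidden_state, OUTPUT_WEIGHTS)
print(logits)  # [3.0, 2.0, -1.0]
```

The output has exactly as many entries as the vocabulary, which is why a 50,000-Token vocabulary yields a 50,000-dimensional logit vector.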
14. Softmax: Turning Scores into Probabilities
To choose one from these candidate Tokens, the model needs to convert the logits into probabilities.
This step uses:
Softmax
Softmax converts all scores into a probability distribution, and all probabilities sum up to 1.
For example:
Beijing: 92%
Peiping: 4%
Chang'an: 2%
Shanghai: 0.5%
Others: 1.5%
Now the model knows:
The next Token is most likely 'Beijing'
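Softmax itself is short enough to write out. The sketch below uses hypothetical logits for five candidate Tokens, chosen so that the result roughly resembles the percentages above; with a real 50,000-Token vocabulary the remaining probability mass would be spread over all the other entries.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a five-Token vocabulary.
logits = {"Beijing": 12.8, "Peiping": 9.7, "Chang'an": 9.0, "Shanghai": 7.6, "Apple": -1.4}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in probs.items():
    print(f"{token}: {p:.2%}")
```

Because of the exponential, a gap of a few points between logits becomes a very large gap between probabilities.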
15. Sampling Strategies: Why Doesn't the Model Always Pick the Highest-Probability Token?
The simplest way is:
Always choose the Token with the highest probability.
This is called greedy decoding.
For example:
Beijing: 92%
Peiping: 4%
Chang'an: 2%
Then directly choose 'Beijing'.
This method is deterministic, but it can make the output rigid and repetitive.
So many models use sampling strategies.
Common parameters include:
temperature
top-k
top-p
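A simple understanding:
Low temperature: More conservative, more deterministic.
High temperature: More random, more creative.
top-k: Sample only from the k most probable Tokens.
top-p: Sample only from the smallest set of top Tokens whose cumulative probability reaches p.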
For example, for writing code or doing math problems, you usually want a lower temperature.
For writing novels, brainstorming, or ad copy, the temperature can be higher.
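The three parameters can be combined in one small sampling function. This is a sketch under stated assumptions rather than any particular library's implementation: the function name and defaults are hypothetical, a temperature of 0 is treated as greedy decoding, and the filters are applied in the order top-k, then top-p.

```python
import math
import random

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the next Token chosen from `logits`."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring Token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most probable Tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest set reaching probability p
                break
        ranked = kept
    # random.choices renormalizes the remaining weights internally.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

logits = [5.0, 1.0, 0.0]
print(sample_next(logits, temperature=0))  # 0
print(sample_next(logits, top_k=1))        # 0
```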
16. Decode: Turning Token IDs Back into Text
After the model selects the next Token, what it has is still a Token ID, not text.
For example:
Token ID: 12345
Then the tokenizer's decode process converts it back to text:
12345 -> Beijing
So the output becomes:
The capital of China is Beijing
Next, the model appends 'Beijing' to the context and continues to predict the next Token.
That is:
The capital of China is Beijing
This is fed back into the model to predict the next one:
.
Thus:
The capital of China is Beijing.
This process repeats until the model generates an end-of-sequence token or reaches the maximum output length.
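The generate-append-repeat loop can be sketched with a stand-in for the model. Everything here is hypothetical: `toy_model` is a hard-coded lookup on the last Token rather than a neural network, the probabilities are invented (and do not list every candidate), and greedy selection is used for simplicity.

```python
EOS = "<eos>"  # hypothetical end-of-sequence Token

# Stand-in for a trained model: next-Token probabilities keyed by the last Token.
NEXT_TOKEN_TABLE = {
    " is": {" Beijing": 0.92, " Peiping": 0.04, " Chang'an": 0.02},
    " Beijing": {".": 0.90, ",": 0.05},
    ".": {EOS: 1.0},
}

def toy_model(context: list[str]) -> dict[str, float]:
    """Return a probability for each candidate next Token."""
    return NEXT_TOKEN_TABLE.get(context[-1], {EOS: 1.0})

def generate(prompt_tokens: list[str], max_new_tokens: int = 10) -> str:
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = toy_model(context)
        next_token = max(probs, key=probs.get)  # greedy choice
        if next_token == EOS:
            break  # the model signalled that it is done
        context.append(next_token)  # feed the new Token back in
    return "".join(context)

prompt = ["The", " capital", " of", " China", " is"]
print(generate(prompt))  # The capital of China is Beijing.
```

Both stopping conditions from the text appear in the loop: the end-of-sequence Token triggers `break`, and `max_new_tokens` bounds the number of iterations.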
17. Complete Process Summary
Now let's string the whole process together.
User input:
The capital of China is
Step 1, Tokenizer splits into Tokens (an illustrative split):
["The", "capital", "of", "China", "is"]
Step 2, Tokens converted to Token IDs (illustrative values):
[101, 234, 8976, 322, 57]
Step 3, Token IDs look up the Embedding Matrix to get semantic vectors:
[Vector1, Vector2, Vector3, Vector4, Vector5]
Step 4, add Position Encoding:
[Semantic + Position]
Step 5, enter multiple Transformer Blocks:
Self-Attention: Models contextual relationships
Feed-Forward: Deeply processes semantic representations
Step 6, the last layer outputs hidden states.
Step 7, a linear layer maps to the entire vocabulary:
Logits for 50,000 Tokens
Step 8, Softmax converts to probabilities:
Beijing 92%, Peiping 4%, Chang'an 2%……
Step 9, select the next Token based on a sampling strategy:
Beijing
Step 10, Decode into text, append to context:
The capital of China is Beijing
Then the loop continues.
18. A One-Sentence Version for Beginners
The workflow of an LLM can be understood as:
Text input
↓
Split into Tokens
↓
Tokens become numerical IDs
↓
IDs look up a table to become semantic vectors
↓
Add positional information
↓
Transformer understands context via Self-Attention
↓
Predict the probability of the next Token
↓
Select a Token
↓
Decode into text
↓
Append to context, continue predicting
It can also be shorter:
An LLM does not write a complete answer all at once, but constantly predicts the next Token based on the preceding text.
19. A Technical Summary for Readers with a Background
From a technical perspective, the inference process of a Decoder-only Transformer is roughly:
input text
→ tokenizer
→ token ids
→ token embeddings + positional encoding
→ stacked decoder transformer blocks
→ final hidden states
→ linear projection to vocab size
→ logits
→ softmax / sampling
→ next token id
→ decode
→ append and repeat
In each Transformer layer, the core computation includes:
Q = XWq
K = XWk
V = XWv
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Where:
Q: What information the current Token wants to find
K: How each Token can be matched
V: The actual information content each Token provides
Self-Attention allows each Token to update its representation based on the context. In a Decoder-only model, a causal mask restricts each position to itself and the positions before it.
After stacking multiple layers, the model's hidden state contains rich contextual semantic information.
Finally, the output layer maps it to the vocabulary dimension to get a logit for each candidate Token, softmax turns the logits into probabilities, and greedy decoding or a sampling strategy selects the next Token.
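The formulas above can be sketched for a whole sequence in plain Python. One detail is added as an assumption that the formula does not show: in a Decoder-only model, each position attends only to itself and earlier positions (causal masking), which is implemented here by simply not scoring later positions. The weight matrices are hypothetical identity matrices, and only a single head is shown.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x, wq, wk, wv):
    """Single-head softmax(QK^T / sqrt(d_k)) V with a causal restriction."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    outputs, all_weights = [], []
    for i, q_i in enumerate(q):
        # Position i scores only positions 0..i, never later ones.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical input: 3 Token vectors of dimension 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

outputs, all_weights = causal_self_attention(X, IDENTITY, IDENTITY, IDENTITY)
print([len(w) for w in all_weights])  # [1, 2, 3]
print(outputs[0])                     # [1.0, 0.0]
```

The first position can only attend to itself, so its output equals its own Value vector; later positions mix in progressively more context.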
20. A Few Points Easily Misunderstood
1. An LLM does not directly 'understand text'
It processes Token IDs and vectors.
Natural language is just the form at the input and output levels.
The model's internals are mainly doing matrix calculations, vector transformations, and probability predictions.
2. A Token is not equal to a word
A Token can be:
A character
A word
A sub-word
A punctuation mark
A space
A code snippet
So saying 'predict the next word' is just for easy understanding; the more accurate statement is:
Predict the next Token.
3. Token IDs themselves have no semantics
For example:
"你" -> 57668
This 57668 is just a number; it does not represent the meaning of 'you'.
What truly contains semantics is:
The Embedding vector
4. Embeddings are not hand-written dictionary definitions
Embeddings are not manually defined by humans as:
Apple = fruit / company
They are high-dimensional spatial structures trained by the model from massive text data.
5. Self-Attention is not as simple as 'focusing attention'
It is not attention in the human visual sense, but a vector matching mechanism.
The essence is:
Use Query and Key to calculate relevance
Then use the relevance to weight the Value
Conclusion: The Essence of an LLM is a Contextual Probability Machine
From the outside, an LLM seems to chat, write, reason, and program.
But from the inside, its fundamental action is very uniform:
Turn context into vector representations, then predict the probability distribution of the next Token.
The reason it appears to 'understand language' is that during large-scale training, it has learned the complex statistical relationships between language, knowledge, grammar, semantics, code, and reasoning patterns.
So, the core of an LLM is not simply memorizing answers, but compressing context into semantic representations within a massive parameter space, and then generating the most suitable next Token step by step.
One sentence summary:
Tokens are the input units, Embeddings are semantic coordinates, Position Encoding provides order, Self-Attention establishes contextual relationships, the Transformer processes information, Softmax gives probabilities, a sampling strategy selects the next Token, and finally, continuous looping generates a complete answer.