What Is a Token? Two Analogies That Make It Click
Token consumption directly drives API costs and limits what fits in a model's context window. Understanding tokenization—especially how different languages and text structures consume tokens unevenly—is essential for building efficient, cost-effective AI applications. These analogies make the abstract mechanics of BPE and context windows immediately actionable.
Tokens are the smallest units an AI model processes, but their behavior is unintuitive: a single English word can be split into several tokens, while a punctuation mark usually counts as a token of its own. This post uses two vivid analogies to make the concept concrete.
In the eating analogy, each bite is a token, the meal is the input, and the stomach is the context window. A small tomato ("bug") takes one bite; a giant burger ("fix the bug") takes three; a hard-shell crab (Chinese text) takes many. The stomach can only hold so much: exceed it, and the AI "vomits" earlier information, which looks like amnesia because the oldest parts of the conversation are no longer in the window.
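The "full stomach" behavior can be sketched as a sliding window over the conversation. This is a minimal illustration, not how any particular provider trims context: `rough_token_count` is a hypothetical stand-in that counts whitespace-separated words, and `fit_to_window` is a name invented for this example.

```python
from collections import deque

def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): one token per word."""
    return len(text.split())

def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined rough count fits max_tokens."""
    kept: deque[str] = deque()
    used = 0
    # Walk backwards from the newest message and stop at the first one that does not fit.
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > max_tokens:
            break
        kept.appendleft(message)
        used += cost
    return list(kept)

history = [
    "my name is Ada",        # 4 rough tokens
    "fix the bug",           # 3 rough tokens
    "now add a unit test",   # 5 rough tokens
]

print(fit_to_window(history, max_tokens=8))
# → ['fix the bug', 'now add a unit test']
```

With a budget of 8, the oldest message no longer fits, so the model would never see the user's name. That is the "amnesia" the analogy describes.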
The skewer analogy reframes the same idea: the bamboo stick is the context window, and each piece of meat is a token. Small pieces (simple English) pack many per stick; larger pieces (complex words or non-English scripts like Chinese) take up more space. Wasteful behaviors, such as padding with polite phrases, demanding long outputs, or mixing languages, are like loading a skewer with fat instead of lean meat.
Practical tips emerge naturally: batch inputs, state requirements clearly upfront, avoid repeated regeneration, and start a fresh conversation when the context fills up. The underlying algorithm is Byte Pair Encoding (BPE), which builds a vocabulary by repeatedly merging the most frequent adjacent pair of symbols and so determines how text gets split into tokens.
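To make the BPE idea concrete, here is a toy sketch of the merge loop. It is a simplification under stated assumptions: it starts from characters rather than bytes, ignores word frequencies, and uses a four-word corpus invented for illustration. The function names are hypothetical and do not come from any tokenizer library.

```python
from collections import Counter
from typing import Optional

def most_frequent_pair(words: list[list[str]]) -> Optional[tuple[str, str]]:
    """Return the most common adjacent symbol pair across all words, if any."""
    pairs: Counter = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus: list[str], num_merges: int):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words

merges, words = train_bpe(["bug", "bugs", "debug", "hug"], num_merges=2)
print(merges)  # → [('u', 'g'), ('b', 'ug')]
print(words)   # → [['bug'], ['bug', 's'], ['d', 'e', 'bug'], ['h', 'ug']]
```

After two merges, "bug" has become a single symbol because its pieces appear together so often, while the rarer "debug" still takes three. Frequent strings get cheap; rare ones stay expensive.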
The eating and skewer analogies reveal that token efficiency is fundamentally about information density ("lean meat" over "fat"), not just prompt length.
With many tokenizers, Chinese text consumes noticeably more tokens than English for the same meaning, though the size of the gap depends on the tokenizer's vocabulary. For developers building multilingual applications, that difference shows up directly in API costs.
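One reason for the gap can be sketched without any tokenizer at all, assuming a byte-level BPE tokenizer (the variant used by GPT-2-style models). Such a tokenizer starts from UTF-8 bytes, so the byte count is the worst-case token count before any learned merges apply. The Chinese phrase is a rough equivalent chosen for illustration, the price is a placeholder rather than any provider's real rate, and real token counts depend on the merges a given tokenizer has learned.

```python
def utf8_byte_count(text: str) -> int:
    """UTF-8 byte length: an upper bound on tokens for a byte-level BPE tokenizer."""
    return len(text.encode("utf-8"))

def estimated_cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Linear cost model (assumption): tokens times a flat per-million price."""
    return token_count * usd_per_million_tokens / 1_000_000

english = "fix the bug"      # 11 characters, 11 bytes
chinese = "修复这个错误"      # 6 characters, 18 bytes (3 bytes per character)

HYPOTHETICAL_USD_PER_MILLION = 1.0  # placeholder price, not a real rate

for label, text in [("English", english), ("Chinese", chinese)]:
    worst_case_tokens = utf8_byte_count(text)
    cost = estimated_cost_usd(worst_case_tokens, HYPOTHETICAL_USD_PER_MILLION)
    print(f"{label}: {len(text)} chars, at most {worst_case_tokens} tokens, "
          f"at most ${cost:.8f}")
```

The Chinese phrase is shorter on screen but starts from more raw bytes, so it needs more learned merges to reach the same token count as the English one.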
Context window limits are not just a technical constraint but a design constraint: they force developers to think about information prioritization and conversation structure.
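One way to picture information prioritization is as a packing problem: given a token budget, keep the most important pieces and drop the rest. The sketch below is one possible greedy approach under stated assumptions, not a recommended production strategy. `Snippet`, its `priority` scale, and the word-based `rough_token_count` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    priority: int  # higher means more important (hypothetical scale)

def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): one token per word."""
    return len(text.split())

def select_for_context(snippets: list[Snippet], max_tokens: int) -> list[Snippet]:
    """Greedily keep the highest-priority snippets that fit, in original order."""
    by_priority = sorted(enumerate(snippets), key=lambda pair: -pair[1].priority)
    chosen_indices = []
    used = 0
    for index, snippet in by_priority:
        cost = rough_token_count(snippet.text)
        if used + cost <= max_tokens:
            chosen_indices.append(index)
            used += cost
    # Restore the original ordering so the prompt still reads naturally.
    return [snippets[i] for i in sorted(chosen_indices)]

snippets = [
    Snippet("You are a code reviewer", 3),
    Snippet("Thanks so much in advance for your kind help", 1),
    Snippet("Fix the failing login test", 3),
    Snippet("The test file is tests/test_login.py", 2),
]

for snippet in select_for_context(snippets, max_tokens=16):
    print(snippet.text)
```

With a budget of 16 rough tokens, the polite padding is the piece that gets dropped, which is the "lean meat over fat" idea applied mechanically.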
The advice to "start a new conversation" when the context fills up is a practical workaround, but it also highlights a real limitation of current transformer-based models: they can only attend to what fits in a fixed-size context window.