What It Is
Tokenization converts raw text into discrete units (tokens) that a model can process. A token might be a word, a subword fragment, a character, or a byte — the choice defines what the model’s vocabulary looks like and how it handles rare or unseen words.
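To make the granularity choice concrete, here is a minimal sketch that splits the same string at the word, character, and byte level. The sample string and variable names are illustrative assumptions, and whitespace splitting stands in for a real word-level tokenizer.

```python
# Sample text with two non-ASCII characters (written as escapes so the
# precomposed code points are unambiguous).
text = "na\u00efve caf\u00e9"

# Word-level: split on whitespace (a simplification of real word tokenizers).
word_tokens = text.split()

# Character-level: one token per Unicode code point.
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte, so the base vocabulary
# never needs more than 256 entries.
byte_tokens = list(text.encode("utf-8"))

print(word_tokens)
print(char_tokens)
print(len(char_tokens), len(byte_tokens))
```

The byte sequence is longer than the character sequence because each accented letter takes two bytes in UTF-8. Coarser units give shorter sequences but a larger vocabulary; finer units give the reverse.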
Why It Matters
Language models don’t read text directly; they read sequences of integer IDs. The tokenizer decides what those IDs represent. Get it wrong and rare words become <UNK> black holes, non-Latin scripts get mangled, and your context window fills with single-character tokens. The tokenizer is the invisible layer between the world and the model.
How It Works
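Modern LLMs use Byte Pair Encoding (BPE) or a close variant. BPE starts from a base vocabulary of individual characters (or bytes), then repeatedly merges the most frequent adjacent pair of symbols into a new symbol. Each merge adds one vocabulary entry, and training stops after a fixed number of merges, typically 10k–100k. The result is a vocabulary where common words get their own token and rare words decompose into familiar subword pieces. With a byte-level base vocabulary, no input is ever truly unknown: in the worst case a word falls back to individual bytes. A character-level base can still miss characters that never appeared in the training data.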
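The merge loop can be sketched in a few lines of Python. This is a toy character-level version under stated assumptions, not how any particular library implements it: the function names (`train_bpe`, `merge_pair`, `encode`) and the small corpus are hypothetical, there is no end-of-word marker, and ties between equally frequent pairs go to whichever pair was counted first.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; real training sets contain billions of words.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

On this corpus the four learned merges build the symbols `est` and `low`, so the unseen word `lowest` is segmented into `low` + `est` rather than mapped to an unknown token. Production tokenizers usually work on bytes, pre-split text with a regular expression, and use much faster data structures, but the merge loop has the same shape.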
Related Concepts
- subword-units — the pieces BPE produces
- vocabulary — the set of all tokens a model knows
- compression — BPE repurposes a compression algorithm for word segmentation