What It Is

Tokenization converts raw text into discrete units (tokens) that a model can process. A token might be a word, a subword fragment, a character, or a byte — the choice defines what the model’s vocabulary looks like and how it handles rare or unseen words.
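To make the granularity choice concrete, here is a minimal sketch that splits one string at the word, character, and byte level. The sample string is an arbitrary illustration, and subword splitting is left out because it needs a learned vocabulary.

```python
# Illustrative sample; "\u00ef" is the precomposed "i with diaeresis" character.
text = "na\u00efve tokenizers"

# Word-level: split on whitespace.
word_tokens = text.split()

# Character-level: one token per Unicode code point.
char_tokens = list(text)

# Byte-level: one token per UTF-8 byte (integer values 0-255).
byte_tokens = list(text.encode("utf-8"))

print(len(word_tokens), len(char_tokens), len(byte_tokens))  # 2 16 17
```

The byte count exceeds the character count because the non-ASCII character takes two bytes in UTF-8. Finer granularity means a smaller vocabulary but longer sequences.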

Why It Matters

Language models don’t read text directly; they read sequences of integer IDs. The tokenizer decides what those IDs represent. Get it wrong and rare words become <UNK> black holes, non-Latin scripts get mangled, and your context window fills with single-character tokens. The tokenizer is the invisible layer between the world and the model.
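The <UNK> failure mode can be sketched with a word-level tokenizer. The vocabulary and the `encode_words` helper below are hypothetical and exist only for illustration; real vocabularies are learned from a corpus.

```python
# Hypothetical toy vocabulary, hand-written for illustration.
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}

def encode_words(text, vocab):
    """Map each whitespace-separated word to its ID, or to <UNK> if absent."""
    unk_id = vocab["<UNK>"]
    return [vocab.get(word, unk_id) for word in text.lower().split()]

ids = encode_words("The cat sat", vocab)
ids_rare = encode_words("The ocelot sat", vocab)

print(ids)       # [1, 2, 3]
print(ids_rare)  # [1, 0, 3]
```

Once "ocelot" has been mapped to ID 0, the model cannot distinguish it from any other out-of-vocabulary word.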

How It Works

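Modern LLMs typically use Byte Pair Encoding (BPE) or a close variant. BPE starts from a base vocabulary of individual characters (or raw bytes, in byte-level variants), then repeatedly merges the most frequent adjacent pair of symbols into a new symbol. The number of merges is fixed in advance, typically 10k–100k, and largely determines the final vocabulary size. The result is a vocabulary where common words get their own token and rare words decompose into familiar subword pieces. With a byte-level base vocabulary, no input is ever truly unknown: in the worst case a word falls back to individual bytes. A character-level base vocabulary gives that guarantee only for characters seen during training.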
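The merge loop described above can be sketched in a few dozen lines. This is a simplified, character-level illustration, not any particular library's implementation: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`, `encode_word`) and the toy corpus are assumptions, and it omits end-of-word markers, byte fallback, and the efficiency tricks real tokenizers use.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        words = merge_pair(words, best)
        merges.append(best)
    return merges

def encode_word(word, merges):
    """Segment one word by replaying the learned merges in order."""
    state = {tuple(word): 1}
    for pair in merges:
        state = merge_pair(state, pair)
    return list(next(iter(state)))

# Toy corpus (an assumption for illustration).
corpus = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                         # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode_word("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the corpus, yet it decomposes into two learned pieces rather than an unknown token. A character absent from the training corpus would simply remain a single-character symbol in this sketch.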

Key Sources

  • subword-units — the pieces BPE produces
  • vocabulary — the set of all tokens a model knows
  • compression — BPE repurposes a compression algorithm for word segmentation