Tokenization

What It Is

Tokenization converts raw text into discrete units (tokens) that a model can process. A token might be a word, a subword fragment, a character, or a byte — the choice defines what the model’s vocabulary looks like and how it handles rare or unseen words.
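As a rough illustration of how the choice of unit changes the token sequence, the sketch below splits one string three ways using only built-in Python. The sample string is arbitrary, and subword splitting is left out because it needs a learned vocabulary.

```python
# Illustrative only: three granularities for the same text.
text = "na\u00efve tokenizer"  # "naïve tokenizer", with a precomposed ï

word_tokens = text.split()                 # whitespace-delimited words
char_tokens = list(text)                   # one token per Unicode character
byte_tokens = list(text.encode("utf-8"))   # one token per UTF-8 byte (ints 0-255)

print(word_tokens)        # 2 word tokens
print(len(char_tokens))   # 15 character tokens
print(len(byte_tokens))   # 16 byte tokens, since ï takes two bytes in UTF-8
```

Finer units shrink the vocabulary (at most 256 entries for bytes) but lengthen the sequence the model has to read.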

Why It Matters

Language models don’t read text directly; they read sequences of integer IDs. The tokenizer decides what those IDs represent. Get it wrong and rare words become <UNK> black holes, non-Latin scripts get mangled, and your context window fills with single-character tokens. The tokenizer is the invisible layer between the world and the model.
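The trade-off can be made concrete with a toy example. The vocabulary and the `encode_words` helper below are hypothetical, invented for illustration; the point is only that a fixed word-level vocabulary loses unseen words, while a byte-level fallback keeps them at the cost of more tokens.

```python
# Hypothetical toy word-level vocabulary with an <UNK> entry.
UNK = "<UNK>"
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3}

def encode_words(text):
    """Map each whitespace-delimited word to its ID, or to <UNK> if unseen."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]

known_ids = encode_words("the cat sat")        # [1, 2, 3]
unseen_ids = encode_words("the axolotl sat")   # [1, 0, 3]: "axolotl" is lost

# Byte-level fallback never loses information, but uses one token per byte.
fallback_ids = list("axolotl".encode("utf-8"))  # 7 tokens for a single word
```

Subword tokenizers aim for the middle ground: no information lost, and far fewer tokens than one per byte.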

How It Works

Most modern LLMs use Byte Pair Encoding (BPE) or a close variant. BPE starts with individual characters (or bytes), then repeatedly merges the most frequent adjacent pair of symbols into a new symbol, for a fixed number of merges N (typically 10k–100k). The result is a vocabulary where common words get their own token and rare words decompose into familiar subword pieces. As long as the base alphabet covers the input, which byte-level variants guarantee, no word is ever truly unknown: in the worst case it falls back to individual characters or bytes.
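The merge loop described above can be sketched in a few lines of Python. This is a minimal character-level illustration, not any particular library's implementation: the function names (`train_bpe`, `merge_pair`, `encode_word`), the toy corpus, and the whitespace pre-splitting are all assumptions made for the example, and real tokenizers add details such as end-of-word markers, byte-level alphabets, and faster data structures.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    # Each distinct word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def encode_word(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (assumed for illustration): frequent "low" and "-est" patterns.
corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges = train_bpe(corpus, num_merges=4)
pieces = encode_word("lowest", merges)
print(pieces)  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it segments into two pieces the merges did learn. With only four merges the vocabulary stays tiny; raising `num_merges` lets more whole words become single tokens.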

Key Sources

bpe-neural-machine-translation-subword-units

Related Concepts

subword-units — the pieces BPE produces
vocabulary — the set of all tokens a model knows
compression — BPE repurposes a compression algorithm for word segmentation
