What It Is

A model’s vocabulary is the finite set of tokens it can recognize and generate. Each token maps to an integer ID, and each ID indexes one row of the embedding matrix, which gives the token its embedding vector. Vocabulary size is a fundamental architectural choice: GPT-2 uses 50,257 tokens; LLaMA 2 uses 32,000; the cl100k_base encoding that tiktoken provides for GPT-4 has roughly 100,000.
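The token-to-ID-to-vector chain can be sketched in a few lines. This is a toy illustration, not any real model's tokenizer: the five-entry vocabulary, the `<unk>` fallback, the embedding width of 4, and the `embed` helper are all assumptions made up for the example.

```python
import random

# Toy vocabulary (hypothetical): every token the "model" knows, each with an integer ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4}
embedding_dim = 4  # illustrative only; real models use hundreds to thousands

rng = random.Random(0)
# One row per token ID, so the table has vocab_size rows and embedding_dim columns.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map tokens to IDs, then IDs to rows of the embedding table."""
    # Anything outside the vocabulary falls back to <unk> in this toy setup.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    vectors = [embedding_table[i] for i in ids]
    return ids, vectors

ids, vectors = embed(["the", "cat", "purred"])
print(ids)  # [1, 2, 0]: "purred" is not in the vocabulary
```

Real subword tokenizers rarely hit an unknown token, because unseen words decompose into smaller pieces that are in the vocabulary; the `<unk>` fallback here only keeps the sketch short.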

Why It Matters

Vocabulary size controls a tradeoff. Larger vocabularies produce shorter sequences (faster inference, more content per context window) and give common words dedicated IDs (cleaner embeddings). But they require more embedding parameters, and they can hurt cross-lingual transfer, because different languages must share one fixed budget of entries. Smaller vocabularies extend to any language, since any text can be broken into small enough pieces, but they produce longer sequences and force the model to compose meaning from those pieces.
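The parameter side of the tradeoff is simple arithmetic: the embedding table has one row per token, so its size is vocabulary size times hidden size. The sketch below works this out for a few configurations. The vocabulary sizes are the ones quoted earlier; the hidden sizes (768 for GPT-2 small, 4,096 for LLaMA 2 7B), the third configuration, and the `embedding_params` helper are assumptions added for illustration.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Count parameters in the input embedding table.

    If the output projection is not tied to the input table, it is a second
    matrix of the same shape, so the count doubles.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table

# name -> (vocabulary size, hidden size); hidden sizes are illustrative assumptions
configs = {
    "GPT-2 small": (50_257, 768),
    "LLaMA 2 7B": (32_000, 4_096),
    "hypothetical 100k-vocab model": (100_000, 4_096),
}

for name, (vocab_size, d_model) in configs.items():
    count = embedding_params(vocab_size, d_model)
    print(f"{name}: {count:,} embedding parameters")
```

At a fixed hidden size, growing the vocabulary from 32,000 to 100,000 entries roughly triples the table, which is the cost paid for the shorter sequences.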

How It Works

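Modern tokenizers define vocabularies via BPE or a variant: start from a base alphabet of characters or bytes, repeatedly merge the most frequent adjacent pair of symbols into a new token, and stop once the target vocabulary size is reached. The final size is roughly the base alphabet plus the number of merges N plus any special tokens; GPT-2’s 50,257 tokens, for example, are 256 byte symbols, 50,000 merges, and one end-of-text token. N therefore determines what fraction of words get their own token and what fraction decompose into pieces. The vocabulary is fixed at training time: inference always uses the same vocabulary the model was trained with.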
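The merge loop can be sketched as follows. This is a minimal character-level BPE trainer, assuming whitespace-split words, merges that never cross word boundaries, and an arbitrary but deterministic tie-break. Production tokenizers differ in pre-tokenization, byte handling, and special tokens, and the `train_bpe` name and the toy corpus are made up for the example.

```python
from collections import Counter

def train_bpe(corpus: str, target_vocab_size: int):
    """Learn BPE merges until the vocabulary reaches target_vocab_size."""
    # Each distinct word is a tuple of symbols, initially single characters.
    words = Counter(tuple(word) for word in corpus.split())
    vocab = sorted({ch for word in words for ch in word})
    merges = []

    while len(vocab) < target_vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token

        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        if merged not in vocab:
            vocab.append(merged)

        # Rewrite every word, replacing occurrences of the pair left to right.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return vocab, merges

corpus = "low low low lower lowest newer newer wider"
vocab, merges = train_bpe(corpus, target_vocab_size=12)
print(len(vocab), merges)
```

The toy corpus has 10 distinct characters, so a target of 12 allows exactly two merges. Raising the target lets frequent words such as "low" collapse into single tokens, while rarer words stay split into pieces.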

Key Sources