Vocabulary

What It Is

A model’s vocabulary is the finite set of tokens it can recognize and generate. Each token maps to an integer ID, and each ID selects one row of the embedding table, giving the token's embedding vector. Vocabulary size is a fundamental architectural choice: GPT-2 uses 50,257 tokens; LLaMA 2 uses 32,000; GPT-4’s encoding, exposed through the tiktoken library, has roughly 100,000.
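The token → ID → embedding chain can be sketched in a few lines. This is a minimal illustration, not how any particular model implements it: the six-entry vocabulary, the `EMBED_DIM` of 4, the random embedding values, and the `<unk>` fallback are all assumptions made for the example (subword tokenizers usually decompose unknown words into pieces rather than mapping them to a single unknown token).

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of tokens.
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
token_to_id = {token: i for i, token in enumerate(vocab)}

EMBED_DIM = 4  # illustrative only; real models use hundreds or thousands
rng = random.Random(0)
# Embedding table: one row per token ID (random here, learned in a real model).
embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def encode(tokens):
    """Map tokens to integer IDs; anything outside the vocabulary becomes <unk>."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(t, unk) for t in tokens]


def embed(ids):
    """Look up one embedding row per ID."""
    return [embeddings[i] for i in ids]


ids = encode(["the", "cat", "sat", "on", "the", "dog"])
vectors = embed(ids)
print(ids)
```

Both occurrences of "the" get the same ID and therefore the same vector, while "dog" falls outside this toy vocabulary and shares the `<unk>` row.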

Why It Matters

Vocabulary size controls a tradeoff. Larger vocabularies mean shorter sequences (faster inference, more content per context window) and dedicated IDs for common words (cleaner embeddings). But they require more embedding parameters, since every token needs its own embedding vector, and they can hurt cross-lingual transfer: all languages have to share one fixed token budget. Smaller vocabularies, built from characters or short subword pieces, can cover any language but produce longer sequences and force the model to compose meaning from more pieces.
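The parameter side of this tradeoff is simple arithmetic: the embedding table has one row of width `d_model` per token. The sketch below is a rough back-of-the-envelope calculation using the vocabulary sizes mentioned earlier; the hidden size of 4096 and the `tied` flag (whether the output projection reuses the input embedding table) are assumptions chosen for illustration, not the configuration of any specific model.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters spent on token embeddings.

    tied=True  -> input and output layers share one table.
    tied=False -> a separate output projection of the same shape is added.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table


D_MODEL = 4096  # illustrative hidden size, assumed for this example

for name, size in [("32k", 32_000), ("50k", 50_257), ("100k", 100_000)]:
    tied = embedding_params(size, D_MODEL)
    untied = embedding_params(size, D_MODEL, tied=False)
    print(f"{name}: {tied:,} tied, {untied:,} untied")
```

Under these assumptions, tripling the vocabulary triples the embedding cost, which matters most for small models where the embedding table is a large share of the total parameters.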

How It Works

Modern tokenizers typically define vocabularies via byte-pair encoding (BPE) or a variant: start from individual characters, run N merge operations that each add one new token, and stop at the target vocabulary size. The number of merges N determines what fraction of words get their own token rather than decomposing into pieces. The vocabulary is fixed at training time: inference always uses the same vocabulary the model was trained with.
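The merge loop described above can be sketched as a toy trainer. This is a simplified illustration under stated assumptions, not a production tokenizer: `train_bpe` is a hypothetical helper, the corpus is a handful of repeated words, ties between equally frequent pairs are broken by first occurrence, and real implementations add details such as byte-level fallback and end-of-word markers.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}  # start from characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # most frequent pair
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)  # each merge adds exactly one token
        # Rewrite every word, replacing occurrences of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges, corpus


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
vocab, merges, corpus = train_bpe(words, num_merges=4)
print(merges)
print(sorted(corpus))
```

In this toy corpus, four merges are enough for the frequent word "low" to become a single token, while the rarer "lower" still decomposes into "low", "e", "r". Raising `num_merges` moves more words into the single-token group, at the cost of a larger vocabulary.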

Key Sources

bpe-neural-machine-translation-subword-units

Related Concepts

tokenization — how vocabularies are built
subword-units — what vocabularies contain
in-context-learning — vocabulary size affects how much fits in a context window
