BPE: Neural Machine Translation of Rare Words with Subword Units

Concepts: tokenization | subword-units | vocabulary | compression Builds on: (no direct prerequisite in this wiki — BPE compression algorithm originates in Gage 1994) Leads to: language-models-are-unsupervised-multitask-learners | attention-is-all-you-need | bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding

Every time GPT-4 tokenizes your message, it runs an algorithm invented for data compression in 1994 and repurposed for language models in this paper. The idea is almost embarrassingly simple. It works on every language in the world. And for a decade, no one found anything fundamentally better.

The problem

Neural machine translation models need a fixed vocabulary. You pick the 30,000 most common words, assign each an ID, and your model only knows those words. Anything else — a rare German compound, a Russian verb form, a person’s name — maps to a single <UNK> token and the model has no idea what to do with it.

“Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem.”

The standard workaround was dictionary back-off: after translation, scan for <UNK> tokens and look up the source word in a bilingual dictionary. This is clunky, misses morphological variants, and fails entirely on names and rare compounds. It’s treating a symptom, not the disease.

The core idea

The analogy: A child who has never seen the word “unbreakable” can still puzzle it out. They recognize “un-” as negation, “break” as a familiar verb, “-able” as “capable of.” The word is new; its pieces aren’t. They don’t need a dictionary entry for “unbreakable” because they can reconstruct its meaning from components they already know.

BPE does the same thing for a translation model. Instead of a vocabulary of whole words, build a vocabulary of subword units — pieces that appear frequently enough across the corpus to be worth knowing. Rare words get represented as sequences of familiar pieces. Common words get their own token. Names get transliterated character by character. No word is ever truly unknown.

The mechanism:

BPE starts from a 1994 data compression paper by Philip Gage. The original algorithm compressed text by replacing the most frequent pair of bytes with a single unused byte. Sennrich et al. realized that applied to characters rather than bytes, this naturally discovers meaningful subword units.

Here’s the algorithm:

Initialize your vocabulary as individual characters plus a special end-of-word marker </w>. Every word in your training corpus is represented as a sequence of characters.
Count the frequency of every adjacent character pair across the entire corpus.
Find the most frequent pair.
Merge that pair into a new symbol everywhere it appears.
Repeat from step 2 until you’ve done N merge operations (typically 10,000–60,000).

The result is a vocabulary of variable-length subwords. Frequent patterns get merged early and become whole tokens. Rare patterns stay as sequences of smaller pieces.

“The main difference to other compression algorithms is that we do not compress pairs of bytes, but pairs of characters (or character sequences). We do not aim to achieve the minimal representation that would be optimal for compression, but instead aim to find a good vocabulary for a given vocabulary size.”

BEFORE BPE (character-level):
"lowest"  →  l  o  w  e  s  t  </w>      (7 tokens — too slow)
"running" →  r  u  n  n  i  n  g  </w>   (8 tokens)

AFTER BPE (word-level):
"lowest"  →  [lowest]                     (1 token — but OOV on rare words)
"running" →  [running]                    (1 token — but OOV for inflections)

AFTER BPE SUBWORD (10k–60k merges):
"lowest"  →  [low]  [est</w>]            (2 tokens — familiar pieces)
"running" →  [run]  [n]  [ing</w>]       (3 tokens — morpheme-aligned)
"Bundesrepublik" → [Bund] [es] [rep] [ublik</w>]  (compound decomposed)
"Гуглом" (Russian: "by Google") → [Гу] [гл] [ом</w>]  (transliterated pieces)

Rare word → familiar subwords → translatable

Walkthrough with real numbers:

Let’s trace BPE from scratch on the paper’s own example vocabulary.

Initial state — the corpus contains four word types with their frequencies:

Corpus (word → character sequence):
  l o w </w>         (appears 5 times)
  l o w e r </w>     (appears 2 times)
  n e w e s t </w>   (appears 6 times)
  w i d e s t </w>   (appears 3 times)

Count every adjacent pair across all occurrences:

Pair counts:
  (e, s): 6 + 3 = 9   ← from "newest" and "widest"
  (s, t): 6 + 3 = 9
  (t, </w>): 6 + 3 = 9
  (l, o): 5 + 2 = 7
  (o, w): 5 + 2 = 7
  (w, e): 6 + 2 = 8   ← from "newest" (position 3–4) and "lower"
  (n, e): 6
  (e, w): 6
  (w, i): 3
  ...

Merge 1: Pick (e, s) → es (count 9, wins the three-way tie):

  l o w </w>          (5)
  l o w e r </w>      (2)
  n e w es t </w>     (6)
  w i d es t </w>     (3)

Merge 2: Now (es, t) = 9 — merge to est:

  l o w </w>          (5)
  l o w e r </w>      (2)
  n e w est </w>      (6)
  w i d est </w>      (3)

Merge 3: (est, </w>) = 9 — merge to est</w>:

  l o w </w>          (5)
  l o w e r </w>      (2)
  n e w est</w>       (6)
  w i d est</w>       (3)

Merge 4: (l, o) = 7 — merge to lo:

  lo w </w>           (5)
  lo w e r </w>       (2)
  n e w est</w>       (6)
  w i d est</w>       (3)

Merge 5: (lo, w) = 7 — merge to low:

  low </w>            (5)
  low e r </w>        (2)
  n e w est</w>       (6)
  w i d est</w>       (3)

After 5 merges: the subword units low, est</w> (meaning “est” at end of word) have been discovered. Now apply these merges to a new word at inference time:

"lowest" (not seen in training) → apply merges in order:
  Start:     l o w e s t </w>
  Merge es:  l o w es t </w>     ← (e,s) was merge #1
  Merge est: l o w est </w>      ← (es,t) was merge #2
  Merge est</w>: l o w est</w>   ← merge #3
  Merge lo:  lo w est</w>        ← merge #4
  Merge low: low est</w>         ← merge #5
  
  Final tokens: ["low", "est</w>"]
  
"newer" (not seen in training):
  Start:     n e w e r </w>
  After all merges: n e w e r </w>   ← none of our 5 merges apply
  
  Final tokens: ["n", "e", "w", "e", "r", "</w>"]

“lowest” gets 2 tokens (compact, translatable). “newer” falls back to characters (never truly unknown, just less efficient). With 60,000 merges instead of 5, nearly everything common gets a dedicated token and nearly everything rare still decomposes into recognizable pieces.

What’s clever:

The insight isn’t the BPE algorithm itself — that’s 1994 compression theory. The insight is the reuse. Data compression and word segmentation look like completely different problems: one wants the shortest code, the other wants the most linguistically meaningful units. But they share a deeper structure: both benefit from identifying recurrent patterns.

Language is redundant in a structured way. The suffix “-ing” appears constantly. The prefix “un-” appears constantly. German compounds repeat their constituent parts. Russian verb stems appear across dozens of inflected forms. BPE’s frequency-based merging naturally discovers these patterns because they’re statistically dominant — not because anyone told it what morphemes are.

“We propose a simple and effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).”

The other key insight: they introduce joint BPE, applying the algorithm to the concatenated source and target vocabularies. When English “est” and German “est” share a merge rule, the model can learn cross-lingual morphological mappings. Related languages can literally share subword tokens.

Does it work? What breaks?

System	EN→DE BLEU	EN→RU BLEU	vs. Back-off Dict
Word model + dictionary back-off	baseline	baseline	—
Character n-gram (C2-50)	+0.5	+0.4	slower at inference
BPE (10k merges)	+0.9	+1.1	faster than char
BPE (60k merges)	+1.1	+1.3	best speed/accuracy

Best BPE variant improves over the back-off dictionary by +1.1 BLEU (English→German, WMT 2015) and +1.3 BLEU (English→Russian) — measured on news-test2015. Joint BPE on related language pairs (same script, shared morphology) does better than separate BPE.

What doesn’t work:

BPE’s vocabulary size is a hyperparameter with no principled choice. Too few merges: every word is split into many pieces, sequences get long, the decoder struggles. Too many merges: you’ve essentially rebuilt a word vocabulary and rare words are unknown again.

BPE isn’t actually morphologically aware — it’s statistically greedy. It will merge “the” + “re” → “there” before it merges “un” + “happy” if “there” appears more often. WordPiece (Google’s variant, used in BERT) addresses this by maximizing likelihood instead of frequency, but the philosophical difference is small in practice.

The algorithm is also byte-order sensitive. Different tokenizers on the same text can produce wildly different splits. GPT-2’s byte-level BPE (operating on raw bytes, not Unicode characters) sidesteps encoding issues at the cost of longer sequences for non-Latin scripts.

So what?

If you’re training or fine-tuning any language model today, you are using BPE or a direct descendant. GPT-2, GPT-3, GPT-4 use BPE (via tiktoken for GPT-3.5+). LLaMA, Mistral, and the entire Llama family use SentencePiece BPE. T5 uses SentencePiece with a unigram language model variant. BERT uses WordPiece — same core idea, different merge criterion. The vocabulary size you see in model cards (32k, 50k, 100k tokens) is the N in “how many BPE merges did you run.”

This means vocabulary size is a real design choice. Larger vocabularies mean more tokens have dedicated IDs (faster, shorter sequences for common text) but require more embedding parameters and hurt cross-lingual transfer. Smaller vocabularies mean shorter embedding tables but longer sequences and more compositional pressure on the model. Most LLM practitioners set 32k–100k and don’t revisit it. That’s probably worth revisiting.

The paper this connects most directly to is the Transformer (attention-is-all-you-need): transformers need a tokenizer, and every transformer paper since 2017 silently inherits this one’s design. When GPT-2 (language-models-are-unsupervised-multitask-learners) claimed to do zero-shot translation, the BPE tokenizer was doing some of that work — shared subwords across languages let the model discover cross-lingual patterns without explicit alignment supervision.

BPE: take a 1994 compression algorithm, apply it to characters instead of bytes, and accidentally solve the vocabulary problem that had been blocking production NMT for years.

Paper: Neural Machine Translation of Rare Words with Subword Units — Sennrich, Haddow, Birch — ACL 2016

Connections

tokenization — the process BPE defines; every modern LLM tokenizer is a variant
subword-units — the core output of BPE: pieces between characters and words
vocabulary — BPE defines what the vocabulary is and how large it can be
compression — BPE’s algorithm is repurposed from data compression (Gage 1994)
language-models-are-unsupervised-multitask-learners — GPT-2 uses BPE on byte sequences
attention-is-all-you-need — transformers rely on BPE tokenization
bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — BERT uses WordPiece, a BPE variant

Citation

arXiv:1508.07909

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725. https://arxiv.org/abs/1508.07909

ML Wiki

Explorer