Subword Units

What It Is

Subword units are token pieces that sit between whole words and individual characters: fragments like “un-”, “-ing”, “##est”, or “low” (the “##” prefix is WordPiece’s marker for a piece that continues a word). They’re the vocabulary items produced by tokenization algorithms such as BPE, WordPiece, and the unigram language model implemented in SentencePiece.

Why It Matters

Whole-word vocabularies fail on rare words and morphologically rich languages: “running”, “runner”, and “ran” are three separate entries sharing nothing, and any word missing from the vocabulary collapses to an unknown token. Character-level models handle any word but produce much longer sequences, one token per character. Subword units hit the sweet spot: common words get their own token, rare words decompose into known pieces, and the vocabulary stays bounded.
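To make "rare words decompose into known pieces" concrete, here is a minimal sketch of greedy longest-match-first segmentation, the lookup style used by WordPiece-based tokenizers. The function name `segment`, the `TOY_VOCAB` set, and the `[UNK]` fallback string are all assumptions made for illustration; a real vocabulary is learned from data and is far larger.

```python
def segment(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No known piece covers this position: give up on the word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, hand-written for this example.
TOY_VOCAB = {"low", "run", "un", "##est", "##er", "##ing", "##n", "##s"}

print(segment("low", TOY_VOCAB))      # a common word keeps its own token
print(segment("lowest", TOY_VOCAB))   # a rarer word splits into known pieces
print(segment("running", TOY_VOCAB))
```

With this toy vocabulary, “lowest” is never stored as a whole word, yet it is still representable as “low” plus “##est”. Only a word containing a character sequence that no piece covers falls back to the unknown token.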

How It Works

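Subword tokenizers discover units by statistics, not linguistics. BPE starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols. WordPiece grows its vocabulary by merging as well, but picks the pair that most increases the likelihood of the training data. The unigram language model, implemented in SentencePiece, works in the opposite direction: it starts from a large candidate vocabulary and prunes it down to the target size. None requires a morphological analyzer; they find whatever structure the data actually has.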
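The BPE merge loop is short enough to sketch directly. The version below is a simplified illustration, not a reference implementation: the function name `learn_bpe` and the toy word counts are assumptions, and it omits the end-of-word marker that practical BPE tokenizers usually add.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Each word starts as a tuple of single-character symbols.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical word counts, chosen only to make the merges easy to follow.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(corpus)
```

On these counts the frequent word “low” ends up as a single symbol, and the shared suffix “est” becomes a unit reused by “newest” and “widest”. Nothing in the procedure knows that “est” is a suffix.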

Key Sources

bpe-neural-machine-translation-subword-units

Related Concepts

tokenization: the process that produces subword units
vocabulary: the full set of subword types the model knows
compression: BPE’s algorithmic origin as a data-compression technique
