What It Is
Subword units are token pieces that sit between whole words and individual characters: fragments like “un-”, “-ing”, “##est”, or “low”. They are the vocabulary items produced by tokenization algorithms such as BPE, WordPiece, and the unigram language model. SentencePiece is a library that implements BPE and the unigram model, not a separate algorithm.
Why It Matters
Whole-word vocabularies fail on rare words and morphologically rich languages: any word outside the vocabulary collapses to a single unknown token, and “running”, “runner”, “ran” are three separate entries sharing nothing. Character-level models handle any word but produce much longer sequences. Subword units hit the sweet spot: common words get their own token, rare words decompose into known pieces, and the vocabulary stays bounded.
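To make the decomposition concrete, here is a minimal sketch of greedy longest-match segmentation against a fixed subword vocabulary. The vocabulary, the `segment` function name, and the `[UNK]` fallback are illustrative assumptions, and the “##” prefix follows the WordPiece-style convention of marking word-internal pieces; real tokenizers add normalization and other details omitted here.

```python
def segment(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right.

    Pieces that do not start the word carry a "##" prefix.
    If some position cannot be matched, the whole word becomes `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = {"run", "ran", "low", "##ning", "##ner", "##est"}

print(segment("running", vocab))  # ['run', '##ning']
print(segment("runner", vocab))   # ['run', '##ner']
print(segment("lowest", vocab))   # ['low', '##est']
print(segment("xyz", vocab))      # ['[UNK]']
```

With six vocabulary entries the sketch covers “running”, “runner”, and “lowest”, none of which is stored as a whole word, and the two “run” forms now share a piece.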
How It Works
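Subword tokenizers discover units by statistics, not linguistics. BPE starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols. WordPiece also builds its vocabulary by merging pairs, but picks the pair that most increases the likelihood of the training data. The unigram language model, implemented in SentencePiece, works in the opposite direction: it prunes a large initial vocabulary down to the target size. None require a morphological analyzer; the units reflect frequency patterns in the training corpus rather than linguistic rules.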
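The BPE merge loop can be sketched in a few lines. This is a simplified illustration, not a reference implementation: the function names (`learn_bpe`, `apply_bpe`, `merge_pair`), the toy word counts, and the tie-breaking rule are assumptions, and the end-of-word marker used in common BPE variants is left out for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: count} mapping."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically larger
        # pair (an arbitrary choice made here only for determinism).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy word counts, assumed for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=4)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
print(apply_bpe("lowest", merges))  # ('low', 'est')
```

“lowest” never appears in the toy corpus, yet it segments into “low” and “est”, two units the merges produced purely from pair frequencies.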
Key Sources
Related Concepts
- tokenization — the process that produces subword units
- vocabulary — the full set of subword types the model knows
- compression — BPE’s algorithmic origin