What It Is
A masked language model (MLM) is a pre-training objective in which a fraction of input tokens is corrupted (typically replaced with a [MASK] token, sometimes with a random token), and the model must predict the original tokens from the surrounding context. BERT introduced MLM as the mechanism for enabling deep bidirectional pre-training.
Why It Matters
MLM is what makes deep bidirectionality possible. A standard autoregressive language model conditions each prediction only on preceding tokens, so its training signal is inherently left-to-right. By masking tokens and asking the model to predict them from context on both sides simultaneously, MLM removes that directional constraint. The result is representations that incorporate the full sentence context at every layer, producing richer encodings for tasks like classification, question answering, and named entity recognition.
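The bidirectional setup can be illustrated at the text level. This is a toy sketch with whitespace tokenization (a real model operates on WordPiece subwords), showing that the masked position has usable context on both sides:

```python
# Toy text-level sketch of the MLM objective; real tokenization uses
# WordPiece subwords, not whitespace splitting.
sentence = "the capital of france is paris".split()

masked_index = 3                      # hide "france"
inputs = sentence[:masked_index] + ["[MASK]"] + sentence[masked_index + 1:]
target = sentence[masked_index]       # the model must recover this token

# Unlike left-to-right prediction, the model sees context on BOTH
# sides of the masked position:
left_context = inputs[:masked_index]        # ["the", "capital", "of"]
right_context = inputs[masked_index + 1:]   # ["is", "paris"]
```

Here "is paris" on the right is exactly the kind of evidence a purely left-to-right model could not use when predicting the hidden token.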
How It Works
BERT selects 15% of WordPiece token positions per sequence for prediction. Of those selected:
- 80% are replaced with [MASK]
- 10% are replaced with a random vocabulary token
- 10% are left unchanged
The model predicts the original token at each selected position using a softmax over the full vocabulary. The 80/10/10 split is a practical compromise: always using [MASK] would create a mismatch between pre-training and fine-tuning (since [MASK] never appears at fine-tuning time), so the random and unchanged cases force the model to maintain a useful contextual representation for every input token.
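The corruption step above can be sketched in a few lines. This is a minimal illustration over integer token ids, not BERT's actual data pipeline; the MASK_ID and VOCAB_SIZE values are illustrative placeholders:

```python
import random

MASK_ID = 103          # placeholder [MASK] token id (illustrative)
VOCAB_SIZE = 30522     # placeholder WordPiece vocabulary size (illustrative)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random token, and leave 10%
    unchanged. Returns (corrupted_ids, labels), where labels holds the
    original id at selected positions and None elsewhere — only the
    selected positions contribute to the loss."""
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:          # position selected for prediction
            labels[i] = tok                   # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% left unchanged (but still predicted)
    return corrupted, labels
```

Note that the 10% unchanged case still sets a label, so the model cannot assume an unmasked-looking token is correct — this is what keeps its representations of every token contextual.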
The loss at each masked position is cross-entropy against the original token. Only masked positions contribute to the pre-training gradient.
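The masked-positions-only loss can be written out directly. A minimal sketch in plain Python (real implementations vectorize this, e.g. via an ignore-index in the loss), assuming per-position logit lists and the label convention from the text, with None marking unselected positions:

```python
import math

def mlm_loss(logits, labels):
    """Mean cross-entropy over masked positions only.
    logits: one list of vocabulary scores per position;
    labels: original token id at selected positions, None elsewhere."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label is None:
            continue                  # unmasked position: no loss contribution
        # cross-entropy = -log softmax(scores)[label]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / max(count, 1)
```

Because unselected positions are skipped entirely, the pre-training gradient flows only through the ~15% of positions chosen for prediction, exactly as described above.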
Key Sources
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
- bart-denoising-sequence-to-sequence-pre-training
- language-models-are-unsupervised-multitask-learners
- mae-masked-autoencoders-scalable-vision-learners