What It Is
A masked language model (MLM) is a pre-training objective in which a fraction of input tokens is corrupted (typically replaced with a [MASK] token, sometimes with a random token), and the model must predict the original tokens from the surrounding context. BERT introduced MLM as the mechanism for enabling deep bidirectional pre-training.
Why It Matters
MLM is what makes deep bidirectionality possible. A standard autoregressive language model conditions each prediction only on preceding tokens, so its training signal is inherently left-to-right. By masking tokens and asking the model to predict them from context on both sides simultaneously, MLM removes that directional constraint. The result is representations that incorporate the full sentence context at every layer, producing richer encodings for tasks like classification, question answering, and named entity recognition.
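The bidirectional setup can be illustrated at the text level. This is a toy sketch with whitespace tokenization (a real model operates on WordPiece subwords), showing that the masked position has usable context on both sides:

```python
# Toy text-level sketch of the MLM objective; real tokenization uses
# WordPiece subwords, not whitespace splitting.
sentence = "the capital of france is paris".split()

masked_index = 3                      # hide "france"
inputs = sentence[:masked_index] + ["[MASK]"] + sentence[masked_index + 1:]
target = sentence[masked_index]       # the model must recover this token

# Unlike left-to-right prediction, the model sees context on BOTH
# sides of the masked position:
left_context = inputs[:masked_index]        # ["the", "capital", "of"]
right_context = inputs[masked_index + 1:]   # ["is", "paris"]
```

Here "is paris" on the right is exactly the kind of evidence a purely left-to-right model could not use when predicting the hidden token.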
How It Works
BERT selects 15% of WordPiece token positions per sequence for prediction. Of those selected:
- 80% are replaced with [MASK]
- 10% are replaced with a random vocabulary token
- 10% are left unchanged
The model predicts the original token at each selected position using a softmax over the full vocabulary. The 80/10/10 split is a practical compromise: always using [MASK] would create a mismatch between pre-training and fine-tuning (since [MASK] never appears at fine-tuning time), so the random and unchanged cases force the model to maintain a useful contextual representation for every input token.
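The corruption step above can be sketched in a few lines. This is a minimal illustration over integer token ids, not BERT's actual data pipeline; the MASK_ID and VOCAB_SIZE values are illustrative placeholders:

```python
import random

MASK_ID = 103          # placeholder [MASK] token id (illustrative)
VOCAB_SIZE = 30522     # placeholder WordPiece vocabulary size (illustrative)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style corruption: select ~15% of positions; of those,
    replace 80% with [MASK], 10% with a random token, and leave 10%
    unchanged. Returns (corrupted_ids, labels), where labels holds the
    original id at selected positions and None elsewhere — only the
    selected positions contribute to the loss."""
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:          # position selected for prediction
            labels[i] = tok                   # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% left unchanged (but still predicted)
    return corrupted, labels
```

Note that the 10% unchanged case still sets a label, so the model cannot assume an unmasked-looking token is correct — this is what keeps its representations of every token contextual.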
The loss at each masked position is cross-entropy against the original token. Only masked positions contribute to the pre-training gradient.
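The masked-positions-only loss can be written out directly. A minimal sketch in plain Python (real implementations vectorize this, e.g. via an ignore-index in the loss), assuming per-position logit lists and the label convention from the text, with None marking unselected positions:

```python
import math

def mlm_loss(logits, labels):
    """Mean cross-entropy over masked positions only.
    logits: one list of vocabulary scores per position;
    labels: original token id at selected positions, None elsewhere."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label is None:
            continue                  # unmasked position: no loss contribution
        # cross-entropy = -log softmax(scores)[label]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[label]
        count += 1
    return total / max(count, 1)
```

Because unselected positions are skipped entirely, the pre-training gradient flows only through the ~15% of positions chosen for prediction, exactly as described above.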
Key Sources
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
- bart-denoising-sequence-to-sequence-pre-training
- language-models-are-unsupervised-multitask-learners
- mae-masked-autoencoders-scalable-vision-learners