Concepts: attention | transfer learning | masked language model | fine-tuning
Builds on: Attention Is All You Need
Leads to: LoRA (fine-tuning methods assume BERT-style pre-trained models) | InstructGPT (RLHF builds on the pre-train then fine-tune paradigm BERT established)
Part 1: The Problem
Before BERT, every NLP task was essentially a solo project. You had a dataset for sentiment analysis, a dataset for question answering, a dataset for named entity recognition — and you trained a separate architecture from scratch for each one. The best models used left-to-right language models (GPT) or shallow bidirectional tricks (ELMo) for pre-training, but both approaches left something on the table: neither could look at a word and simultaneously reason about everything to its left and everything to its right at every layer of the network.
The key insight the paper makes is surgical: “The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training.” In question answering, when you read “What city did the defendant flee to?”, every word needs to absorb context from both directions to mean anything. A left-to-right model reading “defendant” has only seen “What city did the” — it doesn’t yet know it’s talking about fleeing. That’s structural impoverishment baked into the training objective.
BERT fixes this with one trick: hide words at random and train the model to guess them from both sides.
Part 2: The Mechanism
The Analogy
Think about how you learned language. You filled in blanks. “The cat sat on the ___.” You didn’t work left to right predicting the next word — you absorbed the whole sentence and guessed what fit. That’s a Cloze test, and it’s exactly what BERT trains on.
The cleverness is that to fill in a blank well, you have to understand syntax, semantics, world knowledge, and discourse structure all at once. Every blank becomes a tiny final exam on everything the model knows about language. Train on enough of them and the representations that emerge capture essentially all of that.
The Mechanism in Detail
BERT has two phases: pre-training and fine-tuning. In pre-training, the model sees massive amounts of raw text (BooksCorpus, 800M words; English Wikipedia, 2,500M words) and trains on two self-supervised tasks. In fine-tuning, the pre-trained weights initialize a task-specific model that trains end-to-end on labeled data with a single new output layer.
The architecture is a standard Transformer encoder stack. BERT-Base: 12 layers, hidden size 768, 12 attention heads, 110M parameters. BERT-Large: 24 layers, hidden size 1024, 16 attention heads, 340M parameters. Every token can attend to every other token in every layer. No masking of future tokens, no autoregressive constraint. Just full bidirectional attention.
The input representation sums three learned embeddings: a token embedding (WordPiece, 30,000 vocabulary), a segment embedding (is this token in sentence A or sentence B?), and a position embedding. Every sequence starts with a special [CLS] token. Sentence pairs are joined with a [SEP] separator.
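The three-way sum can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized tables standing in for learned embeddings; the shapes follow BERT-Base, and all names here are illustrative, not BERT's actual variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 768
# Hypothetical embedding tables (learned parameters in real BERT):
token_emb = rng.normal(size=(30000, H))  # WordPiece vocabulary embeddings
seg_emb   = rng.normal(size=(2, H))      # sentence A vs. sentence B
pos_emb   = rng.normal(size=(512, H))    # absolute positions 0..511

def embed(token_id, segment, position):
    """Input representation = token + segment + position embedding."""
    return token_emb[token_id] + seg_emb[segment] + pos_emb[position]

x = embed(token_id=103, segment=0, position=4)  # e.g. a [MASK] token
```

The 512 rows in the position table are also why BERT's context window caps at 512 tokens: there is simply no learned embedding for position 513.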
Pre-training Task 1 is the Masked LM. The paper describes it directly: “In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.” Specifically, 15% of tokens are selected for prediction. Of those selected, 80% get replaced with [MASK], 10% get replaced with a random token, and 10% stay unchanged. The 80/10/10 split handles a practical problem: [MASK] never appears during fine-tuning, so training entirely on [MASK] would create a distribution mismatch. The 10% random and 10% unchanged cases teach the model to maintain representations even for unmasked tokens.
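The selection logic above is simple enough to sketch directly. A minimal version, assuming the caller has already tokenized the text and excluded special tokens like [CLS] and [SEP] from masking:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Apply BERT's 80/10/10 masking scheme to a token sequence.
    Returns (corrupted_inputs, targets); targets[i] is None where
    position i is not a prediction target."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            targets.append(tok)                      # prediction target
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            targets.append(None)                     # no loss at this position
            inputs.append(tok)
    return inputs, targets
```

Note that the loss is computed only at the selected 15% of positions, not over the whole sequence.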
Pre-training Task 2 is Next Sentence Prediction (NSP). Many tasks — question answering, natural language inference — require understanding the relationship between two sentences. So BERT trains on pairs: given sentences A and B, predict whether B actually follows A in the corpus (IsNext) or is a random sentence (NotNext). The split is 50/50. “When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).”
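Building one NSP training pair is a coin flip. A sketch, assuming sentences are token lists and `corpus` is a flat pool of sentences to sample negatives from (both assumptions, not the paper's data pipeline):

```python
import random

def make_nsp_example(doc_sentences, i, corpus):
    """Build one NSP pair from sentence i of a document."""
    a = doc_sentences[i]
    if random.random() < 0.5:
        b, label = doc_sentences[i + 1], "IsNext"    # 50%: true next sentence
    else:
        b, label = random.choice(corpus), "NotNext"  # 50%: random sentence
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"], label
```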
Fine-tuning is where the pre-trained model earns its value. The paper states: “Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks — whether they involve single text or text pairs — by swapping out the appropriate inputs and outputs.” For classification tasks, the final hidden state of the [CLS] token flows into a classification head. For token-level tasks (NER, question answering), the per-token final hidden states are used directly.
The [CLS] token is a design decision worth pausing on. The paper introduces it as: “The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.” By putting [CLS] at position 0 and training NSP on its representation, the model learns to pack a whole-sequence summary into one vector.
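The entire task-specific head is one linear layer plus a softmax over that [CLS] vector. A minimal NumPy sketch, where `W` and `b` are the only newly initialized parameters (names illustrative):

```python
import numpy as np

def classify(cls_hidden, W, b):
    """Map the [CLS] final hidden state (768,) to K label probabilities.
    W: (K, 768), b: (K,) -- the task-specific head added for fine-tuning."""
    logits = W @ cls_hidden + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = classify(rng.normal(size=768), rng.normal(size=(2, 768)), np.zeros(2))
```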
ASCII Diagram
Pre-training: Masked Language Modeling
------------------------------------------------------------
Input: [CLS] The cat [MASK] on the mat [SEP]
↑ 15% of tokens selected
↑ 80% replaced with [MASK]
Encoder: 12 layers of bidirectional self-attention
(every token can attend to every other token)
[CLS] ←→ The ←→ cat ←→ [MASK] ←→ on ←→ the ←→ mat
↑ attends left AND right
can see "cat" AND "mat"
Output: Hidden state at [MASK] position
→ linear layer → softmax over 30,000 vocab
→ predict: "sat" ✓
Pre-training: Next Sentence Prediction
------------------------------------------------------------
Input: [CLS] She bought a ticket [SEP] She boarded the train [SEP]
or
[CLS] She bought a ticket [SEP] Pandas eat bamboo [SEP]
Output: C (final hidden state of [CLS])
→ binary classifier → IsNext / NotNext
Fine-tuning: Sentiment Classification
------------------------------------------------------------
Pre-trained BERT weights (used as initialization; all updated)
Input: [CLS] This movie was great [SEP]
Output: C → W (K × 768) → softmax → {Positive, Negative}
Only W is new. Everything else: pre-trained.
The Math
The core of each attention layer is standard scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

For BERT-Base, $d_k = 768 / 12 = 64$ per head. Each layer runs 12 heads in parallel, concatenates their outputs, and projects back to 768 dimensions.
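A single attention head is a few lines of NumPy. This is a minimal sketch of one head; real BERT adds learned Q/K/V projection matrices, 12 parallel heads, and an output projection:

```python
import numpy as np

def attention(Q, K, V):
    """One head of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    In BERT-Base, d_k = 768 / 12 heads = 64. No causal mask: every token
    attends to every other token, which is BERT's bidirectionality."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # (seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ V                                            # (seq, d_k)

rng = np.random.default_rng(0)
out = attention(rng.normal(size=(7, 64)), rng.normal(size=(7, 64)),
                rng.normal(size=(7, 64)))
```

A GPT-style decoder would add `scores[i, j] = -inf` for `j > i` before the softmax; BERT deliberately does not.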
The MLM loss at masked position $i$ is cross-entropy over the vocabulary:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p(x_i \mid \hat{x})$$

where $M$ is the set of masked positions and $\hat{x}$ is the corrupted input sequence.
The NSP loss is binary cross-entropy on the [CLS] representation $C$:

$$\mathcal{L}_{\text{NSP}} = -\log p(y \mid C), \qquad y \in \{\text{IsNext}, \text{NotNext}\}$$

Total pre-training loss:

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$
For fine-tuning on classification, a new weight matrix $W \in \mathbb{R}^{K \times 768}$ (K = number of labels) is added on top of $C$:

$$P = \mathrm{softmax}(CW^\top)$$

All pre-trained parameters update during fine-tuning. The only truly new parameters are $W$.
Numeric Walkthrough
Let’s trace one masked token prediction through BERT-Base.
Input: “The quick brown [MASK] jumps over the lazy dog”
Target: “fox” (WordPiece token id: 3419)
Step 1. Token embedding lookup for the [MASK] token (id 103): a 768-dimensional vector $e_{\text{tok}}$.
Step 2. Add the position embedding $e_{\text{pos}}$ (position 4, since [CLS] is position 0).
Step 3. Add the segment embedding $e_{\text{seg}}$ for sentence A.
Result: input vector for [MASK] is $x = e_{\text{tok}} + e_{\text{pos}} + e_{\text{seg}}$.
Step 4. Pass through 12 Transformer encoder layers. In each layer, the [MASK] token’s query vector attends to every other token in the sequence, including [CLS] and [SEP]. In head 1 of layer 1 ($d_k = 64$), the attention score between [MASK] and “quick” is $q_{\text{[MASK]}} \cdot k_{\text{quick}} / \sqrt{64}$.
Step 5. After 12 layers, the final hidden state at position 4 encodes context from both “The quick brown” (left) and “jumps over the lazy dog” (right).
Step 6. Apply the output softmax over the 30,000-token vocabulary:

$$p(w \mid \hat{x}) = \mathrm{softmax}\!\left(h_{\text{[MASK]}} W_{\text{vocab}}^\top\right)$$

where $h_{\text{[MASK]}} \in \mathbb{R}^{768}$ is the final hidden state at the masked position and $W_{\text{vocab}} \in \mathbb{R}^{30000 \times 768}$.
If training is working, $p(\text{fox})$ should be the highest probability. Cross-entropy loss: $-\log p(\text{fox} \mid \hat{x})$.
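The softmax-plus-cross-entropy step can be checked by hand on a toy vocabulary. The scores below are hypothetical numbers for illustration; real BERT scores all 30,000 WordPiece tokens:

```python
import math

# Hypothetical final-layer scores for a 4-token vocabulary at the masked position.
logits = {"fox": 9.0, "dog": 6.0, "cat": 5.5, "mat": 2.0}
z = sum(math.exp(v) for v in logits.values())
probs = {w: math.exp(v) / z for w, v in logits.items()}
loss = -math.log(probs["fox"])  # cross-entropy at the masked position
```

With these scores the model is confident and right, so the loss is small; a uniform guess over four tokens would instead cost $-\log(0.25) \approx 1.39$.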
Over 1 million training steps on 3.3 billion words, this signal propagates through all 110M parameters and teaches the model what “fox” is, what “quick brown” implies, and what “jumps” needs as its subject.
What’s Clever
The 80/10/10 masking split is the subtle engineering that makes it work. If you always replace masked tokens with [MASK], the model sees [MASK] tokens during pre-training but never during fine-tuning. The distribution gap would hurt downstream performance. The 10% random replacement forces the model to maintain a contextual representation of every token even when it looks normal, because any token could secretly be the prediction target. The 10% unchanged forces the same. The model never knows which tokens it needs to reconstruct, so it encodes all of them carefully.
The second clever choice is architecture selection. The paper makes the distinction explicit: “the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.” This one constraint difference — BERT uses a Transformer encoder, GPT uses a Transformer decoder with causal masking — produces dramatically different capabilities. Encoders model representations; decoders model sequences. For classification and extraction, representations are what you want.
The [CLS] token design is elegant. Rather than mean-pooling all token representations at the end, BERT anchors the sequence-level representation to a fixed position (position 0) and trains NSP to put meaningful information there. The model learns to route whole-sequence information into [CLS] as a side effect of NSP training.
The paper also makes the direct comparison to ablations count: “We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.” Shallow bidirectionality (ELMo: train left-to-right separately, train right-to-left separately, concatenate) is not the same as deep bidirectionality (BERT: every layer attends in both directions simultaneously). The ablation studies in Section 5.1 confirm this directly: removing bidirectional pre-training costs 5-10 points on multiple benchmarks.
Part 3: Results and Where It Breaks
GLUE Benchmark
| System | MNLI | QQP | QNLI | SST-2 | CoLA | Average |
|---|---|---|---|---|---|---|
| Pre-OpenAI SOTA | 80.6 | 66.1 | 82.3 | 93.2 | 35.0 | 74.0 |
| OpenAI GPT | 82.1 | 70.3 | 87.4 | 91.3 | 45.4 | 75.1 |
| BERT-Base | 84.6 | 71.2 | 90.5 | 93.5 | 52.1 | 79.6 |
| BERT-Large | 86.7 | 72.1 | 92.7 | 94.9 | 60.5 | 82.1 |
BERT-Large pushes the official GLUE leaderboard score to 80.5% — a 7.7 point absolute improvement over prior state of the art. BERT-Base, despite being the same size as GPT, outperforms GPT on every single GLUE task by changing only the attention direction.
SQuAD v1.1 (Question Answering)
| System | Test F1 |
|---|---|
| Human | 91.2 |
| Prior best ensemble | 90.5 |
| BERT-Large (Ensemble + TriviaQA) | 93.2 |
BERT-Large surpasses human performance at Test F1 93.2 vs. 91.2. “Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system.”
SQuAD v2.0
BERT-Large achieves Test F1 of 83.1, a +5.1 improvement over the previous best system. The trick for v2.0 (which has unanswerable questions) is routing the no-answer case through the [CLS] position: the no-answer score is the start/end score at [CLS], and the model predicts “no answer” unless the best token span score exceeds it by a threshold $\tau$ tuned on the dev set.
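The thresholding logic is a simple comparison at decode time. A sketch under the assumption that index 0 of the logit arrays is the [CLS] position (names and the `max_len` cap are illustrative, not the paper's code):

```python
def predict_answer(start_logits, end_logits, tau, max_len=30):
    """SQuAD v2-style no-answer decision: return the best (start, end)
    span, or None when its score fails to beat the [CLS] no-answer
    score by tau."""
    s_null = start_logits[0] + end_logits[0]          # no-answer score
    best_score, best_span = float("-inf"), None
    for i in range(1, len(start_logits)):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span if best_score > s_null + tau else None
```

Raising $\tau$ makes the model abstain more often, trading answer recall for precision on unanswerable questions.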
Where It Breaks
NSP turned out to be weakly beneficial at best. RoBERTa (2019) showed that removing NSP entirely and training longer, with more data, on MLM alone improves performance on most benchmarks. NSP may not actually teach the model to understand inter-sentence relationships so much as teach it to distinguish random sentences from coherent ones — a much easier task than the intended one.
The 512-token context window is a hard ceiling. Documents longer than 512 WordPiece tokens require chunking, which severs long-range dependencies exactly where understanding matters most for tasks like summarization or document classification.
The [MASK] mismatch problem never fully disappears. Despite the 80/10/10 workaround, [MASK] tokens still appear during pre-training in a way they don’t during fine-tuning. This distributional gap has motivated alternatives like ELECTRA (replace token detection) and XLNet (permutation language modeling).
BERT’s encoder architecture makes it structurally unsuited for generation. It can read text bidirectionally; it cannot generate text autoregressively. Tasks requiring open-ended generation need decoder architectures or encoder-decoder hybrids (T5, BART).
Full fine-tuning means a separate copy of 110M or 340M parameters per task. At scale, this is a storage disaster — exactly the problem LoRA was designed to solve by adding small low-rank adapters instead of updating all weights.
Part 4: Practitioner Notes
When to use BERT vs. GPT-style models: if your task is classification, extraction, NER, or semantic similarity — tasks that need to read and understand a fixed input — BERT-style encoders are the right choice. If your task involves generation, completion, or chain-of-thought reasoning, you want a decoder.
For fine-tuning: use a learning rate between 2e-5 and 5e-5, batch size 16-32, and 2-4 epochs. The paper reports that fine-tuning BERT-Large is sometimes unstable on small datasets; multiple random restarts help. Select the best checkpoint on the dev set, not at the final epoch.
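That search space is small enough to enumerate exhaustively, one run per combination, picking the winner on the dev set:

```python
# Exhaustive fine-tuning grid over the ranges recommended above.
grid = [
    {"learning_rate": lr, "batch_size": bs, "epochs": ep}
    for lr in (5e-5, 3e-5, 2e-5)
    for bs in (16, 32)
    for ep in (2, 3, 4)
]
# 3 learning rates x 2 batch sizes x 3 epoch counts = 18 runs per task
```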
The pre-training corpus matters. BERT uses document-level corpora (BooksCorpus + Wikipedia) rather than shuffled sentence corpora. The paper is explicit: “It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark in order to extract long contiguous sequences.” MLM can only learn long-range dependencies if the training sequences are long.
If you need BERT in 2026, reach for a RoBERTa checkpoint instead. It removes NSP, trains longer, uses larger batches, and dynamically changes masking patterns across epochs — strictly better BERT with no architecture changes.
The tweetable version: mask 15% of tokens, train a bidirectional transformer to predict them from full context, fine-tune with one linear layer. That’s BERT. That’s also why it beat every benchmark in 2018 and defined NLP for the next three years.
Citation: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. https://arxiv.org/abs/1810.04805