Summary

Devlin et al. (2018) introduce BERT, a pretrained language representation model based on the Transformer encoder. Unlike GPT-style left-to-right language models, BERT conditions on both left and right context simultaneously in every layer, by training with two self-supervised objectives: Masked Language Modeling (MLM), where 15% of input tokens are selected for prediction (most replaced by a [MASK] token) and the model must reconstruct the originals, and Next Sentence Prediction (NSP), where the model predicts whether two text segments are consecutive. This bidirectional pretraining lets BERT build richer contextual representations than unidirectional models.
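The NSP training data can be constructed directly from a segmented corpus. A minimal sketch (the function name `make_nsp_pairs` and the list-of-lists corpus format are illustrative, not from the paper):

```python
import random

def make_nsp_pairs(docs, seed=0):
    """Build Next Sentence Prediction examples.

    `docs` is a list of documents, each a list of text segments.
    For each adjacent segment pair (A, B), emit (A, B, 1) half the
    time and (A, random_segment, 0) the other half, mirroring the
    50/50 IsNext/NotNext split used in BERT pretraining.
    """
    rng = random.Random(seed)
    all_segments = [seg for doc in docs for seg in doc]
    pairs = []
    for doc in docs:
        for a, b in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((a, b, 1))                          # true next segment
            else:
                pairs.append((a, rng.choice(all_segments), 0))   # random segment
    return pairs
```

In the real setup the negative segment is drawn from a different document; this toy version samples from the whole corpus, so an occasional negative may coincide with the true successor.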

Fine-tuning BERT on downstream tasks requires adding only one output layer on top of the pretrained model and training end-to-end on task-specific data. This simplicity combined with strong pretrained features produces state-of-the-art results across 11 NLP tasks. BERT-Large pushes the GLUE benchmark to 80.5% (7.7% absolute improvement over the prior state of the art), SQuAD v1.1 Test F1 to 93.2 (1.5 point improvement), and SQuAD v2.0 Test F1 to 83.1 (5.1 point improvement). BERT became the dominant NLP backbone from 2018–2020, spawning a family of variants (RoBERTa, ALBERT, DeBERTa) and establishing “pretrain then fine-tune” as the standard NLP workflow.
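The "one output layer" for classification tasks amounts to a single linear map plus softmax over the pretrained [CLS] vector. A minimal sketch in plain Python (dimensions and weights are toy values, not BERT's):

```python
import math

def classify_from_cls(cls_vec, W, b):
    """Task-specific classification head on the [CLS] representation.

    Computes logits = W @ cls_vec + b for each class row of W, then
    softmax. During fine-tuning, W and b are trained jointly with the
    pretrained encoder, end-to-end on task data.
    """
    logits = [sum(wi * xi for wi, xi in zip(row, cls_vec)) + bj
              for row, bj in zip(W, b)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For span extraction (SQuAD), the analogous heads are two vectors dotted against every token representation to score start and end positions.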

Key Claims

  • BERT-Large achieves 80.5% on GLUE — a 7.7% absolute improvement over prior state of the art.
  • SQuAD v1.1 Test F1 of 93.2 and SQuAD v2.0 Test F1 of 83.1, surpassing human performance on v1.1 (91.2 F1).
  • MultiNLI accuracy reaches 86.7% (4.6% absolute improvement).
  • Bidirectional pretraining outperforms both left-to-right and shallow concatenated left+right approaches across all tasks.
  • BERT-Base (110M parameters) already outperforms much larger task-specific architectures on most benchmarks.

Methods

BERT uses a Transformer encoder stack:

  • BERT-Base: 12 layers, d_model=768, 12 attention heads, 110M parameters.
  • BERT-Large: 24 layers, d_model=1024, 16 attention heads, 340M parameters.

Pretraining uses two objectives:

  • MLM — 15% of input tokens are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. The model predicts the original token at each selected position.
  • NSP — given a segment pair (A, B), the model predicts from the [CLS] representation a binary label for whether B follows A.

Input representations are the sum of token embeddings, segment embeddings (A/B), and learned positional embeddings. Fine-tuning feeds task inputs as token sequences and uses the [CLS] embedding (classification) or per-token representations (span extraction), depending on the task type.
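The MLM corruption step can be sketched as follows. This operates on token strings with a toy stand-in vocabulary for readability; a real implementation works over WordPiece ids:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "runs", "sat", "the"]  # illustrative stand-in vocabulary

def mlm_corrupt(tokens, mask_rate=0.15, seed=0):
    """Corrupt a token sequence the way BERT's MLM pretraining does.

    15% of positions are selected; of those, 80% become [MASK], 10%
    a random vocabulary token, 10% stay unchanged. Returns the
    corrupted sequence and the selected positions, whose original
    tokens are the prediction targets.
    """
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                   # 80%: mask
        elif r < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: random token
        # else: 10% keep the original (the model still predicts it)
    return corrupted, sorted(positions)
```

The 10% random / 10% unchanged cases exist precisely to soften the train–fine-tune mismatch noted under failure modes: the model cannot rely on [MASK] always marking the prediction targets.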

Failure Modes

  • The NSP objective was later shown by RoBERTa's ablations to be weakly beneficial or even harmful; removing it and training MLM longer on more data improves performance.
  • BERT’s fixed 512-token context window limits applicability to long documents.
  • MLM pretraining creates a train-test discrepancy: [MASK] tokens appear during pretraining but not fine-tuning.
  • BERT's encoder-only architecture is not suited to autoregressive generation tasks.
  • Requires full model fine-tuning per task, which is compute- and storage-intensive at scale (addressed later by LoRA and other PEFT methods).

Connections

  • GPT-style left-to-right language models: the unidirectional baseline BERT is contrasted against.
  • RoBERTa, ALBERT, DeBERTa: successor variants; RoBERTa in particular drops NSP and trains on more data.
  • LoRA and other PEFT methods: later work addressing the cost of full per-task fine-tuning.

Citation

arXiv:1810.04805

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. https://arxiv.org/abs/1810.04805