Summary

Devlin et al. (2018) introduce BERT, a pretrained language representation model based on the Transformer encoder. Unlike GPT-style left-to-right language models, BERT conditions on both left and right context simultaneously in every layer, by training with two self-supervised objectives: Masked Language Modeling (MLM), where 15% of input tokens are selected for prediction (most replaced by a [MASK] token) and the model must reconstruct the originals, and Next Sentence Prediction (NSP), where the model predicts whether two text segments are consecutive. This bidirectional pretraining lets BERT build richer contextual representations than unidirectional models.
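The NSP training data can be constructed directly from a segmented corpus. A minimal sketch (the function name `make_nsp_pairs` and the list-of-lists corpus format are illustrative, not from the paper):

```python
import random

def make_nsp_pairs(docs, seed=0):
    """Build Next Sentence Prediction examples.

    `docs` is a list of documents, each a list of text segments.
    For each adjacent segment pair (A, B), emit (A, B, 1) half the
    time and (A, random_segment, 0) the other half, mirroring the
    50/50 IsNext/NotNext split used in BERT pretraining.
    """
    rng = random.Random(seed)
    all_segments = [seg for doc in docs for seg in doc]
    pairs = []
    for doc in docs:
        for a, b in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((a, b, 1))                          # true next segment
            else:
                pairs.append((a, rng.choice(all_segments), 0))   # random segment
    return pairs
```

In the real setup the negative segment is drawn from a different document; this toy version samples from the whole corpus, so an occasional negative may coincide with the true successor.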

Fine-tuning BERT on downstream tasks requires adding only one output layer on top of the pretrained model and training end-to-end on task-specific data. This simplicity combined with strong pretrained features produces state-of-the-art results across 11 NLP tasks. BERT-Large pushes the GLUE benchmark to 80.5% (7.7% absolute improvement over the prior state of the art), SQuAD v1.1 Test F1 to 93.2 (1.5 point improvement), and SQuAD v2.0 Test F1 to 83.1 (5.1 point improvement). BERT became the dominant NLP backbone from 2018–2020, spawning a family of variants (RoBERTa, ALBERT, DeBERTa) and establishing “pretrain then fine-tune” as the standard NLP workflow.
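The "one output layer" for classification tasks amounts to a single linear map plus softmax over the pretrained [CLS] vector. A minimal sketch in plain Python (dimensions and weights are toy values, not BERT's):

```python
import math

def classify_from_cls(cls_vec, W, b):
    """Task-specific classification head on the [CLS] representation.

    Computes logits = W @ cls_vec + b for each class row of W, then
    softmax. During fine-tuning, W and b are trained jointly with the
    pretrained encoder, end-to-end on task data.
    """
    logits = [sum(wi * xi for wi, xi in zip(row, cls_vec)) + bj
              for row, bj in zip(W, b)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

For span extraction (SQuAD), the analogous heads are two vectors dotted against every token representation to score start and end positions.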

Key Claims

  • BERT-Large achieves 80.5% on GLUE — a 7.7% absolute improvement over prior state of the art.
  • SQuAD v1.1 Test F1 of 93.2 and SQuAD v2.0 Test F1 of 83.1, surpassing human performance on v1.1 (91.2 F1).
  • MultiNLI accuracy reaches 86.7% (4.6% absolute improvement).
  • Bidirectional pretraining outperforms both left-to-right and shallow concatenated left+right approaches across all tasks.
  • BERT-Base (110M parameters) already outperforms much larger task-specific architectures on most benchmarks.

Methods

BERT uses a Transformer encoder stack:

  • BERT-Base: 12 layers, d_model=768, 12 attention heads, 110M parameters.
  • BERT-Large: 24 layers, d_model=1024, 16 attention heads, 340M parameters.

Pretraining uses two objectives:

  • MLM — 15% of input tokens are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. The model predicts the original token at each selected position.
  • NSP — given a segment pair (A, B), the model predicts from the [CLS] representation a binary label for whether B follows A.

Input representations are the sum of token embeddings, segment embeddings (A/B), and learned positional embeddings. Fine-tuning feeds task inputs as token sequences and uses the [CLS] embedding (classification) or per-token representations (span extraction), depending on the task type.
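The MLM corruption step can be sketched as follows. This operates on token strings with a toy stand-in vocabulary for readability; a real implementation works over WordPiece ids:

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "runs", "sat", "the"]  # illustrative stand-in vocabulary

def mlm_corrupt(tokens, mask_rate=0.15, seed=0):
    """Corrupt a token sequence the way BERT's MLM pretraining does.

    15% of positions are selected; of those, 80% become [MASK], 10%
    a random vocabulary token, 10% stay unchanged. Returns the
    corrupted sequence and the selected positions, whose original
    tokens are the prediction targets.
    """
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK                   # 80%: mask
        elif r < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: random token
        # else: 10% keep the original (the model still predicts it)
    return corrupted, sorted(positions)
```

The 10% random / 10% unchanged cases exist precisely to soften the train–fine-tune mismatch noted under failure modes: the model cannot rely on [MASK] always marking the prediction targets.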

Failure Modes

  • The NSP objective was later shown by RoBERTa's ablations to be weakly beneficial or even harmful; removing it and training MLM longer on more data improves performance.
  • BERT’s fixed 512-token context window limits applicability to long documents.
  • MLM pretraining creates a train-test discrepancy: [MASK] tokens appear during pretraining but not fine-tuning.
  • BERT's encoder-only architecture is not suited to autoregressive generation tasks.
  • Requires full model fine-tuning per task, which is compute- and storage-intensive at scale (addressed later by LoRA and other PEFT methods).

Connections

  • GPT-style left-to-right language models: the unidirectional baseline BERT is contrasted against.
  • RoBERTa, ALBERT, DeBERTa: successor variants; RoBERTa in particular drops NSP and trains on more data.
  • LoRA and other PEFT methods: later work addressing the cost of full per-task fine-tuning.

Citation

arXiv:1810.04805

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. https://arxiv.org/abs/1810.04805