Summary

Brown et al. (2020) introduce GPT-3, an autoregressive language model with 175 billion parameters trained on ~300 billion tokens from a filtered Common Crawl corpus plus Books, Wikipedia, and other sources. The central finding is that sufficiently large language models can perform new tasks from just a few demonstrations provided in the prompt — without any gradient updates or fine-tuning. This in-context learning ability scales strongly with model size and was largely absent in smaller predecessors like GPT-2.

The paper evaluates GPT-3 under three regimes: zero-shot (task description only), one-shot (one demonstration), and few-shot (up to ~100 demonstrations in the context window). On many benchmarks, few-shot GPT-3 matches or approaches fine-tuned models trained specifically on those tasks. For example, it achieves 71.2% on TriviaQA (few-shot), 86.4% on LAMBADA (few-shot), and generates news articles that human evaluators cannot reliably distinguish from human-written text. The paper also identifies important limitations: GPT-3 still fails on tasks requiring formal reasoning or strict symbolic manipulation, and overlap between its training data and benchmark test sets creates contamination issues that are difficult to quantify.
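The three regimes differ only in how many solved demonstrations are packed into the prompt before the query. A minimal sketch of that prompt assembly (the `Q:`/`A:` separator format here is illustrative, not the paper's verbatim template):

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt: a task description, K solved demonstrations,
    then the unsolved query. K=0 is zero-shot, K=1 one-shot, larger K
    few-shot; the model's weights are never updated."""
    lines = [task_description]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")  # model completes from here
    return "\n\n".join(lines)

# Zero-shot: description only; few-shot: K worked examples in context.
zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
```

The only "learning" happens at inference time: the model conditions on the demonstrations autoregressively, which is why the same frozen weights serve every task.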

GPT-3 catalyzed the modern LLM era by demonstrating that a single large pretrained model with no task-specific tuning could generalize broadly. It shifted the paradigm from “pretrain then fine-tune” toward “pretrain then prompt,” and its release as an API made large-scale NLP research accessible outside major labs.

Key Claims

  • GPT-3 has 175B parameters — 10× larger than any prior non-sparse language model at the time.
  • Few-shot GPT-3 achieves 71.2% on TriviaQA, 85.0 F1 on CoQA, and 86.4% accuracy on the LAMBADA completion task.
  • On SuperGLUE, few-shot GPT-3 scores 71.8, approaching fine-tuned BERT-Large (69.0) with zero gradient updates.
  • GPT-3 achieves 65.2% on SAT analogy questions (few-shot), versus ~57% for the average college applicant.
  • News article generation quality is near-indistinguishable from human text (human detection accuracy 52%, near chance level).

Methods

GPT-3 is a decoder-only Transformer with 96 layers, d_model=12288, 96 attention heads, and a context window of 2048 tokens. It uses the GPT-2 architecture scaled up, with the addition of alternating dense and locally banded sparse attention patterns. Training uses the Adam optimizer with a cosine learning rate schedule on a ~300B-token corpus: filtered Common Crawl (60%), WebText2 (22%), Books1 (8%), Books2 (8%), Wikipedia (3%). Datasets are not sampled in proportion to their size; higher-quality sources are weighted more heavily, so Wikipedia and the books corpora are seen more than once during training while Common Crawl is seen less than once. In-context learning works by prepending task demonstrations directly into the input context — the model conditions on these examples autoregressively without any weight updates. The paper systematically varies the number of in-context examples from 0 to ~100 to study few-shot scaling.
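The quoted 175B figure can be sanity-checked from the stated shape with the standard decoder-only parameter count (~12·d² weights per layer across attention and MLP matrices); the vocabulary size 50257 is GPT-2/GPT-3's BPE vocabulary, and biases and layer norms are ignored as negligible:

```python
# Back-of-the-envelope parameter count for the stated GPT-3 shape.
n_layers, d_model, vocab = 96, 12288, 50257

# Per layer: ~4*d^2 for Q/K/V/output projections + ~8*d^2 for the
# 4x-wide MLP (up- and down-projection) = 12*d^2 weights.
per_layer = 12 * d_model**2
embedding = vocab * d_model  # token embedding (tied with the output head)

total = n_layers * per_layer + embedding
print(f"{total / 1e9:.1f}B parameters")  # ≈ 174.6B, i.e. the quoted 175B
```

This omits positional embeddings and biases, but those contribute well under 0.1% of the total, which is why the estimate lands on the headline number.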

Failure modes

  • GPT-3 struggles on tasks requiring complex logical reasoning, arithmetic, or compositional generalization (e.g., ANLI adversarial NLI, WinoGrande).
  • Fine-tuned smaller models often outperform few-shot GPT-3 on structured prediction tasks, showing in-context learning is not a universal replacement for fine-tuning.
  • Training data contamination from benchmark test sets is present and hard to fully quantify; some results may be inflated.
  • The model is expensive to serve: at 175B parameters, even inference requires multiple A100-class GPUs.
  • No instruction following or RLHF — GPT-3 base requires careful prompt engineering and often produces off-topic or biased continuations.

Connections

Citation

arXiv:2005.14165

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165