Summary

Brown et al. (2020) introduce GPT-3, an autoregressive language model with 175 billion parameters trained on ~300 billion tokens from a filtered Common Crawl corpus plus Books, Wikipedia, and other sources. The central finding is that sufficiently large language models can perform new tasks from just a few demonstrations provided in the prompt — without any gradient updates or fine-tuning. This in-context learning ability scales strongly with model size and was largely absent in smaller predecessors like GPT-2.

The paper evaluates GPT-3 under three regimes: zero-shot (task description only), one-shot (one demonstration), and few-shot (up to ~100 demonstrations in the context window). On many benchmarks, few-shot GPT-3 matches or approaches fine-tuned models trained specifically on those tasks. For example, it achieves 71.2% on TriviaQA (few-shot), 86.4% on LAMBADA (few-shot), and generates news articles that human evaluators cannot reliably distinguish from human-written text. The paper also identifies important limitations: GPT-3 still fails on tasks requiring formal reasoning or strict symbolic manipulation, and overlap between its training data and benchmark test sets creates contamination issues that are difficult to quantify.
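The three regimes differ only in how many solved demonstrations are packed into the prompt before the query. A minimal sketch of that prompt assembly (the `Q:`/`A:` separator format here is illustrative, not the paper's verbatim template):

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble a prompt: a task description, K solved demonstrations,
    then the unsolved query. K=0 is zero-shot, K=1 one-shot, larger K
    few-shot; the model's weights are never updated."""
    lines = [task_description]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")  # model completes from here
    return "\n\n".join(lines)

# Zero-shot: description only; few-shot: K worked examples in context.
zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
```

The only "learning" happens at inference time: the model conditions on the demonstrations autoregressively, which is why the same frozen weights serve every task.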

GPT-3 catalyzed the modern LLM era by demonstrating that a single large pretrained model with no task-specific tuning could generalize broadly. It shifted the paradigm from “pretrain then fine-tune” toward “pretrain then prompt,” and its release as an API made large-scale NLP research accessible outside major labs.

Key Claims

  • GPT-3 has 175B parameters — 10× larger than any prior non-sparse language model at the time.
  • Few-shot GPT-3 achieves 71.2% on TriviaQA, 85.0 F1 on CoQA, and 86.4% accuracy on the LAMBADA completion task.
  • On SuperGLUE, few-shot GPT-3 scores 71.8, approaching fine-tuned BERT-Large (69.0) with zero gradient updates.
  • GPT-3 achieves 65.2% on SAT analogy questions (few-shot), versus ~57% for the average college applicant.
  • News article generation quality is near-indistinguishable from human text (human detection accuracy 52%, near chance level).

Methods

GPT-3 is a decoder-only Transformer with 96 layers, d_model=12288, 96 attention heads, and a context window of 2048 tokens. It uses the GPT-2 architecture scaled up, with the addition of alternating dense and locally banded sparse attention patterns. Training uses the Adam optimizer with a cosine learning rate schedule on a ~300B-token corpus: filtered Common Crawl (60%), WebText2 (22%), Books1 (8%), Books2 (8%), Wikipedia (3%). Datasets are not sampled in proportion to their size; higher-quality sources are weighted more heavily, so Wikipedia and the books corpora are seen more than once during training while Common Crawl is seen less than once. In-context learning works by prepending task demonstrations directly into the input context — the model conditions on these examples autoregressively without any weight updates. The paper systematically varies the number of in-context examples from 0 to ~100 to study few-shot scaling.
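The quoted 175B figure can be sanity-checked from the stated shape with the standard decoder-only parameter count (~12·d² weights per layer across attention and MLP matrices); the vocabulary size 50257 is GPT-2/GPT-3's BPE vocabulary, and biases and layer norms are ignored as negligible:

```python
# Back-of-the-envelope parameter count for the stated GPT-3 shape.
n_layers, d_model, vocab = 96, 12288, 50257

# Per layer: ~4*d^2 for Q/K/V/output projections + ~8*d^2 for the
# 4x-wide MLP (up- and down-projection) = 12*d^2 weights.
per_layer = 12 * d_model**2
embedding = vocab * d_model  # token embedding (tied with the output head)

total = n_layers * per_layer + embedding
print(f"{total / 1e9:.1f}B parameters")  # ≈ 174.6B, i.e. the quoted 175B
```

This omits positional embeddings and biases, but those contribute well under 0.1% of the total, which is why the estimate lands on the headline number.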

Failure modes

  • GPT-3 struggles on tasks requiring complex logical reasoning, arithmetic, or compositional generalization (e.g., ANLI adversarial NLI, WinoGrande).
  • Fine-tuned smaller models often outperform few-shot GPT-3 on structured prediction tasks, showing in-context learning is not a universal replacement for fine-tuning.
  • Training data contamination from benchmark test sets is present and hard to fully quantify; some results may be inflated.
  • The model is expensive to serve: at 175B parameters, even inference requires multiple A100-class GPUs.
  • No instruction following or RLHF — GPT-3 base requires careful prompt engineering and often produces off-topic or biased continuations.

Connections

Citation

arXiv:2005.14165

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165