The problem
Kaplan et al.’s scaling laws gave the field a clean story: more compute = predictably better performance. Log-log line, extrapolate, done. But then GPT-3 showed up and could suddenly do arithmetic — something smaller GPT models flat-out couldn’t. Not worse at arithmetic — incapable of it. How do you explain a capability that doesn’t exist at all and then exists? That breaks the extrapolation. The authors call this emergence.
Core idea
Phase transitions, not gradients
Think about water. You heat ice slowly, one degree at a time. Nothing interesting happens for a long time; it just gets warmer ice. Then, at exactly 0°C, something qualitatively different happens: it becomes liquid. The heating was continuous; the behavior change was discontinuous. The same thing happens at 100°C: water to steam.
GPT’s arithmetic ability works exactly like this. It doesn’t get incrementally worse at arithmetic and then gradually better. It is incapable — and then it can do it. The authors put it precisely:
“An ability is emergent if it is not present in smaller models but is present in larger models. Emergent abilities would not have been directly predicted by extrapolating a scaling law (i.e. consistent performance improvements) from small-scale models.”
The philosophical framing comes from physics. The paper borrows from Anderson’s 1972 “More Is Different”: emergence is when quantitative changes in a system produce qualitative changes in behavior. You can count atoms; you can’t predict consciousness from the count. Same principle.
Why does it happen?
There’s no fully satisfying mechanistic answer — the paper is honest about this: “there are currently few compelling explanations for why such abilities emerge in the way they do.” But here’s an intuitive sketch.
Some tasks require multiple sub-skills to all be co-present. 3-digit multiplication needs: place-value understanding, carrying rules, multiplication tables, and multi-step tracking. At small scale, none of these sub-skills reaches sufficient quality individually. At large scale, all of them cross a competence threshold simultaneously — and the composite capability snaps into existence. No single sub-skill was the bottleneck; they all were.
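This sub-skill story can be made concrete with a toy simulation. Each sub-skill below improves smoothly (a sigmoid in log-scale), yet the composite task, which needs all of them at once, stays near zero until every curve has crossed its threshold. The midpoints and steepness are invented for illustration, not fit to any real model.

```python
import math

def subskill(scale, midpoint, steepness=3.0):
    """Probability that one sub-skill fires correctly: a smooth
    sigmoid in log10(scale). Each sub-skill improves gradually."""
    return 1.0 / (1.0 + math.exp(-steepness * (math.log10(scale) - midpoint)))

# Hypothetical sub-skills for 3-digit multiplication, each with its own
# made-up midpoint on a log-FLOPs axis.
midpoints = [21.5, 21.8, 22.0, 22.2]  # place value, carrying, tables, tracking

def composite(scale):
    """The task succeeds only if EVERY sub-skill succeeds, so the
    composite probability is the product of the smooth sigmoids."""
    p = 1.0
    for m in midpoints:
        p *= subskill(scale, m)
    return p

for exp in range(20, 25):
    print(f"10^{exp} FLOPs: composite success = {composite(10 ** exp):.3f}")
```

Every individual curve is smooth, but the product is near zero, then snaps upward: a phase transition out of gradual parts.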
The paper uses the term “phase transition” in a technical sense:
“performance is near-random until a certain critical threshold of scale is reached, after which performance increases to substantially above random. This qualitative change is also known as a phase transition”
The shape of the curve
Here’s what the scaling curve looks like for emergent abilities (versus normal smooth improvement):
Performance
|
|                          /-----
|                         /
|                        /
|_ _ _ (random) _ _ _ _ _/
+--------------------------------> Model scale (FLOPs)
     10^20      10^22      10^24
For a non-emergent capability, you’d see a diagonal line from the start. For emergent ones, you get a flat line at random performance — indistinguishable from noise — then a sharp bend upward past a threshold. The bend is the phase transition.
Real numbers
Let’s walk through the actual data.
MMLU benchmark (57 academic subjects: math, law, medicine, history, physics): this is a 4-way multiple-choice test, so random guessing = 25%.
| Model | Parameters | MMLU score |
|---|---|---|
| GPT-3 | 7B | ~25% (random) |
| GPT-3 | 13B | ~26% (barely above random) |
| Gopher | 70B | ~35% |
| Chinchilla | 70B | ~47% |
| Gopher | 280B | ~60% |
From 13B to 70B — roughly a 5x increase in parameters — the model crosses the line from “basically guessing” to “actually knows things.” Below that threshold, years of scaling produced nothing measurable. Above it, every additional compute dollar buys real performance.
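To see how badly extrapolation fails here, fit a line through the two pre-threshold points in the table and project it out to 280B. This is a deliberately naive log-linear fit over the approximate scores above, purely to illustrate the point.

```python
import math

# (params, approximate MMLU %) for the two pre-threshold models above.
small = [(7e9, 25.0), (13e9, 26.0)]
x = [math.log10(p) for p, _ in small]
y = [s for _, s in small]

# Two points define the line exactly.
slope = (y[1] - y[0]) / (x[1] - x[0])
intercept = y[0] - slope * x[0]

predicted_280b = slope * math.log10(280e9) + intercept
print(f"Extrapolated MMLU at 280B: {predicted_280b:.1f}%")  # ~31%
print("Actual Gopher 280B:        ~60%")
```

The trend line predicts a model still hovering near chance at 280B; the real model roughly doubles that. That gap is the whole argument of the paper in two numbers.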
Word in Context (WiC) is even starker: the task requires disambiguating word meanings in context. GPT-3 at 175B parameters (3×10²³ FLOPs) scores at random. The ability only appears at PaLM 540B (2.5×10²⁴ FLOPs) — nearly 10x more compute than GPT-3. If you’d evaluated the GPT-3 family and extrapolated, you’d have concluded WiC was an unsolvable task.
TruthfulQA: near-random for GPT-3 at every scale and for Gopher below its largest size. Only Gopher 280B shows a clear above-random signal; the threshold sits at the very top of the compute budgets tested at the time of writing.
3-digit arithmetic (GPT-3): near-random below 13B parameters (2.3×10²² FLOPs). Jumps to well above random at 13B. The capability essentially didn’t exist, then it did.
Chain-of-thought is emergent too
Chain-of-thought prompting — showing the model intermediate reasoning steps — is itself subject to emergence. This is counterintuitive and practically important.
At 8B parameters (LaMDA), adding chain-of-thought hurts GSM8K accuracy by about 2%. The model tries to follow the reasoning format, produces incoherent intermediate steps, and ends up wrong more often than if it just answered directly. At ~68B parameters, CoT becomes net-positive. Above ~100B parameters (10²³ FLOPs), it’s essential — the difference between adequate and excellent.
If you evaluated chain-of-thought on a 7B model and got negative results, you’d be right — but you’d be wrong to generalize that result to 70B models.
The same pattern holds for instruction following. Fine-tuning on instruction datasets (FLAN) hurts performance below ~8B parameters. Above ~68B, it’s the foundation of the modern assistant model.
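For concreteness, here is a sketch of the two prompt formats being compared. The exemplar problem and its worked answer are invented for illustration, not taken from the paper.

```python
# Standard prompting asks for the answer directly; chain-of-thought
# prompting prepends a worked exemplar whose answer spells out the
# intermediate steps, nudging the model to imitate that format.
question = ("Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
            "How many balls does he have now?")

standard_prompt = (
    "Q: " + question + "\n"
    "A:"
)

cot_prompt = (
    "Q: A juggler has 16 balls. Half of them are golf balls. "
    "How many golf balls are there?\n"
    "A: Half of 16 is 16 / 2 = 8. The answer is 8.\n\n"
    "Q: " + question + "\n"
    "A:"
)
```

Below the threshold, a model imitates the step-by-step format without the steps being coherent, which is how chain-of-thought ends up hurting accuracy; above the threshold, the identical format helps.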
What’s actually clever about this paper
The contribution isn’t the capabilities themselves — it’s documenting the discontinuity. Scaling laws were a framework that said “we can predict the future from the present by extrapolating the trend.” This paper shows the framework breaks down for qualitative capabilities. The implication is sharp: you cannot evaluate whether a model family has a given ability by testing small models from that family.
The paper also notes a deeply uncomfortable symmetry:
“risks could also emerge” — emergent risks parallel emergent capabilities (bias, toxicity, memorization increase with scale too)
The same phase-transition logic that gives you arithmetic at 13B might give you convincing misinformation generation at 70B, or a previously-absent failure mode at 540B. The knife cuts both ways.
Results
| Task | Below threshold | Threshold | Above threshold |
|---|---|---|---|
| 3-digit arithmetic | ~0% (GPT-3 <13B) | 13B params (2.3×10²² FLOPs) | Well above random |
| MMLU benchmark | ~25% random (<10B) | 70B–280B | ~60% (Gopher 280B) |
| Chain-of-thought | Hurts (<68B LaMDA) | ~68B–100B | Essential (>100B) |
| Word in Context | Random even at GPT-3 175B | PaLM 540B (2.5×10²⁴ FLOPs) | Above random |
| TruthfulQA | Near-random across GPT-3, Gopher <280B | ~280B params (Gopher) | Above random (Gopher 280B only) |
What doesn’t work (or what’s missing)
(a) No mechanistic explanation. The paper documents emergence extensively but cannot explain why the phase transition happens at a particular scale. This is a genuine gap — the paper is honest about it, and it remains an open problem in interpretability research.
(b) Emergent risks. Toxicity, memorization, and bias also scale. This means the same dynamic that makes models unexpectedly capable at 70B might make them unexpectedly unsafe in ways that weren’t visible during safety evaluation at 7B.
(c) Possible metric artifact. Exact-match scoring gives zero credit for “almost right” multi-step solutions. Cross-entropy loss (the training signal) actually improves smoothly even when accuracy looks flat. The sharp apparent jump in accuracy might partly be an artifact of the binary scoring, not a true discontinuity in model capability.
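The metric-artifact point is easy to demonstrate numerically. Suppose per-token correctness improves as a smooth sigmoid in log-compute (a made-up curve standing in for smoothly improving cross-entropy). Exact-match on an 8-token answer is that probability raised to the 8th power, which looks flat and then jumps:

```python
import math

def token_accuracy(scale):
    """Per-token correctness, improving smoothly with log-scale.
    The sigmoid is invented; it stands in for smooth loss gains."""
    return 1.0 / (1.0 + math.exp(-2.0 * (math.log10(scale) - 22.0)))

ANSWER_TOKENS = 8  # a multi-step answer that exact-match scores all-or-nothing

for exp in (20, 21, 22, 23, 24):
    p = token_accuracy(10 ** exp)
    exact = p ** ANSWER_TOKENS  # every token must be right to get credit
    print(f"10^{exp}: per-token={p:.2f}  exact-match={exact:.3f}")
```

The per-token curve climbs steadily the whole way; the exact-match curve is indistinguishable from zero and then leaps. Same underlying capability, two very different-looking plots, which is exactly the artifact worry.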
(d) Scale isn’t the only knob. PaLM 62B outperforms both LaMDA 137B and GPT-3 175B on 14 BIG-Bench tasks — a smaller model beating models two to three times its size. Data quality, architecture design, and training objectives can shift the emergence threshold substantially.
(e) Distribution matters. Some abilities may never emerge if they’re fundamentally out of training distribution. Emergence happens within the space of what the training data can support.
So what
For practitioners: If you’re evaluating whether a model family can handle complex reasoning, multi-step arithmetic, or instruction-following — don’t test at 7B and extrapolate. The behavior isn’t there yet. The practical implication: for tasks requiring multi-step reasoning, you either go big (70B+) or you redesign the task to not require emergence.
For the broader field: This paper is why the community shifted from “scaling is predictable” to “we don’t fully know what will appear next.” It’s part of why alignment research became urgent: if capabilities can appear suddenly and unpredictably, waiting until they appear to think about alignment is too late.
Tweetable: GPT-3 can do arithmetic; GPT-2 can’t — not worse, cannot. Scaling creates entirely new capabilities that smaller models don’t hint at. This paper names and documents that phenomenon.
Connections
- transformer — the substrate on which all this emergence happens
- scaling-laws — emergence challenges the simple extrapolation view from scaling laws
- lora — fine-tuning builds on emergent base capabilities
Citation
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … & Fedus, W. (2022). Emergent abilities of large language models. arXiv:2206.07682.