Are emergent abilities real or a metric artifact?

Why this matters

If emergent abilities are real — capabilities that genuinely appear sharply at scale — then scaling is qualitatively unpredictable, and small-model evals don’t tell you what frontier models can do. If they’re metric artifacts — produced by thresholded or all-or-nothing scoring on capabilities that improve smoothly underneath — then scaling is predictable, and the “emergence” framing has misled the field’s intuition about safety, capability forecasting, and the value of scale itself.

Current best understanding

(2026-04-28) The metric-artifact hypothesis (Schaeffer et al., 2023) is well-supported for the original claimed examples. When the same capabilities are scored with continuous metrics (e.g. token-level edit distance instead of exact match), the curves smooth out. But this doesn’t mean nothing emerges — it means the early evidence was overstated, and the strong “phase change” framing needs a real existence proof.
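A minimal sketch of how an all-or-nothing metric can manufacture a sharp curve. It assumes per-token accuracy is a smooth logistic in log parameter count and that token errors are independent. The logistic midpoint, the answer length, and the function names are hypothetical choices for illustration, not values from Schaeffer et al.

```python
import math

def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth curve: a logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))

def exact_match(p_token: float, answer_len: int) -> float:
    """Probability that every token is correct, assuming independent errors."""
    return p_token ** answer_len

ANSWER_LEN = 10  # hypothetical answer length in tokens
scales = [7.0, 8.0, 9.0, 10.0, 11.0]  # log10(parameter count)

# Each row is (log10 params, per-token accuracy, exact-match accuracy).
rows = []
for s in scales:
    p = per_token_accuracy(s)
    rows.append((s, p, exact_match(p, ANSWER_LEN)))

for s, p, em in rows:
    print(f"10^{s:.0f} params  per-token={p:.3f}  exact-match={em:.4f}")
```

Under these assumptions exact match stays near zero until per-token accuracy is already high, while the per-token curve itself has no kink anywhere.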

The harder question is whether some capabilities (beyond the original examples in Wei et al., 2022) are genuinely discontinuous at scale. Chain-of-thought benefits look closer to genuine emergence: very small models gain nothing from CoT prompting, while large models gain a lot. Whether that is a true phase change or a smooth improvement that only becomes visible once base capability crosses a threshold is unresolved.

Evidence

emergent-world-representations-othello-gpt — (2026-05-03) Li et al. 2023. GPT trained only on Othello moves develops a causal internal world model of the board — not scale-threshold emergence, but evidence that sequence models build genuine internal structure rather than surface statistics. Addresses the prior question: can something real emerge from next-token prediction?
emergent-abilities-of-large-language-models — Wei et al. 2022, the original claim. Many examples (BIG-Bench tasks, multi-step arithmetic) showing sharp transitions.
emergent-abilities — Aggregates the debate.
phase-transition — The theoretical framing (a true phase change vs. a smooth process under a thresholded metric).
grokking-generalization-beyond-overfitting / grokking-systematic-empirical-study — Grokking is a training-time discontinuity, not a scale-time one — but it’s evidence that genuine non-monotonic transitions exist in NN training.
chain-of-thought-prompting — CoT benefits are scale-dependent in a way that looks emergent in the original sense.
training-compute-optimal-large-language-models — (2026-05-04) Chinchilla; Hoffmann et al. 2022. Pretraining loss follows a smooth power law L(N,D) = E + A/N^α + B/D^β across 400+ training runs spanning 70M to 16B parameters. Supports the metric-artifact hypothesis: if the underlying capability signal is smooth, task-level discontinuities are more likely a property of thresholded evaluation metrics than genuine phase transitions.
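The parametric form in the Chinchilla entry above can be evaluated directly. This is a sketch, not the authors' code: the default constants are the values commonly quoted from the paper's fit and should be treated as assumptions to check against the paper, and the 20-tokens-per-parameter budget is a rule-of-thumb assumption used only to pick D for each N.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta.

    Default constants are assumed values, not verified against the paper.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

TOKENS_PER_PARAM = 20  # assumed rule-of-thumb data budget
model_sizes = [70e6, 1e9, 16e9]  # range of sizes named in the entry above

losses = [chinchilla_loss(n, TOKENS_PER_PARAM * n) for n in model_sizes]

for n, loss in zip(model_sizes, losses):
    print(f"N={n:.1e}  D={TOKENS_PER_PARAM * n:.1e}  L={loss:.3f}")
```

Because both correction terms are power laws in N and D, the predicted loss falls monotonically toward E with no threshold in the curve itself.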

What would settle it

A capability that remains discontinuous in scale under every reasonable continuous metric, replicated across model families.
A theoretical account predicting which capabilities should and shouldn’t emerge sharply, validated against new scaling runs.
Mechanistic interpretability finding distinct circuits that “click into place” at threshold scale — direct evidence of a genuine phase change, not a metric artifact.
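The first criterion above needs some working notion of "discontinuous under a continuous metric". One crude, hypothetical diagnostic (not taken from any of the cited papers) is the share of the total change in a metric that falls between two adjacent model scales. The function name and the example scores are made up for illustration.

```python
def largest_jump_fraction(scores: list[float]) -> float:
    """Fraction of the total score range covered by the largest single step.

    `scores` are metric values ordered by increasing model scale.
    Values near 1/(len(scores) - 1) suggest even, gradual change;
    values near 1.0 mean almost all change happens between two
    adjacent scales.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scales")
    total = max(scores) - min(scores)
    if total == 0:
        return 0.0
    jumps = [abs(b - a) for a, b in zip(scores, scores[1:])]
    return max(jumps) / total

# Made-up scores at five increasing model scales.
smooth = [0.1, 0.2, 0.3, 0.4, 0.5]
step = [0.0, 0.0, 0.0, 1.0, 1.0]

print(largest_jump_fraction(smooth))
print(largest_jump_fraction(step))
```

A high value on its own proves little, since coarse sampling of scales makes any steep but smooth curve look like a step. That is why the criterion also asks for replication across model families.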
