What It Is

A phase transition in ML is a sudden, qualitative change in model behavior at a threshold — as opposed to gradual improvement. Below the threshold: near-random performance. Above: competence. The transition is sharp enough that you can’t predict when it will happen from measurements taken before it occurs. The term borrows from physics, where phase transitions (liquid→gas, paramagnet→ferromagnet) produce discontinuous changes in macroscopic properties despite continuous changes in underlying variables.

Why It Matters

Phase transitions make capability prediction unreliable. Smooth extrapolation from small scale tells you nothing about when the next jump happens. This is practically important — resource planning, safety evaluation, and capability forecasting all assume some continuity in scaling behavior. A model that performs at chance on a benchmark at 100B parameters may jump to 80% accuracy at 101B. This is both a research planning challenge and a safety concern: dangerous capabilities could emerge without warning from continued scaling.

Examples in Practice

Emergent Abilities at Scale

Wei et al. (2022) documented dozens of capabilities in LLMs that appear suddenly above scale thresholds. Examples:

  • 3-digit addition: Near-random for models <6B parameters; jumps to >80% accuracy above ~10B.
  • Chain-of-thought reasoning: Hurts small models, dramatically helps large ones. The crossover is a phase transition in how the model uses chain of thought.
  • In-context learning: Few-shot prompting with K examples degrades small models (the extra context confuses them) and helps large models (they extract the task pattern). The sign of the effect flips at scale.
Capability (% correct)

100% │                                  ████████████
     │                              ████
 50% │                          ████
     │ ________________________██
  0% │
     └────────────────────────────────────────────
       1B           10B           100B   model parameters
                          ↑
       Phase transition: capability jumps discontinuously

Grokking

A model memorizes its training data (zero training loss, poor test accuracy), then suddenly generalizes long after training would normally have stopped — sometimes orders of magnitude more steps after the training set was first fit. The transition from memorization to generalization is abrupt. Below the threshold, the model has “overfit.” Above it, the model has discovered the underlying structure.

Key insight: grokking is a phase transition in the complexity of the learned function. Regularization (weight decay) provides the pressure that forces the model past the transition — it becomes cheaper to generalize than to memorize once regularization is strong enough and training continues long enough.
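The “cheaper to generalize than to memorize” claim can be made concrete with a toy cost model. All numbers below are assumed for illustration (they come from no paper): both candidate solutions are scored by the regularized objective train_loss + λ·‖w‖², where the memorizer fits the noisy training set exactly but needs large weights, and the generalizing circuit has a small residual train loss but far smaller weights.

```python
# Toy cost model for grokking; all numbers are illustrative assumptions.
# Regularized objective: train_loss + lam * weight_norm_sq.

def regularized_cost(train_loss, weight_norm_sq, lam):
    return train_loss + lam * weight_norm_sq

MEMORIZE = dict(train_loss=0.0, weight_norm_sq=100.0)    # fits exactly, huge weights
GENERALIZE = dict(train_loss=0.05, weight_norm_sq=4.0)   # ignores label noise, small weights

def cheaper_solution(lam):
    m = regularized_cost(**MEMORIZE, lam=lam)
    g = regularized_cost(**GENERALIZE, lam=lam)
    return "generalize" if g < m else "memorize"

# Crossover: g < m  <=>  lam > (0.05 - 0.0) / (100 - 4) ≈ 0.0005
if __name__ == "__main__":
    for lam in (0.0, 1e-4, 1e-3, 1e-2):
        print(f"lam={lam:g}: {cheaper_solution(lam)}")
```

Sweeping λ shows the discrete flip: below the threshold the memorizer is the cheaper optimum, above it the generalizer is — a phase-transition-like change in which solution training pressure favors.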

Double Descent in Test Loss

As model size increases, test loss follows a “double descent” curve: first decreases (classical regime), then increases (overfitting), then decreases again (modern overparameterized regime). The interpolation threshold — where the model first fits the training data exactly — is a phase transition point where the loss landscape changes qualitatively.
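The spike at the interpolation threshold can be reproduced in a few lines. A sketch with minimum-norm least squares on random ReLU features (all sizes, the noise level, and the linear target are arbitrary illustrative choices): test error is worst when the number of features roughly equals the number of training points, then falls again in the overparameterized regime.

```python
import numpy as np

def avg_test_mse(n_features, n_train=30, n_test=500, trials=40, noise=0.25, seed=0):
    """Mean test MSE of minimum-norm least squares on random ReLU features.
    The interpolation threshold sits near n_features == n_train."""
    rng = np.random.default_rng(seed)
    d = 5  # input dimension (arbitrary)
    errs = []
    for _ in range(trials):
        w_true = rng.normal(size=d)                       # linear target
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
        y_te = X_te @ w_true
        W = rng.normal(size=(d, n_features)) / np.sqrt(d) # random projection
        F_tr = np.maximum(X_tr @ W, 0.0)                  # ReLU random features
        F_te = np.maximum(X_te @ W, 0.0)
        beta, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)  # min-norm solution
        errs.append(np.mean((F_te @ beta - y_te) ** 2))
    return float(np.mean(errs))

if __name__ == "__main__":
    for n_feat in (10, 30, 60, 300):   # 30 = interpolation threshold here
        print(f"{n_feat:4d} features: test MSE = {avg_test_mse(n_feat):.2f}")
```

The peak comes from near-singular feature matrices at the threshold: the interpolating solution amplifies noise enormously, then extra capacity lets the min-norm solution become tame again.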

In-Context Learning as Phase Transition

For K-shot prompting (K examples in the prompt), small models get worse as K increases. Large models get better. The crossover at some scale threshold is a phase transition in the sign of the K-shot effect. Below: more examples = more confusion. Above: more examples = better task inference.
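One toy account of the sign flip, purely illustrative and not a claim about transformer internals: treat each in-context example as an independent noisy “reading” of the task, with the model taking a majority vote over its K readings. If each reading is correct with probability q > 0.5 (a large model extracting signal), more examples help; if q < 0.5 (a small model misreading examples), more examples actively hurt.

```python
from math import comb

def kshot_accuracy(q, k):
    """P(correct task guess) in a toy majority-vote model: each of k
    in-context examples is independently 'read' correctly with
    probability q; the model votes over its k readings (k odd, no ties)."""
    assert k % 2 == 1, "use odd k to avoid ties"
    return sum(comb(k, j) * q**j * (1 - q) ** (k - j)
               for j in range(k // 2 + 1, k + 1))

# "Large model": q > 0.5, so more shots help.
# "Small model": q < 0.5, so each extra shot compounds the confusion.
for label, q in [("large", 0.65), ("small", 0.40)]:
    print(label, [round(kshot_accuracy(q, k), 3) for k in (1, 5, 9, 17)])
```

Majority voting amplifies whichever side of 0.5 the per-example quality sits on, so the K-shot effect flips sign exactly at q = 0.5 — a threshold in an underlying quantity producing opposite-signed macroscopic behavior.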

Why Scaling Laws Don’t Predict Phase Transitions

Scaling laws fit smooth power-law curves to validation loss. But loss can be smooth while task accuracy is discontinuous: a model at chance on arithmetic has many “near-miss” outputs (off-by-one errors, etc.) that give smooth loss, but zero accuracy. The phase transition happens when the model crosses from “many near-misses” to “getting it right” — a threshold in the task metric that’s invisible in the loss curve.

This is why emergent abilities appear discontinuous even though the underlying learning process (loss minimization) is continuous. The discontinuity is in the evaluation metric, not the training objective.

Loss curve:              smooth and monotonically decreasing
Task accuracy curve:     _______________/‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
                                        ↑
                                   Phase transition
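The loss-smooth/accuracy-sharp split is easy to demonstrate numerically. A sketch under an assumed curve shape: suppose the per-token probability of the correct answer token rises as a smooth logistic in log(parameters) (the logistic and every constant below are illustrative, not fitted to any model). Per-token loss then falls smoothly, but exact-match accuracy on a multi-token answer requires every token to be right, so it stays near zero and then shoots up.

```python
import numpy as np

# Assumed: per-token correctness is a smooth logistic in log10(parameters).
scales = np.logspace(9, 11.5, 6)                         # 1B .. ~316B parameters
p_token = 1.0 / (1.0 + np.exp(-2.0 * (np.log10(scales) - 10.0)))

per_token_loss = -np.log(p_token)   # smooth, monotonically decreasing
exact_match = p_token ** 8          # 8-token answer: all tokens must be right

for s, loss, acc in zip(scales, per_token_loss, exact_match):
    print(f"{s:9.2e} params   loss/token = {loss:5.2f}   exact-match = {acc:8.2%}")
```

Nothing discontinuous was put in: raising a smooth curve to the 8th power is all it takes to turn gradual per-token progress into an apparent capability jump in the exact-match metric.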

What’s Clever

The physics analogy is more than metaphorical. In statistical physics, phase transitions occur when the free energy landscape changes topology — a new minimum appears. In ML, phase transitions occur when the loss landscape develops a new attractor that corresponds to the generalized solution. Below the transition, the only stable attractor is memorization or a degenerate solution. Above, a new attractor appears and training dynamics pull toward it.

For grokking specifically, the transition has been formalized: the model needs sufficient “circuit capacity” to represent the general algorithm (e.g., modular arithmetic via Fourier features). Below the phase transition, this circuit hasn’t formed. At the transition, it crystallizes — suddenly, the generalized solution is reachable.
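The “modular arithmetic via Fourier features” algorithm can be verified directly. A minimal numpy check (modulus chosen arbitrarily small here) that a sum of cosines really computes modular addition: score each candidate answer c by a sum of cosines of a + b − c over several frequencies; each term peaks when a + b − c ≡ 0 (mod p), so the sum (a Dirichlet kernel) has its unique maximum at c = (a + b) mod p.

```python
import numpy as np

p = 13                            # modulus (small prime, for illustration)
ks = np.arange(1, p // 2 + 1)     # frequencies 1 .. (p - 1) / 2

def fourier_logits(a, b):
    # Logit for each candidate answer c: sum of cosines over frequencies.
    # At c = (a + b) mod p every cosine equals 1 (sum = len(ks)); at any
    # other c the sum collapses to -1/2, so the argmax is the right answer.
    c = np.arange(p)
    return np.cos(2 * np.pi * np.outer(ks, a + b - c) / p).sum(axis=0)

assert all(int(np.argmax(fourier_logits(a, b))) == (a + b) % p
           for a in range(p) for b in range(p))
print(f"trig circuit computes (a + b) mod {p} on all {p * p} input pairs")
```

The identity is all-or-nothing: the answer is correct only once all the frequency components are present with the right phases, which matches the picture of a circuit that “crystallizes” rather than improving gradually.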

Practical implication: you cannot evaluate safety-relevant capabilities at small scale and confidently extrapolate to large scale. Phase transitions can introduce capabilities (or failure modes) that are genuinely absent at smaller scale.

Key Sources

  • emergent-abilities — emergent abilities are phase transitions in capability vs. scale
  • scaling-laws — scaling laws describe smooth loss curves that miss the discontinuities phase transitions introduce
  • grokking — the specific phase transition from memorization to generalization
  • emergent-behavior — broader category of sharp capability thresholds

Open Questions

  • Can phase transitions be predicted in advance, or are they fundamentally unpredictable from below-threshold measurements?
  • Are emergent abilities actually sharp, or artifacts of discontinuous evaluation metrics applied to smooth underlying functions?
  • Do phase transitions occur along axes other than scale (training time, data quality, architecture changes)?
  • What is the mechanism — in terms of circuit formation — that produces the sharp transition in grokking?