Summary
Kaplan et al. (2020) empirically characterize how language model cross-entropy loss scales as a function of model size (N), dataset size (D), and compute budget (C) across more than seven orders of magnitude. The central finding is that each of these three factors independently follows a smooth power-law relationship with loss, with minimal interaction between them across the ranges studied. This predictability means that given a fixed compute budget, one can analytically determine the optimal model size and number of training tokens before running a single large experiment.
The key practical prescription from the Kaplan scaling laws is that larger models are significantly more sample-efficient: a model trained on fewer tokens but with more parameters achieves lower loss than a smaller model trained for longer. Concretely, for a fixed compute budget, the optimal strategy is to train the largest model that fits within the budget on a relatively modest amount of data, stopping well before convergence. This finding directly drove the design of GPT-3 (175B parameters, ~300B tokens). The paper also shows that architectural details — depth, width, attention heads — have minimal effect on final loss when parameter count is held constant.
Key Claims
- Test loss follows a power law in each factor taken alone: L(N) ∝ N^{-0.076} (model size), L(D) ∝ D^{-0.095} (dataset size), and L(C_min) ∝ C_min^{-0.050} (compute, along the efficient frontier).
- Power-law scaling holds across more than 7 orders of magnitude in each dimension.
- Compute-optimal allocation grows model size as N ∝ C^{0.73}: most of an increased compute budget should go into more parameters, not more data or more training steps.
- Architecture details (depth/width ratio) have negligible effect on loss for fixed parameter count within a wide range.
- Large models reach a given loss with substantially fewer optimization steps and training tokens than small models (a strong sample-efficiency advantage).
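The single-variable laws above turn into one-line predictors once the fitted constants are plugged in. A minimal Python sketch, using the constants reported in the paper (N_c ≈ 8.8e13 non-embedding parameters, D_c ≈ 5.4e13 tokens; quoted from memory, so treat the exact values as approximate):

```python
# Single-variable Kaplan power laws: L(x) = (x_c / x) ** alpha.
# Constants are the paper's reported fits (approximate); N counts
# non-embedding parameters, D counts training tokens.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def loss_from_params(n: float) -> float:
    """Predicted converged test loss (nats/token) for an N-parameter model."""
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d: float) -> float:
    """Predicted test loss for a large model early-stopped after D tokens."""
    return (D_C / d) ** ALPHA_D
```

Per these fits, doubling N multiplies loss by 2^{-0.076} ≈ 0.95, a steady ~5% reduction per doubling at any scale, and a 1.5B-parameter model (GPT-2 scale) lands at roughly 2.3 nats/token.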
Methods
The study trains a large suite of decoder-only Transformer language models, spanning roughly 768 to 1.5B non-embedding parameters, on WebText2, an extended version of the GPT-2 training set. Models are evaluated on held-out cross-entropy test loss, reported in nats per token. Each scaling axis is isolated in turn: L(N) from models trained to convergence with effectively unlimited data, L(D) from large models early-stopped on limited data, and L(C) from the loss envelope across budgets. Training compute is estimated as C ≈ 6ND FLOPs (about 6 FLOPs, i.e. three multiply-adds, per parameter per token, covering forward and backward passes). The compute-optimal allocation is obtained by fitting a parametric loss surface over model size and training steps (taken at the critical batch size) and minimizing predicted loss under a fixed compute budget, which yields N ∝ C^{0.73}.
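The C ≈ 6ND estimate and a budget-constrained minimization can be sketched in a few lines. This is an illustration, not the paper's fitting code; the joint form L(N, D) and its constants are quoted from memory and should be treated as approximate:

```python
import math

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # fitted constants (non-embedding params, tokens)

def train_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6ND: about 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

def joint_loss(n: float, d: float) -> float:
    """Kaplan et al.'s joint fit: L(N, D) = [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def best_split(budget_flops: float):
    """Grid-search the model size N minimizing L(N, C/(6N)) at fixed compute."""
    candidates = (10 ** (6 + 8 * i / 400) for i in range(1, 400))  # 1e6..1e14
    return min((joint_loss(n, budget_flops / (6 * n)), n) for n in candidates)

# GPT-3: 175B params x 300B tokens -> ~3.15e23 FLOPs of training compute.
gpt3_compute = train_flops(175e9, 300e9)
```

Note that minimizing this L(N, D) form directly under C = 6ND yields a somewhat shallower growth of N with C than the paper's step-based C^{0.73} frontier; the qualitative conclusion, that the larger share of budget growth goes to N, is the same.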
Failure modes
- Kaplan et al.'s compute-optimal prescription (scale N much faster than D) was later contradicted by Chinchilla (Hoffmann et al., 2022), which found that N and D should scale in equal proportion (both roughly ∝ C^{0.5}), implying that Kaplan-optimal models were substantially undertrained on data.
- Power laws are fit to specific data distributions (WebText); extrapolation to other modalities or data mixtures may not hold.
- The analysis ignores inference costs — training the largest possible model is optimal for training loss, but not for serving cost-per-query.
- Emergent abilities (qualitative capability jumps) are not captured by the smooth loss curves; loss improvements do not always translate linearly to downstream task performance.
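The first failure mode above is easy to quantify. A sketch assuming Chinchilla's ~20 tokens-per-parameter rule of thumb (Hoffmann et al., 2022) combined with C = 6ND; the GPT-3 figures are the actual Kaplan-era allocation, shown for comparison:

```python
import math

def chinchilla_allocation(budget_flops: float) -> tuple[float, float]:
    """Compute-optimal (N, D) under Hoffmann et al.'s D ~= 20N heuristic.

    With C = 6ND and D = 20N, C = 120 N^2, so N = sqrt(C / 120).
    """
    n = math.sqrt(budget_flops / 120)
    return n, 20 * n

# GPT-3's budget, allocated Kaplan-style: 175B params on 300B tokens
# (~1.7 tokens per parameter).
gpt3_budget = 6 * 175e9 * 300e9  # ~3.15e23 FLOPs

n_opt, d_opt = chinchilla_allocation(gpt3_budget)
# Chinchilla would instead spend the same budget on a much smaller model
# trained on far more tokens (~50B params on ~1e12 tokens).
```

In other words, for GPT-3's exact compute budget the two prescriptions differ by more than 3x in both model size and token count, which is the sense in which Kaplan-optimal models were undertrained.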
Connections
- language-models-are-few-shot-learners — GPT-3 was designed using these scaling laws
- training-compute-optimal-large-language-models — Chinchilla directly revises these laws, finding Kaplan’s C-optimal allocation undertrained on data
- llama-open-efficient-foundation-language-models — LLaMA applies Chinchilla-revised scaling to train on more tokens than Kaplan would prescribe
- emergent-abilities-of-large-language-models — emergent abilities are a downstream consequence of scaling
- in-context-learning — scales strongly with model size
- scaling-laws — the central contribution: power-law relationships across N, D, and C
- emergent-abilities — smooth loss curves do not capture the qualitative jumps this paper notes as out-of-scope
- transformer — all experiments use decoder-only Transformer models
- inference-efficiency — Kaplan’s prescription (train large, stop early) is suboptimal for inference cost
- openai — primary institution
Citation
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint. https://arxiv.org/abs/2001.08361