Summary
Kaplan et al. (2020) empirically characterize how language model cross-entropy loss scales as a function of model size (N), dataset size (D), and compute budget (C) across more than seven orders of magnitude. The central finding is that each of these three factors independently follows a smooth power-law relationship with loss, with minimal interaction between them across the ranges studied. This predictability means that given a fixed compute budget, one can analytically determine the optimal model size and number of training tokens before running a single large experiment.
The key practical prescription from the Kaplan scaling laws is that larger models are significantly more sample-efficient: a model trained on fewer tokens but with more parameters achieves lower loss than a smaller model trained for longer. Concretely, for a fixed compute budget, the optimal strategy is to train the largest model that fits within the budget on a relatively modest amount of data, stopping well before convergence. This finding directly drove the design of GPT-3 (175B parameters, ~300B tokens). The paper also shows that architectural details — depth, width, attention heads — have minimal effect on final loss when parameter count is held constant.
Key Claims
- Test loss follows a power law in each factor taken alone: L(N) ∝ N^{-0.076} (model size), L(D) ∝ D^{-0.095} (dataset size), and L(C_min) ∝ C_min^{-0.050} (compute, along the efficient frontier).
- Power-law scaling holds across more than 7 orders of magnitude in each dimension.
- Compute-optimal allocation grows model size as N ∝ C^{0.73}: most of an increased compute budget should go into more parameters, not more data or more training steps.
- Architecture details (depth/width ratio) have negligible effect on loss for fixed parameter count within a wide range.
- Large models reach a given loss with substantially fewer optimization steps and training tokens than small models (a strong sample-efficiency advantage).
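The single-variable laws above turn into one-line predictors once the fitted constants are plugged in. A minimal Python sketch, using the constants reported in the paper (N_c ≈ 8.8e13 non-embedding parameters, D_c ≈ 5.4e13 tokens; quoted from memory, so treat the exact values as approximate):

```python
# Single-variable Kaplan power laws: L(x) = (x_c / x) ** alpha.
# Constants are the paper's reported fits (approximate); N counts
# non-embedding parameters, D counts training tokens.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13

def loss_from_params(n: float) -> float:
    """Predicted converged test loss (nats/token) for an N-parameter model."""
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d: float) -> float:
    """Predicted test loss for a large model early-stopped after D tokens."""
    return (D_C / d) ** ALPHA_D
```

Per these fits, doubling N multiplies loss by 2^{-0.076} ≈ 0.95, a steady ~5% reduction per doubling at any scale, and a 1.5B-parameter model (GPT-2 scale) lands at roughly 2.3 nats/token.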
Methods
The study trains a large suite of decoder-only Transformer language models, spanning roughly 768 to 1.5B non-embedding parameters, on WebText2, an extended version of the GPT-2 training set. Models are evaluated on held-out cross-entropy test loss, reported in nats per token. Each scaling axis is isolated in turn: L(N) from models trained to convergence with effectively unlimited data, L(D) from large models early-stopped on limited data, and L(C) from the loss envelope across budgets. Training compute is estimated as C ≈ 6ND FLOPs (about 6 FLOPs, i.e. three multiply-adds, per parameter per token, covering forward and backward passes). The compute-optimal allocation is obtained by fitting a parametric loss surface over model size and training steps (taken at the critical batch size) and minimizing predicted loss under a fixed compute budget, which yields N ∝ C^{0.73}.
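The C ≈ 6ND estimate and a budget-constrained minimization can be sketched in a few lines. This is an illustration, not the paper's fitting code; the joint form L(N, D) and its constants are quoted from memory and should be treated as approximate:

```python
import math

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # fitted constants (non-embedding params, tokens)

def train_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6ND: about 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

def joint_loss(n: float, d: float) -> float:
    """Kaplan et al.'s joint fit: L(N, D) = [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def best_split(budget_flops: float):
    """Grid-search the model size N minimizing L(N, C/(6N)) at fixed compute."""
    candidates = (10 ** (6 + 8 * i / 400) for i in range(1, 400))  # 1e6..1e14
    return min((joint_loss(n, budget_flops / (6 * n)), n) for n in candidates)

# GPT-3: 175B params x 300B tokens -> ~3.15e23 FLOPs of training compute.
gpt3_compute = train_flops(175e9, 300e9)
```

Note that minimizing this L(N, D) form directly under C = 6ND yields a somewhat shallower growth of N with C than the paper's step-based C^{0.73} frontier; the qualitative conclusion, that the larger share of budget growth goes to N, is the same.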
Failure modes
- Kaplan et al.'s compute-optimal prescription (scale N much faster than D) was later contradicted by Chinchilla (Hoffmann et al., 2022), which found that N and D should scale in equal proportion (both roughly ∝ C^{0.5}), implying that Kaplan-optimal models were substantially undertrained on data.
- Power laws are fit to specific data distributions (WebText); extrapolation to other modalities or data mixtures may not hold.
- The analysis ignores inference costs — training the largest possible model is optimal for training loss, but not for serving cost-per-query.
- Emergent abilities (qualitative capability jumps) are not captured by the smooth loss curves; loss improvements do not always translate linearly to downstream task performance.
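The first failure mode above is easy to quantify. A sketch assuming Chinchilla's ~20 tokens-per-parameter rule of thumb (Hoffmann et al., 2022) combined with C = 6ND; the GPT-3 figures are the actual Kaplan-era allocation, shown for comparison:

```python
import math

def chinchilla_allocation(budget_flops: float) -> tuple[float, float]:
    """Compute-optimal (N, D) under Hoffmann et al.'s D ~= 20N heuristic.

    With C = 6ND and D = 20N, C = 120 N^2, so N = sqrt(C / 120).
    """
    n = math.sqrt(budget_flops / 120)
    return n, 20 * n

# GPT-3's budget, allocated Kaplan-style: 175B params on 300B tokens
# (~1.7 tokens per parameter).
gpt3_budget = 6 * 175e9 * 300e9  # ~3.15e23 FLOPs

n_opt, d_opt = chinchilla_allocation(gpt3_budget)
# Chinchilla would instead spend the same budget on a much smaller model
# trained on far more tokens (~50B params on ~1e12 tokens).
```

In other words, for GPT-3's exact compute budget the two prescriptions differ by more than 3x in both model size and token count, which is the sense in which Kaplan-optimal models were undertrained.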
Connections
- language-models-are-few-shot-learners — GPT-3 was designed using these scaling laws
- training-compute-optimal-large-language-models — Chinchilla directly revises these laws, finding Kaplan’s C-optimal allocation undertrained on data
- llama-open-efficient-foundation-language-models — LLaMA applies Chinchilla-revised scaling to train on more tokens than Kaplan would prescribe
- emergent-abilities-of-large-language-models — emergent abilities are a downstream consequence of scaling
- in-context-learning — scales strongly with model size
- scaling-laws — the central contribution: power-law relationships across N, D, and C
- emergent-abilities — smooth loss curves do not capture the qualitative jumps this paper notes as out-of-scope
- transformer — all experiments use decoder-only Transformer models
- inference-efficiency — Kaplan’s prescription (train large, stop early) is suboptimal for inference cost
- openai — primary institution
Citation
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint. https://arxiv.org/abs/2001.08361