Summary

Hoffmann et al. (2022) revisit the scaling laws of Kaplan et al. (2020) using a more rigorous experimental methodology — training over 400 models from 70M to 16B parameters on 5B to 500B tokens — and reach a starkly different conclusion. While Kaplan et al. prescribed allocating most of a growing compute budget to model size, Hoffmann et al. find that model size and training tokens should scale equally: for every doubling of model parameters, the number of training tokens should also double. By this measure, prior large models (GPT-3 175B, Gopher 280B, MT-NLG 530B) were all significantly undertrained relative to their parameter counts.

To validate this finding, the authors train Chinchilla (70B parameters, 1.4T tokens) using the same compute budget as DeepMind’s Gopher (280B parameters, 300B tokens). Chinchilla — 4× smaller but trained on more than 4× the data — uniformly outperforms Gopher across every evaluated benchmark and also surpasses GPT-3, Jurassic-1 (178B), and Megatron-Turing NLG (530B). On MMLU, Chinchilla achieves 67.5% average accuracy versus Gopher’s 60.0%. The Chinchilla result became the dominant scaling guideline for LLM training, directly informing LLaMA, Mistral, and most subsequent open-weight models.

Key Claims

  • Optimal compute-efficient training requires equal scaling of model size and training tokens: N_opt ∝ C^{0.5} and D_opt ∝ C^{0.5}.
  • Prior models (GPT-3, Gopher) were substantially undertrained: at their compute budgets, the compute-optimal model would have been roughly 4× smaller and trained on roughly 4× more tokens.
  • Chinchilla (70B / 1.4T tokens) outperforms Gopher (280B / 300B tokens) on every downstream benchmark despite using the same compute.
  • Chinchilla achieves 67.5% on MMLU, a 7.5 percentage-point improvement over Gopher’s 60.0%.
  • The rule of thumb derived: ~20 training tokens per model parameter for compute-optimal training.
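The equal-scaling claim and the 20-tokens-per-parameter rule can be combined with the standard FLOP approximation C ≈ 6ND (also used in the paper) to split a compute budget in closed form — a minimal sketch; the helper name is mine:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget C into (params N, tokens D) using C ~= 6*N*D
    and the ~20 tokens-per-parameter rule: D = 20*N, so N = sqrt(C / 120)."""
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

# Plugging in Gopher's training budget (~5.76e23 FLOPs) recovers
# Chinchilla's shape: roughly 69B parameters and 1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"N ~ {n:.3g} params, D ~ {d:.3g} tokens")
```

Both exponents being 0.5 (N_opt ∝ C^{0.5}, D_opt ∝ C^{0.5}) is exactly what makes this square-root split possible.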

Methods

Three approaches are used to estimate the optimal (N, D) allocation: (1) fixing a family of model sizes and varying the number of training tokens, reading the compute-optimal frontier off the envelope of the training curves; (2) IsoFLOP profiles — fixing several compute budgets C, varying model size within each budget, and locating the loss minimum; (3) fitting a parametric loss L(N, D) = E + A/N^α + B/D^β to all observed losses and minimizing it under the constraint C ≈ 6ND. All three converge on approximately equal scaling. The Chinchilla model uses the same Transformer architecture as Gopher but at 70B parameters, trained on 1.4T tokens from MassiveText (a filtered web + book + code corpus). Evaluation spans language modeling, reading comprehension, common-sense reasoning, BIG-bench, and MMLU.
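Approach (3) can be sketched numerically. The constants below are the values reported for the paper’s fit; the grid-scan minimization under C ≈ 6ND is my own simplification, and with these rounded constants the optimum lands in the few-tens-of-billions range (the paper notes approach 3 predicts a somewhat smaller optimal model than the other two):

```python
import numpy as np

# Constants reported for the paper's parametric fit (Approach 3):
# L(N, D) = E + A / N**alpha + B / D**beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def fitted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# At Gopher's budget, scan model sizes; whatever compute is not spent on
# parameters goes to tokens via C ~= 6*N*D.
C = 5.76e23
n_grid = np.logspace(9, 12, 2000)      # 1B .. 1T parameters
d_grid = C / (6.0 * n_grid)
n_best = n_grid[np.argmin(fitted_loss(n_grid, d_grid))]
print(f"loss-minimizing model size at Gopher compute: {n_best:.3g} params")
```

The same fitted loss also rationalizes the headline result: it assigns Chinchilla’s (70B, 1.4T) a lower predicted loss than Gopher’s (280B, 300B) at identical compute.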

Failure modes

  • The 20-tokens-per-parameter rule assumes training is the only cost; inference-heavy deployments often prefer smaller models trained on far more tokens (e.g., LLaMA trains 7B on 1T tokens, ~143 tokens/param).
  • Chinchilla’s results are on the specific MassiveText data mixture; optimal ratios may differ for code, math, or multilingual data.
  • The parametric loss fits are noisy at the tails; extrapolation to very large compute regimes (beyond ~10²⁶ FLOPs) is uncertain.
  • MMLU and BIG-bench few-shot accuracy are not purely smooth functions of scale; some tasks show non-monotonic improvement.
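The first failure mode is a cost-accounting point, and the arithmetic is easy to sketch: training costs roughly 6N FLOPs per token and a forward pass roughly 2N per token (both standard approximations, not from this paper). The 30B/3T configuration below is hypothetical, chosen only to illustrate the bookkeeping; whether a smaller, longer-trained model actually matches quality is an empirical question:

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Rough lifetime cost: ~6*N FLOPs per training token plus
    ~2*N FLOPs per served inference token (forward pass only)."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens

# Chinchilla-style 70B/1.4T vs a hypothetical 30B model trained on 3T
# tokens, each serving 1e12 inference tokens over its lifetime.
big = lifetime_flops(70e9, 1.4e12, 1e12)
small = lifetime_flops(30e9, 3.0e12, 1e12)
print(f"70B lifetime: {big:.3g} FLOPs, 30B lifetime: {small:.3g} FLOPs")
```

Under heavy inference load the smaller model wins on lifetime FLOPs despite the larger token budget — the logic behind LLaMA-style overtraining well past 20 tokens per parameter.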

Connections

Citation

arXiv:2203.15556

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models. arXiv preprint. https://arxiv.org/abs/2203.15556