Training Compute-Optimal Large Language Models (Chinchilla)

Concepts: scaling-laws | compute-optimal-training | pre-training | inference-efficiency
Builds on: scaling-laws-for-neural-language-models | language-models-are-few-shot-learners
Leads to: llama-open-efficient-foundation-language-models | llama-2-open-foundation-fine-tuned-chat-models

The year is 2022. Three of the world’s most capable language models have 175B, 280B, and 530B parameters. All three are trained on roughly 300B tokens. The shared belief shaping every major lab’s roadmap: more parameters means better models. A 22-person team at DeepMind trains 400 models and discovers the entire field is optimizing for the wrong variable.

The core idea

The analogy: Imagine two medical students preparing for board exams on the same fixed budget. Student A spends most of the budget on the most comprehensive medical library available — enormous capacity to store knowledge — and a fraction on practice problems. Student B rents a reasonably thorough library and splits the remaining budget evenly between reading and solving practice problems. When exam day comes, Student B wins, consistently.

The library is the model, measured in parameters. The practice problems are training tokens, the actual text the model sees during training. Before Chinchilla, every major lab was Student A: maximizing library size, dramatically under-investing in practice.

“We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant.”

Translation: the entire field was solving for the wrong axis.

The flaw in Kaplan et al. (2020):

The scaling-laws-for-neural-language-models paper ran ablations the natural way: train models of different sizes while keeping the dataset constant across experiments. It is a clean experiment, but it confounds two variables. When you hold D fixed and increase N, you’re simultaneously decreasing the number of tokens per parameter. You’re making data artificially sparse per unit of model capacity. Of course N looks more important: you’re not giving D a fair chance to contribute.

The Kaplan conclusion: parameter exponent dominates, so prioritize model size. This led to GPT-3 (175B parameters, 300B tokens — about 1.7 tokens/parameter), Gopher (280B, 300B tokens — about 1.1 tokens/param), and MT-NLG (530B, 270B tokens — a staggering 0.5 tokens/param). Every model was dramatically data-starved relative to its capacity.
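The tokens-per-parameter figures quoted above are simple ratios. A minimal sketch that recomputes them from the parameter and token counts given in the text (the dictionary and function names are illustrative, not from the paper):

```python
# Parameter and token counts exactly as quoted in the paragraph above.
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "MT-NLG": (530e9, 270e9),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


ratios = {name: tokens_per_param(n, d) for name, (n, d) in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} tokens/param")
```

All three ratios land far below the roughly 20 tokens per parameter that the paper ends up recommending.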

The fix: IsoFLOP profiles:

The Hoffmann team’s methodology change: for a fixed compute budget C (in FLOPs), train many models of different sizes and scale the training data so every model uses the same total compute. Then find which model size achieves the lowest loss.

Step by step:

Fix compute budget C (total FLOPs)
Note that C ≈ 6ND: training costs about 6 FLOPs per parameter per token, roughly 2 for the forward pass and 4 for the backward pass (which computes gradients with respect to both activations and weights)
For each candidate model size N, set D = C / (6N), so that every candidate spends exactly the same budget
Train each model to completion; record final validation loss
Find the N that minimizes loss — that’s the optimal model size for this compute budget
Repeat across a range of compute budgets C
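The steps above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `final_loss` stands in for an actual training run, and `toy_loss` is a made-up symmetric function used only so the sketch executes.

```python
from typing import Callable, List, Sequence, Tuple


def isoflop_points(compute_flops: float,
                   model_sizes: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each model size N with the token count D = C / (6N)."""
    return [(n, compute_flops / (6.0 * n)) for n in model_sizes]


def best_on_profile(compute_flops: float,
                    model_sizes: Sequence[float],
                    final_loss: Callable[[float, float], float]) -> Tuple[float, float]:
    """Return the (N, D) pair with the lowest loss on one IsoFLOP profile."""
    points = isoflop_points(compute_flops, model_sizes)
    return min(points, key=lambda nd: final_loss(nd[0], nd[1]))


def toy_loss(n: float, d: float) -> float:
    # Hypothetical stand-in for "train to completion, record validation loss".
    return n ** -0.3 + d ** -0.3


budget = 6e20  # FLOPs, arbitrary example budget
sizes = [1e9, 1e10, 1e11]
points = isoflop_points(budget, sizes)
best_n, best_d = best_on_profile(budget, sizes, toy_loss)
print(f"best N = {best_n:.0e}, D = {best_d:.2e}")
```

Repeating `best_on_profile` over several budgets gives the optimal (N, D) pairs that are then plotted against compute.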

When you plot the optimal (N, D) pairs across budgets, they fall on a line with slope ≈ 1 in log-log space. Model size and data should grow at the same rate.

KAPLAN METHOD (schematic numbers): data held fixed, model varied
  Budget C≈4e19 FLOPs:   N=1B,  D=7B    loss=2.14
  Budget C≈4e19 FLOPs:   N=3B,  D=2.3B  loss=2.20  ← too big, not enough data
  Budget C≈4e19 FLOPs:   N=400M,D=17B   loss=2.18
  Conclusion: 1B wins → "scale model size!"
  (But this is confounded: the 3B model had only 2.3B tokens, starved)

CHINCHILLA METHOD (schematic numbers): IsoFLOP, both vary, each row satisfies C ≈ 6ND
  Budget C≈4e19 FLOPs:   N=400M, D=17B  loss=2.18
                         N=1B,   D=7B   loss=2.14  ← true minimum
                         N=3B,   D=2.3B loss=2.20
  Budget C≈2e20 FLOPs:   N=1B,   D=33B  loss=2.05
                         N=3B,   D=11B  loss=1.98  ← minimum shifts with budget
                         N=10B,  D=3.3B loss=2.04
  Budget C≈5e20 FLOPs:   N=3B,   D=28B  loss=1.90
                         N=10B,  D=8B   loss=1.84  ← minimum
                         N=70B,  D=1.2B loss=1.95  ← too big again

  Pattern: optimal N scales as ~C^0.5, optimal D scales as ~C^0.5
  Rule of thumb from the paper's fits: D_opt ≈ 20 × N_opt

“We observe that all three approaches suggest that as compute budget increases, model size and the amount of training data should be increased in approximately equal proportions.”

The math, translated:

The team fits a parametric loss model across all 400+ training runs:

$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$

where:

$E \approx 1.69$: irreducible entropy of the data distribution. No model can go below this; it’s the information-theoretic floor of the text itself.
$A / N^{\alpha}$: excess loss from finite model size. More parameters compress the data better. Fitted: $A \approx 406$, $\alpha \approx 0.34$.
$B / D^{\beta}$: excess loss from finite training data. More tokens means better coverage of the data distribution. Fitted: $B \approx 410$, $\beta \approx 0.28$.
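To make the three terms concrete, here is a minimal sketch that evaluates the parametric loss with the rounded constants quoted above. The paper's unrounded fits differ slightly, so treat the outputs as approximate; the function name is illustrative.

```python
# Rounded fitted constants as quoted in the text.
E, A, B = 1.69, 406.0, 410.0
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


gopher_loss = parametric_loss(280e9, 300e9)      # 280B params, 300B tokens
chinchilla_loss = parametric_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens
print(f"Gopher-like:     {gopher_loss:.3f}")
print(f"Chinchilla-like: {chinchilla_loss:.3f}")
```

Under this fit the smaller model trained on more tokens gets the lower predicted loss, because the data term shrinks by more than the model-size term grows.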

To find the optimal allocation for a fixed compute budget $C = 6ND$, minimize $L(N, D)$ subject to this constraint. Using Lagrange multipliers (or substituting $D = C / (6N)$ and setting $dL/dN = 0$), the optimality condition is:

$\frac{\alpha A}{N^{\alpha}} = \frac{\beta B}{D^{\beta}}$

Read this as: at the optimum, growing N by 1% and growing D by 1% buy the same loss reduction. Percentages are the right unit because each costs the same 1% of compute under $C = 6ND$; one extra parameter costs $6D$ FLOPs while one extra token costs $6N$, so raw per-unit marginals are not directly comparable. If parameters are more valuable at the margin, reallocate budget toward them. If data is more valuable, reallocate toward data.

Solving with the constraint:

$N_{opt} \propto C^{\frac{\beta}{\alpha + \beta}} \approx C^{0.45}, \quad D_{opt} \propto C^{\frac{\alpha}{\alpha + \beta}} \approx C^{0.55}$

Both exponents are close to 0.5, which gives the clean rule: scale N and D equally.
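The exponents can be checked numerically without any calculus. The sketch below brute-forces the loss-minimizing N on a log-spaced grid for two budgets and measures the log-log slope. It uses the rounded constants quoted above, and the grid bounds and budgets are arbitrary choices. Under these assumptions the slope for N comes out near $\beta / (\alpha + \beta) \approx 0.45$, and the slope for D near $0.55$.

```python
import math

E, A, B = 1.69, 406.0, 410.0
ALPHA, BETA = 0.34, 0.28


def loss(n: float, d: float) -> float:
    return E + A / n ** ALPHA + B / d ** BETA


def numeric_n_opt(compute_flops: float, steps: int = 20001) -> float:
    """Grid-search the N that minimizes L(N, C / (6N))."""
    lo, hi = math.log(1e6), math.log(1e13)  # search range for N (assumed)
    best_n, best_loss = 0.0, float("inf")
    for i in range(steps):
        n = math.exp(lo + (hi - lo) * i / (steps - 1))
        current = loss(n, compute_flops / (6.0 * n))
        if current < best_loss:
            best_n, best_loss = n, current
    return best_n


c_small, c_large = 1e21, 1e24
n_small, n_large = numeric_n_opt(c_small), numeric_n_opt(c_large)

slope_n = math.log(n_large / n_small) / math.log(c_large / c_small)
slope_d = 1.0 - slope_n  # follows from D = C / (6N)
print(f"N_opt ~ C^{slope_n:.3f}, D_opt ~ C^{slope_d:.3f}")
```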

Walkthrough with actual numbers:

Gopher’s training budget: $C = 6 \times 280\text{B} \times 300\text{B} \approx 5.0 \times 10^{23}$ FLOPs.

Compute-optimal model for the same budget, using $D = 20N$:

C = 6 × N × D = 6 × N × 20N = 120N²

N² = C / 120 = 5.0×10²³ / 120 ≈ 4.2×10²¹

N_opt ≈ 65B parameters
D_opt = 20 × 65B = 1.3T tokens

Verify: 6 × 65B × 1.3T = 5.07×10²³ ≈ C  ✓

Actual Chinchilla:   70B params × 1.4T tokens
Verify: 6 × 70B × 1.4T = 5.88×10²³  (≈17% more than Gopher’s 5.04×10²³ under C ≈ 6ND, same order)

Gopher:      280B params × 300B tokens → tokens/param = 1.07
Chinchilla:   70B params × 1.4T tokens → tokens/param = 20.0  (18.7× more data-dense)
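The walkthrough above reduces to one square root. A minimal sketch, assuming the 20 tokens-per-parameter rule of thumb and the C ≈ 6ND approximation (the function name is illustrative):

```python
import math


def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * (r * N) for N, then return (N, D) with D = r * N."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt


gopher_flops = 6 * 280e9 * 300e9  # Gopher's budget under C = 6ND
n_opt, d_opt = compute_optimal(gopher_flops)
print(f"N_opt = {n_opt / 1e9:.1f}B params, D_opt = {d_opt / 1e12:.2f}T tokens")
```

Changing `tokens_per_param` shows how sensitive the answer is to the rule of thumb: the optimal N scales as one over the square root of that ratio.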

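At Gopher’s budget, the compute-optimal model has roughly 4× fewer parameters (70B vs. 280B) and sees roughly 4.7× more tokens (1.4T vs. 300B). Gopher paid for model capacity it couldn’t fill and skimped on the data that would fill it. The model had vast unused representational capacity because it never saw enough text.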

What’s clever: the methodology is the finding

The key instinct: how you run ablations determines what you can conclude. Kaplan’s team did the natural experiment — vary model size, hold data fixed — and got a clean but misleading answer. The fix, IsoFLOP profiles, isn’t a new algorithm. It’s a different question: “given this compute budget, where should I put it?” instead of “which model size converges faster?”

This is the kind of paper that teaches you as much about experimental design as about machine learning. The finding was always there, embedded in the loss surface. It just required asking the right question.

The second non-obvious insight: the training-deployment tradeoff. Chinchilla gives the compute-optimal training allocation. But inference is a recurring cost paid with every user query, and per-token inference compute scales roughly linearly with parameter count: a 70B model costs roughly 4× as much per inference call as a 17B model. If you’re serving millions of queries, you should overtrain a smaller model. The additional training compute is a one-time investment that pays off on every later inference call.

“This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.”

This observation is precisely why LLaMA (7B on 1T tokens — 143 tokens/param, vs. Chinchilla’s 20), Mistral (7B, pushed further), and Phi (3.8B, extreme overtraining on synthetic data) all deviate from the strict Chinchilla ratio. They’re not ignoring Chinchilla; they’re applying its deeper logic to a different cost function that includes inference.
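A rough way to reason about this tradeoff is a break-even count: how many served tokens it takes for a smaller model's inference savings to repay its extra training compute. The sketch below rests on several assumptions that are not from the paper: inference costs about 2 FLOPs per parameter per token (a common forward-pass approximation), the 17B model's 8T-token training run is a hypothetical figure, and the two models are simply assumed to reach comparable quality.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """One-time training cost under C = 6ND."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    # Assumption: ~2 FLOPs per parameter per token (forward pass only).
    return 2.0 * n_params


def break_even_tokens(big, small) -> float:
    """Served tokens after which the small model's extra training is repaid."""
    extra_training = training_flops(*small) - training_flops(*big)
    saving_per_token = (inference_flops_per_token(big[0])
                        - inference_flops_per_token(small[0]))
    return extra_training / saving_per_token


chinchilla_like = (70e9, 1.4e12)   # (params, training tokens) from the text
small_overtrained = (17e9, 8e12)   # hypothetical overtrained model

tokens = break_even_tokens(chinchilla_like, small_overtrained)
print(f"break-even after about {tokens:.2e} served tokens")
```

Past the break-even point every additional served token favors the smaller model, which is the logic behind deliberately exceeding the Chinchilla ratio.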

Does it work? What breaks?

Model	Params	Tokens	MMLU (5-shot)	HellaSwag	MMLU vs. Gopher
Megatron-Turing NLG	530B	270B	56.6%	82.4%	-3.4pp MMLU
GPT-3	175B	300B	43.9%	78.9%	-16.1pp MMLU
Gopher	280B	300B	60.0%	79.2%	baseline
Chinchilla	70B	1.4T	67.5%	80.8%	+7.5pp MMLU

“Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.”

A 70B model, with 4× fewer parameters than Gopher and 7.5× fewer than MT-NLG, beats all of them on the same training compute as Gopher. The MMLU gap is 7.5 points over Gopher, and at 67.5% Chinchilla set a new state of the art at the time.

What doesn’t work:

Three failure modes in the Chinchilla framework:

First, the 20 tokens/parameter rule treats data as homogeneous. It isn’t. Phi-1 (1.3B parameters, about 7B tokens of “textbook quality” data: filtered code from the web plus synthetic textbooks and exercises generated with GPT-3.5) outperforms models 10-30× its size trained on 300B tokens of raw web text on code benchmarks such as HumanEval. Data quality interacts with the N:D ratio in ways the parametric loss model doesn’t capture. The law was derived on MassiveText, a filtered corpus of web pages, books, news, and code, so it’s not universal.

Second, the law predicts pretraining perplexity, not downstream task performance. Some capabilities (multi-step arithmetic, complex reasoning, instruction following) show phase-transition-like jumps that smooth power laws miss entirely. A model on the loss curve can be above or below a capability threshold that matters enormously in practice.

Third, the parametric fits are noisy at the tails. The paper trains models up to 16B parameters on 500B tokens — the largest point. Extrapolating to 100B+ parameters or 10T+ tokens requires trusting that the power law holds outside the observed range. LLaMA-3’s 70B model trained on 15T tokens (214 tokens/param) is far outside Chinchilla’s validated regime. Whether the law still predicts the optimal allocation there is an open question.

So what?

If you’re allocating a training compute budget: use the Chinchilla rule as your starting point. ~20 tokens per parameter for compute-optimal training. A 7B model trains optimally on ~140B tokens; a 70B model on ~1.4T; a 400B model on ~8T. Then apply the deployment adjustment: if you’re serving at scale (millions of queries), bias toward a smaller model with more tokens. The training-to-inference cost ratio determines how far to push beyond the Chinchilla point.
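As a starting-point calculator, the rule of thumb is a one-liner. A minimal sketch, assuming 20 tokens per parameter and C ≈ 6ND (function names are illustrative):

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens under the ~20 tokens/param rule of thumb."""
    return tokens_per_param * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost, C = 6ND."""
    return 6.0 * n_params * n_tokens


for n in (7e9, 70e9, 400e9):
    d = chinchilla_tokens(n)
    print(f"N = {n / 1e9:.0f}B  D = {d / 1e12:.2f}T tokens  "
          f"C = {training_flops(n, d):.2e} FLOPs")
```

For deployment-heavy settings, raising `tokens_per_param` above 20 while shrinking N is the adjustment described above.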

scaling-laws-for-neural-language-models said scale the model. Chinchilla said scale the model and the data, equally. llama-open-efficient-foundation-language-models took the next step: push data even further to optimize for inference economics, not training efficiency. Every “efficient” open-weight model since 2022 is navigating this tradeoff. The Chinchilla paper is the pivot it rotates around.

Chinchilla also bears on whether emergent abilities are real or artifacts: if loss follows a smooth power law in compute while task performance shows sharp jumps, the jumps may be a property of the evaluation metric rather than of the underlying capability. This is suggestive, not conclusive: as noted above, the law predicts pretraining loss, not downstream capability.

Small model, lots of data, same compute — the 70B that beat the 280B.

Connections

scaling-laws — this paper revises the Kaplan scaling laws; equal N:D scaling replaces the parameter-heavy allocation
compute-optimal-training — the central concept introduced and empirically validated here
pre-training — all findings concern pretraining loss and downstream evaluation of pretrained models
inference-efficiency — Chinchilla’s 4× smaller size enables 4× cheaper inference; motivates inference-optimal overtraining as a deliberate strategy
emergent-behavior — smooth Chinchilla loss curves are evidence against genuine phase transitions in capability acquisition
scaling-laws-for-neural-language-models — the Kaplan et al. 2020 paper this directly revises; Chinchilla corrects the confounded ablation methodology
language-models-are-few-shot-learners — GPT-3 (175B, 300B tokens) is identified as undertrained by a factor of ~4×
llama-open-efficient-foundation-language-models — applies Chinchilla’s reasoning; intentionally overtrains smaller models for deployment efficiency
llama-2-open-foundation-fine-tuned-chat-models — Llama 2 70B trains on 2T tokens (about 29 tokens/param), close to the Chinchilla ratio at that scale

Citation

arXiv:2203.15556

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). Training Compute-Optimal Large Language Models. arXiv preprint. https://arxiv.org/abs/2203.15556
