What It Is
Compute-optimal training is the practice of allocating a fixed compute budget across model size (N) and training tokens (D) to minimize the resulting loss. Given a budget C ≈ 6ND FLOPs, you have a choice: train a large model for few steps, or a small model for many steps. Compute-optimal training finds the allocation that gets the lowest loss for the compute spent.
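The trade-off can be made concrete with the C ≈ 6ND rule of thumb. A minimal sketch (the model and token sizes here are illustrative, not recommendations):

```python
def flops_budget(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the C ~= 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

# Two very different allocations can cost the same compute:
big_short = flops_budget(10e9, 17e9)    # 10B params trained on 17B tokens
small_long = flops_budget(1e9, 170e9)   # 1B params trained on 170B tokens
assert big_short == small_long          # both ~1.02e21 FLOPs
```

Compute-optimal training asks which of these equal-cost allocations yields the lowest loss.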
Why It Matters
Most training runs before 2020 used intuition or convention to set model size and training duration. The scaling laws showed this was leaving significant performance on the table. The key finding: for a fixed compute budget, you should train a larger model than you think, and stop before convergence. Convergence is compute-inefficient.
How It Works
From Kaplan et al. (2020): the optimal model size and training data scale as:

N_opt ∝ C^0.73, D_opt ∝ C^0.27
These exponents mean that when compute doubles, the optimal model grows by 2^0.73 ≈ 1.66x and the optimal token count by only 2^0.27 ≈ 1.21x — most of the extra budget goes to model size. In practice: if your current model has N parameters trained on D tokens, doubling compute optimally means training a ~1.66x larger model on ~1.21x more tokens — not training the same model twice as long.
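As a sketch of how these power laws are applied, the following scales a known reference point along the Kaplan exponents (the reference compute and model size below are hypothetical placeholders, and the exponent is the rounded value quoted above):

```python
def kaplan_optimal(c_flops: float, c_ref: float, n_ref: float,
                   a: float = 0.73) -> tuple[float, float]:
    """Scale a reference (compute, model-size) point along N_opt ~ C^a.

    The token count then follows from C ~= 6*N*D, so D_opt ~ C^(1-a).
    """
    n_opt = n_ref * (c_flops / c_ref) ** a
    d_opt = c_flops / (6 * n_opt)
    return n_opt, d_opt

# Doubling compute from a (hypothetical) reference point grows the
# model ~1.66x and the token budget ~1.21x:
n0, d0 = kaplan_optimal(1e21, c_ref=1e21, n_ref=5e9)
n1, d1 = kaplan_optimal(2e21, c_ref=1e21, n_ref=5e9)
```

Here `n1 / n0` comes out to 2^0.73 ≈ 1.66 and `d1 / d0` to 2^0.27 ≈ 1.21, matching the doubling rule above.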
Chinchilla (Hoffmann et al., 2022) revised this to a more balanced rule: optimal training uses roughly 20 training tokens per parameter (D ≈ 20N), suggesting the original Kaplan experiments underweighted data.
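Unlike the Kaplan frontier, the Chinchilla rule pins down both N and D for a given budget directly. A minimal sketch under the D ≈ 20N and C ≈ 6ND assumptions:

```python
def chinchilla_optimal(c_flops: float,
                       tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Solve C = 6*N*D together with D = 20*N, giving N = sqrt(C / 120)."""
    n_opt = (c_flops / (6 * tokens_per_param)) ** 0.5
    return n_opt, tokens_per_param * n_opt

# Chinchilla's own budget (~5.9e23 FLOPs) lands near its published
# configuration of ~70B parameters trained on ~1.4T tokens:
n, d = chinchilla_optimal(5.9e23)
```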
Key Sources
- scaling-laws-for-neural-language-models — the paper that first derived compute-optimal allocation from power laws
- emergent-abilities-of-large-language-models — shows where compute-optimal loss predictions break for task performance