What It Is
Compute-optimal training is the practice of allocating a fixed compute budget across model size (N) and training tokens (D) to minimize the resulting loss. Given a budget C ≈ 6ND FLOPs, you have a choice: train a large model for few steps, or a small model for many steps. Compute-optimal training finds the allocation that gets the lowest loss for the compute spent.
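The trade-off can be made concrete with the C ≈ 6ND rule of thumb. A minimal sketch (the model and token sizes here are illustrative, not recommendations):

```python
def flops_budget(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the C ~= 6*N*D rule of thumb."""
    return 6 * n_params * n_tokens

# Two very different allocations can cost the same compute:
big_short = flops_budget(10e9, 17e9)    # 10B params trained on 17B tokens
small_long = flops_budget(1e9, 170e9)   # 1B params trained on 170B tokens
assert big_short == small_long          # both ~1.02e21 FLOPs
```

Compute-optimal training asks which of these equal-cost allocations yields the lowest loss.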
Why It Matters
Most training runs before 2020 used intuition or convention to set model size and training duration. The scaling laws showed this was leaving significant performance on the table. The key finding: for a fixed compute budget, you should train a larger model than you think, and stop before convergence. Convergence is compute-inefficient.
How It Works
From Kaplan et al. (2020): the optimal model size and training data scale as:

N_opt ∝ C^0.73, D_opt ∝ C^0.27
These exponents mean that when compute doubles, the optimal model grows by 2^0.73 ≈ 1.66x and the optimal token count by only 2^0.27 ≈ 1.21x — most of the extra budget goes to model size. In practice: if your current model has N parameters trained on D tokens, doubling compute optimally means training a ~1.66x larger model on ~1.21x more tokens — not training the same model twice as long.
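As a sketch of how these power laws are applied, the following scales a known reference point along the Kaplan exponents (the reference compute and model size below are hypothetical placeholders, and the exponent is the rounded value quoted above):

```python
def kaplan_optimal(c_flops: float, c_ref: float, n_ref: float,
                   a: float = 0.73) -> tuple[float, float]:
    """Scale a reference (compute, model-size) point along N_opt ~ C^a.

    The token count then follows from C ~= 6*N*D, so D_opt ~ C^(1-a).
    """
    n_opt = n_ref * (c_flops / c_ref) ** a
    d_opt = c_flops / (6 * n_opt)
    return n_opt, d_opt

# Doubling compute from a (hypothetical) reference point grows the
# model ~1.66x and the token budget ~1.21x:
n0, d0 = kaplan_optimal(1e21, c_ref=1e21, n_ref=5e9)
n1, d1 = kaplan_optimal(2e21, c_ref=1e21, n_ref=5e9)
```

Here `n1 / n0` comes out to 2^0.73 ≈ 1.66 and `d1 / d0` to 2^0.27 ≈ 1.21, matching the doubling rule above.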
Chinchilla (Hoffmann et al., 2022) revised this to a more balanced rule: optimal training uses roughly 20 training tokens per parameter (D ≈ 20N), suggesting the original Kaplan experiments underweighted data.
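Unlike the Kaplan frontier, the Chinchilla rule pins down both N and D for a given budget directly. A minimal sketch under the D ≈ 20N and C ≈ 6ND assumptions:

```python
def chinchilla_optimal(c_flops: float,
                       tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Solve C = 6*N*D together with D = 20*N, giving N = sqrt(C / 120)."""
    n_opt = (c_flops / (6 * tokens_per_param)) ** 0.5
    return n_opt, tokens_per_param * n_opt

# Chinchilla's own budget (~5.9e23 FLOPs) lands near its published
# configuration of ~70B parameters trained on ~1.4T tokens:
n, d = chinchilla_optimal(5.9e23)
```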
Key Sources
- scaling-laws-for-neural-language-models — the paper that first derived compute-optimal allocation from power laws
- emergent-abilities-of-large-language-models — shows where compute-optimal loss predictions break for task performance