What It Is
Scaling laws are empirical power-law relationships between model performance (loss) and three quantities: number of parameters (N), training tokens (D), and compute budget (C ≈ 6ND FLOPs). They predict how much better a model will get as you scale any of these axes — and crucially, what the optimal allocation of a fixed compute budget is between model size and training data.
Why It Matters
Scaling laws turned LLM development from art into engineering. If you know the relationship, you can predict optimal model size and training duration for a given compute budget before running the experiment. This is worth billions of dollars in avoided wasted compute. Chinchilla (2022) revised the Kaplan (2020) laws and showed that frontier labs had been training models that were too large and undertrained; Gopher, for instance, was roughly 4× larger than compute-optimal for its budget, wasting enormous compute on a suboptimal configuration.
The Key Results
Kaplan et al. (2020) — “Scaling Laws for Neural Language Models”
- Loss improves as a power law in both model size and data.
- Loss falls as a power law in each axis: L(N) ∝ N^-0.076 and L(D) ∝ D^-0.095. Fitting the compute-efficient frontier, Kaplan found N_opt ∝ C^0.73 and D_opt ∝ C^0.27, i.e. parameters should grow much faster than data as compute increases.
- Compute-optimal under Kaplan: For a fixed compute budget, scale N faster than D. This led to very large, data-undertrained models — GPT-3 style (175B parameters, only 300B tokens ≈ 1.7 tokens/parameter).
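A small sketch of what the Kaplan-style allocation implies in practice. It assumes the fitted frontier exponents N ∝ C^0.73 and D ∝ C^0.27 from Kaplan et al., and anchors the curve at a GPT-3-like point (175B params, 300B tokens); the anchor is an illustrative assumption, not a published fit.

```python
# Kaplan-style compute allocation (sketch). Exponents from Kaplan et al.;
# the GPT-3-like anchor point is an illustrative assumption.

def kaplan_allocation(compute_flops, ref_n=175e9, ref_d=300e9):
    ref_c = 6 * ref_n * ref_d            # C ≈ 6ND at the anchor point
    ratio = compute_flops / ref_c
    n = ref_n * ratio**0.73              # parameters scale fast with C
    d = ref_d * ratio**0.27              # data scales slowly with C
    return n, d

n, d = kaplan_allocation(1e22)           # a ~30x smaller budget than GPT-3
print(f"N ≈ {n/1e9:.1f}B params, D ≈ {d/1e9:.0f}B tokens, {d/n:.1f} tok/param")
```

Because the two exponents sum to 1, any allocation on this curve exactly spends the budget C ≈ 6ND; the point is how lopsidedly it spends it.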
Chinchilla (Hoffmann et al., 2022) — “Training Compute-Optimal Large Language Models”
The key correction: Kaplan's fits reused one fixed learning-rate schedule and token budget across model sizes, which biased the estimates toward undervaluing data. Chinchilla matched the schedule to each training duration, varied N and D jointly, and found the optimal N:D ratio.
Chinchilla result: For compute-optimal training, scale N and D equally. Specifically:
- Roughly 20 tokens of training data per parameter is optimal.
- A 70B model is compute-optimal at ~1.4 trillion tokens.
- Chinchilla (70B, 1.4T tokens) outperforms Gopher (280B, 300B tokens) with 4× fewer parameters.
Kaplan: bigger model > more data (GPT-3 style: 175B params, 300B tokens)
Chinchilla: equal scaling of both (LLaMA style: 7B-70B, 1-2T tokens)
GPT-3 (175B params, 300B tokens): undertrained by roughly 10× (the 20 tokens/parameter rule calls for ~3.5T tokens)
Chinchilla (70B params, 1.4T tokens): compute-optimal
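A minimal sketch of the Chinchilla rule of thumb: combining D ≈ 20N with C ≈ 6ND gives N_opt ≈ sqrt(C / 120) and D_opt = 20N_opt.

```python
import math

# Chinchilla-optimal allocation from the 20 tokens/parameter rule of
# thumb: D = 20N and C = 6ND together give N = sqrt(C / 120).

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Sanity check against Chinchilla itself: 70B params, 1.4T tokens,
# so C ≈ 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
c = 6 * 70e9 * 1.4e12
n, d = chinchilla_allocation(c)
print(f"N ≈ {n/1e9:.0f}B params, D ≈ {d/1e12:.1f}T tokens")  # 70B, 1.4T
```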
The Math in Plain English
Total compute C ≈ 6ND: each training token costs about 6 FLOPs per parameter (2 in the forward pass, 4 in the backward pass). For a fixed C:
- Making N twice as large requires halving D — fewer training steps, worse generalization.
- Making N half as large lets you double D — more training steps, better generalization.
The Chinchilla finding: the fitted allocation exponents are nearly equal (N ∝ C^0.5 and D ∝ C^0.5), so the loss contours are roughly symmetric in log(N) and log(D).
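The trade-off can be made concrete with the parametric loss surface fitted in the Chinchilla paper (Approach 3): L(N, D) = E + A/N^α + B/D^β. The constants below are the published fits; treat them as approximate.

```python
# Chinchilla parametric loss fit (Hoffmann et al., Approach 3).
# Constants are the published values; treat them as approximate.

def chinchilla_loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n**alpha + B / d**beta

# Two allocations at roughly the same compute budget:
gopher = chinchilla_loss(280e9, 300e9)   # big model, little data
chinch = chinchilla_loss(70e9, 1.4e12)   # 4x smaller, ~4.7x more data
print(f"Gopher-style: {gopher:.3f}  Chinchilla-style: {chinch:.3f}")
```

The smaller, longer-trained allocation lands at a lower predicted loss, which is exactly the Gopher-vs-Chinchilla comparison above.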
Numeric Example
Training budget: 10²³ FLOPs (on the order of $1-2M at 2023 A100 prices).
Under a Kaplan-style allocation: N ≈ 80B params, D ≈ 200B tokens (≈2.5 tokens/parameter)
Under Chinchilla: N ≈ 29B params, D ≈ 575B tokens (20 tokens/parameter)
Both allocations satisfy C ≈ 6ND ≈ 10²³; the Chinchilla model achieves lower validation loss.
What Happens When You Violate Scaling Laws
Overparametrized, undertrained (GPT-3 style):
- Model has excess capacity but hasn’t seen enough data to fill it.
- Loss is higher than optimal for the compute spent.
- Still useful: the capacity is there at inference time, and you can continue training on more data later to close the gap relatively cheaply.
Underparametrized, overtrained (LLaMA style at extreme):
- Model has seen enormous data but lacks capacity to compress it.
- Per-token inference cost is low (smaller model), but its loss floor is higher: capacity limits how far loss can fall.
- LLaMA-1 7B trained on 1T tokens — compute-suboptimal for training, but optimal for inference cost because a small model is cheap to deploy.
This is the key practical refinement: Chinchilla gives compute-optimal training, but deployment-optimal training should overtrain smaller models because inference cost is paid repeatedly while training is one-time.
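The training-vs-inference trade can be sketched with rough FLOP accounting: about 6N FLOPs per training token and about 2N FLOPs per generated token at inference. The 5T-token serving volume below is a made-up illustrative number.

```python
# Inference-aware lifetime cost (sketch). Assumes ~6N FLOPs per training
# token and ~2N FLOPs per generated token; the serving volume is a
# hypothetical illustrative number.

def lifetime_flops(n_params, train_tokens, served_tokens):
    train = 6 * n_params * train_tokens   # one-time training cost
    serve = 2 * n_params * served_tokens  # paid on every query
    return train + serve

served = 5e12                                 # hypothetical: 5T tokens served
big = lifetime_flops(70e9, 1.4e12, served)    # Chinchilla-optimal 70B
small = lifetime_flops(7e9, 1e12, served)     # overtrained LLaMA-style 7B
print(f"70B lifetime: {big:.2e} FLOPs   7B lifetime: {small:.2e} FLOPs")
```

At this serving volume the 7B model's lifetime cost is an order of magnitude lower, even though its training run was compute-suboptimal; the heavier the deployment, the further the optimum shifts toward small, overtrained models.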
Implications for Fine-Tuning
Scaling laws are measured on pretraining loss, not fine-tuning performance. Key observations:
- Larger pretrained models fine-tune better — they have more representational capacity for the fine-tuning task.
- Fine-tuning data requirements don’t follow the same N:D ratio — even 1,000 high-quality SFT examples change behavior substantially.
- RLHF and DPO improvements are not captured in pretraining loss metrics — alignment is orthogonal to the scaling law regime.
What’s Clever
Loss follows a power law, which means a straight line on a log-log plot. You can measure this at small scale (1M-10M parameter models) and extrapolate to large scale with high confidence, without running the large experiment first. This made compute-optimal training tractable. The Chinchilla correction was methodologically simple: Kaplan's experiments reused one fixed learning-rate schedule and token budget across model sizes, which systematically undervalued training data; matching the schedule to each run's duration shifted the optimum.
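The fit-small, extrapolate-large workflow is a one-liner in log space. Here the synthetic "measurements" are generated from Kaplan's published fit L(N) = (Nc/N)^0.076 with Nc = 8.8e13, so they lie exactly on a power law; real measurements would scatter around the line.

```python
import numpy as np

# Fit a power law at small scale, extrapolate to large scale.
# Synthetic losses follow Kaplan's fit L(N) = (Nc/N)^0.076, Nc = 8.8e13.

ns = np.array([1e6, 3e6, 1e7, 3e7, 1e8])     # small probe models
losses = (8.8e13 / ns) ** 0.076              # pretend measured losses

# A power law is a straight line in log-log space:
slope, intercept = np.polyfit(np.log(ns), np.log(losses), 1)
pred_1t = np.exp(intercept + slope * np.log(1e12))   # extrapolate to 1T

print(f"fitted exponent: {slope:.3f}")       # recovers -0.076
print(f"extrapolated loss at 1T params: {pred_1t:.2f}")
```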
Common misconception: scaling laws predict that bigger is always better. They don’t — they predict that compute-efficient allocation is better. A 7B model trained on 1T tokens can outperform a 65B model trained on 100B tokens on many benchmarks.
Key Sources
- scaling-laws-neural-language-models — Kaplan et al. 2020; original power-law fits for N, D, C
- training-compute-optimal-large-language-models — Chinchilla (Hoffmann et al. 2022); corrects N:D ratio to ~1:20
- emergent-abilities-of-large-language-models — emergent abilities as discontinuities that smooth scaling law curves miss
- language-models-are-few-shot-learners — GPT-3; the model that scaling laws predicted and shaped
Related Concepts
- emergent-abilities — capabilities that appear suddenly above scale thresholds, violating smooth extrapolation
- in-context-learning — few-shot capability whose scaling properties are non-monotonic
- grokking — phase transitions in learning that scaling curves can’t predict
- emergent-behavior — sharp capability thresholds that smooth loss curves miss
- sft — fine-tuning behavior doesn’t follow pretraining scaling laws in a simple way
Open Questions
- Do scaling laws hold for reasoning-intensive tasks, or do capability thresholds dominate?
- What is the compute-optimal regime when inference cost is amortized over many uses?
- Do scaling laws transfer across architectures, or are they Transformer-specific?
- How do data quality improvements interact with the N:D ratio?