What It Is
Scaling laws are empirical power-law relationships between model performance (loss) and three quantities: number of parameters (N), training tokens (D), and compute budget (C ≈ 6ND FLOPs). They predict how much better a model will get as you scale any of these axes — and crucially, what the optimal allocation of a fixed compute budget is between model size and training data.
Why It Matters
Scaling laws turned LLM development from art into engineering. If you know the relationship, you can predict optimal model size and training duration for a given compute budget before running the experiment. This is worth billions of dollars in avoided wasted compute. Chinchilla (2022) revised the Kaplan (2020) laws and showed that frontier labs had been training models that were too large and undertrained; Gopher, for instance, was roughly 4× larger than compute-optimal for its budget, wasting enormous compute on a suboptimal configuration.
The Key Results
Kaplan et al. (2020) — “Scaling Laws for Neural Language Models”
- Loss improves as a power law in both model size and data.
- Loss falls as a power law in each axis: L(N) ∝ N^-0.076 and L(D) ∝ D^-0.095. Fitting the compute-efficient frontier, Kaplan found N_opt ∝ C^0.73 and D_opt ∝ C^0.27, i.e. parameters should grow much faster than data as compute increases.
- Compute-optimal under Kaplan: For a fixed compute budget, scale N faster than D. This led to very large, data-undertrained models — GPT-3 style (175B parameters, only 300B tokens ≈ 1.7 tokens/parameter).
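A small sketch of what the Kaplan-style allocation implies in practice. It assumes the fitted frontier exponents N ∝ C^0.73 and D ∝ C^0.27 from Kaplan et al., and anchors the curve at a GPT-3-like point (175B params, 300B tokens); the anchor is an illustrative assumption, not a published fit.

```python
# Kaplan-style compute allocation (sketch). Exponents from Kaplan et al.;
# the GPT-3-like anchor point is an illustrative assumption.

def kaplan_allocation(compute_flops, ref_n=175e9, ref_d=300e9):
    ref_c = 6 * ref_n * ref_d            # C ≈ 6ND at the anchor point
    ratio = compute_flops / ref_c
    n = ref_n * ratio**0.73              # parameters scale fast with C
    d = ref_d * ratio**0.27              # data scales slowly with C
    return n, d

n, d = kaplan_allocation(1e22)           # a ~30x smaller budget than GPT-3
print(f"N ≈ {n/1e9:.1f}B params, D ≈ {d/1e9:.0f}B tokens, {d/n:.1f} tok/param")
```

Because the two exponents sum to 1, any allocation on this curve exactly spends the budget C ≈ 6ND; the point is how lopsidedly it spends it.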
Chinchilla (Hoffmann et al., 2022) — “Training Compute-Optimal Large Language Models”
The key correction: Kaplan's fits reused one fixed learning-rate schedule and token budget across model sizes, which biased the estimates toward undervaluing data. Chinchilla matched the schedule to each training duration, varied N and D jointly, and found the optimal N:D ratio.
Chinchilla result: For compute-optimal training, scale N and D equally. Specifically:
- Roughly 20 tokens of training data per parameter is optimal.
- A 70B model is compute-optimal at ~1.4 trillion tokens.
- Chinchilla (70B, 1.4T tokens) outperforms Gopher (280B, 300B tokens) with 4× fewer parameters.
Kaplan: bigger model > more data (GPT-3 style: 175B params, 300B tokens)
Chinchilla: equal scaling of both (LLaMA style: 7B-70B, 1-2T tokens)
GPT-3 (175B params, 300B tokens): undertrained by roughly 10× (the 20 tokens/parameter rule calls for ~3.5T tokens)
Chinchilla (70B params, 1.4T tokens): compute-optimal
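A minimal sketch of the Chinchilla rule of thumb: combining D ≈ 20N with C ≈ 6ND gives N_opt ≈ sqrt(C / 120) and D_opt = 20N_opt.

```python
import math

# Chinchilla-optimal allocation from the 20 tokens/parameter rule of
# thumb: D = 20N and C = 6ND together give N = sqrt(C / 120).

def chinchilla_allocation(compute_flops, tokens_per_param=20):
    n = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n, tokens_per_param * n

# Sanity check against Chinchilla itself: 70B params, 1.4T tokens,
# so C ≈ 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
c = 6 * 70e9 * 1.4e12
n, d = chinchilla_allocation(c)
print(f"N ≈ {n/1e9:.0f}B params, D ≈ {d/1e12:.1f}T tokens")  # 70B, 1.4T
```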
The Math in Plain English
Total compute C ≈ 6ND: each training token costs about 6 FLOPs per parameter (2 in the forward pass, 4 in the backward pass). For a fixed C:
- Making N twice as large requires halving D — fewer training steps, worse generalization.
- Making N half as large lets you double D — more training steps, better generalization.
The Chinchilla finding: the fitted allocation exponents are nearly equal (N ∝ C^0.5 and D ∝ C^0.5), so the loss contours are roughly symmetric in log(N) and log(D).
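The trade-off can be made concrete with the parametric loss surface fitted in the Chinchilla paper (Approach 3): L(N, D) = E + A/N^α + B/D^β. The constants below are the published fits; treat them as approximate.

```python
# Chinchilla parametric loss fit (Hoffmann et al., Approach 3).
# Constants are the published values; treat them as approximate.

def chinchilla_loss(n, d, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n**alpha + B / d**beta

# Two allocations at roughly the same compute budget:
gopher = chinchilla_loss(280e9, 300e9)   # big model, little data
chinch = chinchilla_loss(70e9, 1.4e12)   # 4x smaller, ~4.7x more data
print(f"Gopher-style: {gopher:.3f}  Chinchilla-style: {chinch:.3f}")
```

The smaller, longer-trained allocation lands at a lower predicted loss, which is exactly the Gopher-vs-Chinchilla comparison above.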
Numeric Example
Training budget: 10²³ FLOPs (on the order of $1-2M at 2023 A100 prices).
Under a Kaplan-style allocation: N ≈ 80B params, D ≈ 200B tokens (≈2.5 tokens/parameter)
Under Chinchilla: N ≈ 29B params, D ≈ 575B tokens (20 tokens/parameter)
Both allocations satisfy C ≈ 6ND ≈ 10²³; the Chinchilla model achieves lower validation loss.
What Happens When You Violate Scaling Laws
Overparametrized, undertrained (GPT-3 style):
- Model has excess capacity but hasn’t seen enough data to fill it.
- Loss is higher than optimal for the compute spent.
- Still useful: the capacity is there at inference time, and you can continue training on more data later to close the gap relatively cheaply.
Underparametrized, overtrained (LLaMA style at extreme):
- Model has seen enormous data but lacks capacity to compress it.
- Per-token inference cost is low (smaller model), but its loss floor is higher: capacity limits how far loss can fall.
- LLaMA-1 7B trained on 1T tokens — compute-suboptimal for training, but optimal for inference cost because a small model is cheap to deploy.
This is the key practical refinement: Chinchilla gives compute-optimal training, but deployment-optimal training should overtrain smaller models because inference cost is paid repeatedly while training is one-time.
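The training-vs-inference trade can be sketched with rough FLOP accounting: about 6N FLOPs per training token and about 2N FLOPs per generated token at inference. The 5T-token serving volume below is a made-up illustrative number.

```python
# Inference-aware lifetime cost (sketch). Assumes ~6N FLOPs per training
# token and ~2N FLOPs per generated token; the serving volume is a
# hypothetical illustrative number.

def lifetime_flops(n_params, train_tokens, served_tokens):
    train = 6 * n_params * train_tokens   # one-time training cost
    serve = 2 * n_params * served_tokens  # paid on every query
    return train + serve

served = 5e12                                 # hypothetical: 5T tokens served
big = lifetime_flops(70e9, 1.4e12, served)    # Chinchilla-optimal 70B
small = lifetime_flops(7e9, 1e12, served)     # overtrained LLaMA-style 7B
print(f"70B lifetime: {big:.2e} FLOPs   7B lifetime: {small:.2e} FLOPs")
```

At this serving volume the 7B model's lifetime cost is an order of magnitude lower, even though its training run was compute-suboptimal; the heavier the deployment, the further the optimum shifts toward small, overtrained models.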
Implications for Fine-Tuning
Scaling laws are measured on pretraining loss, not fine-tuning performance. Key observations:
- Larger pretrained models fine-tune better — they have more representational capacity for the fine-tuning task.
- Fine-tuning data requirements don’t follow the same N:D ratio — even 1,000 high-quality SFT examples change behavior substantially.
- RLHF and DPO improvements are not captured in pretraining loss metrics — alignment is orthogonal to the scaling law regime.
What’s Clever
Loss follows a power law, which means a straight line on a log-log plot. You can measure this at small scale (1M-10M parameter models) and extrapolate to large scale with high confidence, without running the large experiment first. This made compute-optimal training tractable. The Chinchilla correction was methodologically simple: Kaplan's experiments reused one fixed learning-rate schedule and token budget across model sizes, which systematically undervalued training data; matching the schedule to each run's duration shifted the optimum.
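The fit-small, extrapolate-large workflow is a one-liner in log space. Here the synthetic "measurements" are generated from Kaplan's published fit L(N) = (Nc/N)^0.076 with Nc = 8.8e13, so they lie exactly on a power law; real measurements would scatter around the line.

```python
import numpy as np

# Fit a power law at small scale, extrapolate to large scale.
# Synthetic losses follow Kaplan's fit L(N) = (Nc/N)^0.076, Nc = 8.8e13.

ns = np.array([1e6, 3e6, 1e7, 3e7, 1e8])     # small probe models
losses = (8.8e13 / ns) ** 0.076              # pretend measured losses

# A power law is a straight line in log-log space:
slope, intercept = np.polyfit(np.log(ns), np.log(losses), 1)
pred_1t = np.exp(intercept + slope * np.log(1e12))   # extrapolate to 1T

print(f"fitted exponent: {slope:.3f}")       # recovers -0.076
print(f"extrapolated loss at 1T params: {pred_1t:.2f}")
```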
Common misconception: scaling laws predict that bigger is always better. They don’t — they predict that compute-efficient allocation is better. A 7B model trained on 1T tokens can outperform a 65B model trained on 100B tokens on many benchmarks.
Key Sources
- scaling-laws-neural-language-models — Kaplan et al. 2020; original power-law fits for N, D, C
- training-compute-optimal-large-language-models — Chinchilla (Hoffmann et al. 2022); corrects N:D ratio to ~1:20
- emergent-abilities-of-large-language-models — emergent abilities as discontinuities that smooth scaling law curves miss
- language-models-are-few-shot-learners — GPT-3; the model that scaling laws predicted and shaped
Related Concepts
- emergent-abilities — capabilities that appear suddenly above scale thresholds, violating smooth extrapolation
- in-context-learning — few-shot capability whose scaling properties are non-monotonic
- grokking — phase transitions in learning that scaling curves can’t predict
- emergent-behavior — sharp capability thresholds that smooth loss curves miss
- sft — fine-tuning behavior doesn’t follow pretraining scaling laws in a simple way
Open Questions
- Do scaling laws hold for reasoning-intensive tasks, or do capability thresholds dominate?
- What is the compute-optimal regime when inference cost is amortized over many uses?
- Do scaling laws transfer across architectures, or are they Transformer-specific?
- How do data quality improvements interact with the N:D ratio?