Concepts: scaling laws | emergent behavior | compute-optimal training | power laws
Builds on: Attention Is All You Need
Leads to: Emergent Abilities of LLMs
Part 1: The problem
Before this paper, training a language model was expensive guesswork. You had a compute budget, you had intuitions about model size, and you had gut feelings about how long to train. Nobody could tell you whether doubling your parameters would be worth more than doubling your data. Nobody could tell you when to stop training. The question “how much compute should go to model size versus training duration?” had no principled answer — just accumulated folklore and expensive ablations that never quite generalized. Every large training run was a bet with millions of dollars at stake and no odds to read.
Part 2: How scaling laws work
The speed camera analogy
Speed cameras don’t measure every car to know traffic patterns. They measure enough cars at enough locations to fit a curve, then predict confidently. The curve is the model.
Kaplan et al. did the same thing with language models. Train hundreds of models across six orders of magnitude in size — from 768 parameters to 1.5 billion. Measure the loss on each. Plot it. What emerges is not noise. It is a straight line on a log-log plot. Every single time.
That straight line on a log-log plot is a power law. And once you know the slope and intercept from small experiments, you can read off the predicted loss for any model size you haven’t yet trained.
This is the core move: small experiments become a telescope into large-scale behavior.
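A minimal sketch of that move in Python: fit the log-log line on small runs, then read it off at a scale never trained. The "measured" losses below are synthetic, generated from the paper's fitted form purely for illustration, not real data.

```python
import numpy as np

# Hypothetical small-scale runs: model sizes we can afford to train.
params = np.array([1e6, 1e7, 1e8, 2e8])
# Stand-in "measured" losses, synthesized from the paper's fitted form
# L(N) = (N_c / N)^alpha_N purely for illustration (not real measurements).
losses = (8.8e13 / params) ** 0.076

# A power law is a straight line in log-log space: log L = slope*log N + b.
slope, intercept = np.polyfit(np.log10(params), np.log10(losses), 1)

def predict_loss(n_params):
    """Read the fitted line off at a scale we never trained."""
    return 10 ** (slope * np.log10(n_params) + intercept)

# Extrapolate 50x past the largest model actually trained.
print(predict_loss(1e10))
```

On real runs the points scatter, so the fit carries error bars; the paper's contribution is showing that the scatter is small and the line stays straight across many orders of magnitude.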
Three clean relationships
The paper establishes three independent power laws, each holding when the other two factors are not the bottleneck:
Loss vs. parameters (N): Train a large enough model on enough data, and the loss follows a smooth curve as you add parameters. The more parameters, the lower the loss — but each doubling buys you less than the last.
Loss vs. dataset size (D): Fix the model, vary the training data. More tokens means lower loss, again following a clean power law. But there is a ceiling: once the model has consumed enough data for its capacity, more data stops helping.
Loss vs. compute (C): Given a fixed compute budget, train the optimal-sized model for the optimal number of steps. The resulting loss follows a power law in the compute spent.
These three laws do not operate in isolation. They interact. The paper’s key insight is that they can be unified: performance is bottlenecked by whichever of the three is smallest relative to the others.
The ASCII picture
Fixed compute budget C
│
▼
┌───────────────────────────────┐
│ How to split C? │
│ │
│ N (parameters) │
│ × D (training tokens) │
│ ≈ C / 6 │
└───────────────────────────────┘
│
Two extreme choices:
│
┌───────┴──────────────────────┐
│ │
▼ ▼
Train small model Train large model
for many steps for few steps
(many D, small N) (small D, large N)
Both are suboptimal.
The paper finds the optimal point:
N ∝ C^0.73 (model size grows fast with compute)
D ∝ C^0.27 (data grows slowly with compute)
→ When compute doubles: most extra budget → bigger model
only a little → more data
The takeaway from the diagram: as you scale compute, spend most of it on a bigger model, not more training steps. This was counterintuitive in 2020 and directly contradicted how most practitioners trained.
The math, with every symbol translated
The three core equations are:

L(N) = (N_c / N)^α_N,  with α_N ≈ 0.076
L(D) = (D_c / D)^α_D,  with α_D ≈ 0.095
L(C_min) = (C_c / C_min)^α_C,  with α_C ≈ 0.050

What each piece means:
- L is the cross-entropy loss in nats, averaged over a 1024-token context. Lower is better.
- N is the number of non-embedding parameters. Embeddings are deliberately excluded — including them muddles the trend.
- D is the dataset size in tokens.
- C_min is the minimum compute (in PF-days, where 1 PF-day ≈ 8.64 × 10^19 floating point operations) to achieve a given loss when training optimally.
- N_c, D_c, and C_c are empirically fitted constants. Their absolute values depend on tokenization and vocabulary size and have no fundamental meaning — only the exponents matter.
- The exponents α_N, α_D, and α_C are the slopes of the log-log lines. They tell you how fast performance improves as you scale each axis.
For joint dependence on both N and D simultaneously, the paper gives a combined equation:

L(N, D) = [ (N_c / N)^(α_N / α_D) + D_c / D ]^α_D

This captures the interaction: a huge model with tiny data is bottlenecked by data, and vice versa.
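The combined equation can be implemented directly. A minimal sketch; the constants are taken from the paper's single-variable fits (the joint fit in the paper uses slightly different fitted constants), and their absolute values depend on the tokenizer:

```python
# Kaplan et al. combined loss: L(N, D) = [(N_c/N)^(a_N/a_D) + D_c/D]^a_D.
# Constants are illustrative, from the single-variable fits.
alpha_N, alpha_D = 0.076, 0.095
N_c, D_c = 8.8e13, 5.4e13

def loss(N, D):
    """Predicted cross-entropy loss (nats/token) for N params, D tokens."""
    return ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D

# A huge model starved of data is bottlenecked by the D term, and vice versa:
print(loss(1e9, 1e8))   # large model, tiny data: data term dominates
print(loss(1e6, 1e12))  # small model, huge data: model term dominates
print(loss(1e9, 1e12))  # both ample: lowest loss of the three
```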
Numeric walkthrough with real numbers
From the paper (Table 2 and surrounding equations): α_N ≈ 0.076, α_D ≈ 0.095, α_C ≈ 0.050.
What does 10x more parameters buy you?
A 10x increase in parameters multiplies the loss by 10^(−0.076) ≈ 0.84 — roughly a 16% reduction. That sounds modest, but loss is measured in nats — even small drops in cross-entropy loss translate to substantially more fluent, coherent text.
What does doubling parameters buy you?
Doubling parameters multiplies the loss by 2^(−0.076) ≈ 0.949 — a 5% reduction. Each successive doubling gives you another 5%.
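A quick check of this arithmetic, using the single-variable law L(N) ∝ N^(−α_N); the helper name is made up:

```python
# Loss ratio when parameters scale by a factor k: L(kN)/L(N) = k^(-alpha_N).
alpha_N = 0.076

def loss_ratio(k):
    return k ** (-alpha_N)

print(f"10x params: {1 - loss_ratio(10):.1%} lower loss")  # about 16%
print(f"2x params:  {1 - loss_ratio(2):.1%} lower loss")   # about 5%
```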
How does the compute split work?
The paper shows that the optimal model size and data size scale as N_opt ∝ C^0.73 and D_opt ∝ C^0.27.
So for a 10x increase in compute budget:
- Optimal model size grows by 10^0.73 ≈ 5.4x
- Optimal training tokens grow by 10^0.27 ≈ 1.9x
The ratio is roughly 3:1 in favor of model size. Train bigger models, not longer ones.
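The split itself in code. A sketch: `split_budget` is a hypothetical helper, and the exponents are the paper's fitted values.

```python
# Compute-optimal allocation per Kaplan et al.: N_opt ~ C^0.73, D_opt ~ C^0.27.
def split_budget(compute_factor):
    """Return (model-size growth, token-count growth) when the compute
    budget grows by compute_factor."""
    return compute_factor ** 0.73, compute_factor ** 0.27

n_growth, d_growth = split_budget(10)
print(f"model grows {n_growth:.1f}x, data grows {d_growth:.1f}x")
# 10x compute: roughly 5.4x more parameters, roughly 1.9x more tokens
```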
The 20-tokens-per-parameter heuristic:
The paper observes that models smaller than about 10^9 parameters can be trained on the full 22B-token WebText2 dataset without significant overfitting, and derives that avoiding an overfitting penalty requires roughly:

D ≳ (5 × 10^3) · N^0.74

For a 1 billion parameter model: D ≳ 5 × 10^3 × (10^9)^0.74 ≈ 2.3 × 10^10 tokens — roughly 23 billion tokens, or about 23 tokens per parameter. This lands close to the "20 tokens per parameter" rule of thumb, though that rule proper comes from Chinchilla (2022).
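The bound as a one-liner (a sketch; `min_tokens` is a made-up name):

```python
# Kaplan et al. overfitting-avoidance bound: D >= (5e3) * N^0.74 tokens.
def min_tokens(n_params):
    return 5e3 * n_params ** 0.74

d = min_tokens(1e9)
print(f"{d:.2e} tokens, about {d / 1e9:.0f} tokens per parameter")
```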
What is clever here
The insight that makes you stop and think: architectural details barely matter.
The paper ran experiments varying depth, width, number of attention heads, and feedforward dimension — all while holding total parameter count fixed. The result: “performance depends very weakly on other architectural hyperparameters such as depth vs. width.” A model with 6 layers and wide dimensions performs within 3% of a model with 48 layers and narrow dimensions, at the same parameter count.
This means the entire research program of finding the “best architecture” is mostly irrelevant at fixed scale. Scale dominates architecture. You cannot engineer your way past the power law by clever design.
The second clever thing: the laws hold across enormous ranges, with some trends spanning more than seven orders of magnitude. From thousands of parameters to billions, the same line fits. There is no sign of curvature at the top end — meaning you could trust the extrapolation to scales the authors never tested.
Direct quotes from the paper
The abstract states directly: “the loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.”
That seven orders of magnitude span is the key empirical fact. Most physical laws break down across such ranges. This one holds.
On the architecture finding: “Other architectural details such as network width or depth have minimal effects within a wide range.”
Translation: if you are spending time hunting for the perfect depth-to-width ratio, you are optimizing the wrong thing. The exponent on parameter count will outweigh any architectural gain.
On training efficiency: “Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.”
Translation: do not wait for convergence. A large model trained partway beats a small model trained to convergence with the same compute budget. This is the result that most surprised practitioners in 2020.
On overfitting: “every time we increase the model size 8x, we only need to increase the data by roughly 5x to avoid a penalty.”
Data requirements grow sublinearly in model size, so data does not become the bottleneck as quickly as people feared.
Part 3: Results and what breaks
What the paper reports
| Scaling axis | Exponent (α) | Range covered | What 10x gets you |
|---|---|---|---|
| Parameters (N) | 0.076 | 6 orders of magnitude | ~16% loss reduction |
| Dataset size (D) | 0.095 | 2+ orders of magnitude | ~20% loss reduction |
| Compute (C_min) | 0.050 | 8 orders of magnitude | ~11% loss reduction |
Across all these scales, the power-law fits hold with high precision. The training curves for models from thousands of parameters to over a billion all fall on the same predicted trajectory.
The paper also shows that generalization to out-of-distribution text (Books, Wikipedia, Common Crawl) improves smoothly with model size in direct parallel with in-distribution loss — with only “a small and very slowly growing offset from the WebText2 training distribution.”
What does not work
The laws measure cross-entropy loss, not downstream task performance. A 5% drop in loss does not translate to a 5% improvement on question answering. Emergent capabilities — things like multi-step reasoning, arithmetic, instruction following — can appear suddenly above a scale threshold rather than improving smoothly. The Emergent Abilities paper (2022) documents exactly this failure mode of smooth extrapolation.
The laws are Transformer-specific. The paper briefly compares LSTMs — they follow a worse trend, matching Transformers for tokens early in the context but plateauing for tokens appearing later in it. Whether the specific exponents transfer to other architectures is an open question.
The laws break down at very small scale, where the fit deteriorates. They also assume a minimum data supply — if you are training a 10B parameter model on 100M tokens, you are in a different regime entirely, one where overfitting dominates.
Data quality is not in the equation. The laws were measured on WebText2, a reasonably curated web corpus. Training on lower-quality data will shift the constants but the paper provides no guidance on how much.
Finally: the Kaplan et al. scaling laws were partially revised by Chinchilla (Hoffmann et al., 2022), which found that the original experiments confounded model size and data by keeping data too small. The Chinchilla correction changes the compute-optimal N:D ratio from roughly Kaplan’s skew-toward-N regime to a more balanced 1:20 (parameters:tokens). The qualitative insight — that smooth power laws govern performance — survived. The specific allocation advice did not.
Part 4: So what?
If you are building ML systems
Use small-scale experiments to predict large-scale results. The power law means you can train 10M, 50M, and 200M parameter models, measure the loss at each, fit the line, and extrapolate to 10B with reasonable confidence. This converts a single massive training run into a calibration exercise first.
When allocating a fixed compute budget: spend more on model size than training steps. If you have twice the compute, the correct move is to train a roughly 1.7x larger model (since 2^0.73 ≈ 1.66) for only a modest increase in training tokens. Do not simply train the same model longer — that is compute-inefficient.
Stop before convergence. This is the operationally hardest result to accept. Your intuition says to train until the loss plateaus. The scaling laws say that compute is better spent on a larger model trained for fewer steps than a smaller model trained to convergence. The first model that “converges” is the one that wasted your budget.
When evaluating whether to scale: if you are below the compute threshold where emergent capabilities are expected, do not expect smooth gains on evaluations that test those capabilities. The power law predicts loss. It does not predict task performance on evaluations that have phase transitions.
Connections to other work
The Emergent Abilities paper (Wei et al., 2022) builds directly on the scaling intuition here but documents where smooth extrapolation breaks. The Kaplan laws predict smooth loss reduction — but individual capabilities can cross a threshold and jump discontinuously. These two papers are in tension and both right: loss scales smoothly, task performance sometimes does not.
The Adam optimizer underpins essentially all of the scaling law experiments. The paper notes it explicitly: “Unless otherwise noted, we train models with the Adam optimizer.” The scaling results are measured using Adam’s adaptive learning rates — a different optimizer would shift the constants but likely preserve the exponents.
The Chinchilla laws (2022) directly corrected this paper’s compute-optimal allocation advice. If you have heard “20 tokens per parameter,” that is the Chinchilla update — not Kaplan. Kaplan’s exponents suggested training larger, less-data-hungry models; Chinchilla’s more careful joint ablation found that equal scaling of N and D is optimal.
One-liner
If you can draw a straight line on a log-log plot at small scale, you can read off the loss for any model size you will ever train — and that line is the most expensive thing anyone has ever confirmed.
Connections
- Emergent Abilities of LLMs — directly builds on the scaling intuitions here, then shows where smooth extrapolation breaks
- Adam — the optimizer used in the scaling experiments
- scaling laws — the central concept this paper establishes
- compute-optimal training — the practical output: how to allocate a fixed compute budget
- power laws — the mathematical form the scaling relationships take
- emergent behavior — capabilities that violate smooth extrapolation from loss curves
Citation
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.