Concepts: scaling-laws | rlhf | alignment | multimodal-embeddings | emergent-behavior Builds on: training-language-models-to-follow-instructions-with-human-feedback | learning-to-summarize-human-feedback Leads to: llama-2-open-foundation-fine-tuned-chat-models | deepseek-r1-reasoning-via-reinforcement-learning

The GPT-4 Technical Report is notable for what it reveals and, equally, what it deliberately withholds. OpenAI discloses no architecture details, no model size, no training compute — citing competitive and safety reasons. What the paper does give you is a story about predictable scaling, a thorough capability evaluation across professional exams, and an unusually candid description of the RLHF-based safety pipeline. The core claim: GPT-4’s performance was accurately predicted from models trained with 1/10,000th the compute, before the full training run finished.

The core idea

Let’s start with the problem motivating Section 3.

Training GPT-4 costs hundreds of millions of dollars and months of calendar time. Before GPT-4, the standard approach to large training runs was: commit, train, and hope. You’d explore hyperparameters and architectures at small scale, but the jump to full scale always carried uncertainty — what works at 1B parameters might not work at 100B. Debugging a failed training run at GPT-4 scale is catastrophically expensive.

The fix: build infrastructure that scales predictably. If you can accurately forecast a large model’s performance from small cheap experiments, you can invest weeks in low-cost experimentation before committing millions to the full run.

The analogy: Think about predicting adult height from childhood growth charts. At age 8, a doctor can look at your growth curve — how you’ve been tracking over the past few years — and reasonably predict your adult height without waiting for you to turn 20. The relationship between childhood growth rate and adult height follows a consistent curve. GPT-4’s predictable scaling is the same idea: the relationship between training compute and model quality follows a power law, and that curve can be fit from cheap small-scale data points and extrapolated to full scale.

The scaling law:

OpenAI fit a scaling law of the form:

  L(C) = a·C^b + c

Where:

  • L(C) is the final validation loss (lower is better)
  • C is training compute (normalized so GPT-4 = 1)
  • a and b are fit from smaller runs using the same methodology
  • c is the irreducible loss — the floor you approach as C → ∞, representing the inherent entropy of the data that no model can predict perfectly

They trained a series of smaller models (using at most 1/10,000th of GPT-4’s compute), fit this curve, then extrapolated to C = 1. The prediction matched GPT-4’s actual final loss with high accuracy.

For capability prediction (HumanEval pass rates), they used a related power law:

  -E_P[log(pass_rate)] = α · C^(-k)

where P is a subset of HumanEval problems, and α, k are constants fit from smaller models. The mean log pass rate follows a power law in compute — so you can predict how many coding problems a model will solve before training it.

PREDICTABLE SCALING WORKFLOW:

Small experiments (1/10,000th compute):
  Run A: C=0.0001 → Loss=3.8
  Run B: C=0.0003 → Loss=3.4
  Run C: C=0.001  → Loss=3.0
  Run D: C=0.01   → Loss=2.7
         |
         | fit: L(C) = a·C^b + c
         | (least-squares on log-log scale)
         v
  Extrapolate to C=1.0 (full GPT-4 compute)
         |
         v
  Predicted: GPT-4 loss ≈ 2.45
         |
  [Commit. Full training run completes months later.]
         v
  Actual: GPT-4 loss ≈ 2.45 ✓

Requirement: same architecture family, same data mix,
same optimization method across all scales.
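The workflow above can be sketched in a few lines of Python. The compute/loss points are the made-up numbers from the diagram, and the grid search over the irreducible term c is one simple way to fit the three-parameter curve — a sketch, not OpenAI's actual fitting procedure:

```python
import numpy as np

# Illustrative (compute, loss) points from the workflow diagram above.
C = np.array([0.0001, 0.0003, 0.001, 0.01])   # normalized compute
L = np.array([3.8, 3.4, 3.0, 2.7])            # final validation loss

def fit_power_law(C, L):
    """Fit L(C) = a * C**b + c by grid-searching the irreducible loss c,
    then solving for (a, b) with a least-squares line in log-log space."""
    best = None
    for c in np.linspace(0.0, L.min() - 1e-3, 2000):
        slope, intercept = np.polyfit(np.log(C), np.log(L - c), 1)
        a, b = np.exp(intercept), slope
        resid = np.sum((a * C**b + c - L) ** 2)   # error in original space
        if best is None or resid < best[0]:
            best = (resid, a, b, c)
    _, a, b, c = best
    return a, b, c

a, b, c = fit_power_law(C, L)
pred_full_scale = a * 1.0**b + c   # extrapolate to C = 1 (full GPT-4 compute)
print(f"L(C) ≈ {a:.3f}·C^{b:.3f} + {c:.3f}; predicted loss at C=1: {pred_full_scale:.2f}")
```

With these toy points the extrapolation lands near the diagram's ≈2.45; the key design choice is measuring the residual in original loss space, so the fit isn't dominated by the points closest to the floor.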

“A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000× – 10,000× less compute.”

What’s clever:

The key instinct is a reframing of what “testing” means at scale. Normally you test a hypothesis by running it. OpenAI’s bet: if scaling laws hold, you can test your methodology cheaply, then commit to scale knowing what you’ll get. This separates the methodology decision (architecture, data mix, optimizer) from the scale decision (how much compute). Validate methodology for thousands of dollars; then commit millions knowing the outcome.

The irreducible loss term is also non-obvious. Without it, a two-parameter power law fits poorly at large scales. Adding c captures the fact that there’s a floor to how low validation loss can go — some irreducible entropy in natural language. Ignoring c leads to systematically optimistic predictions for large compute.

The RLHF safety pipeline:

GPT-4’s architecture is secret, but its alignment pipeline is reasonably transparent.

  1. Pre-train on internet-scale text (Transformer, next-token prediction)
  2. Supervised fine-tuning (SFT) on human-written demonstrations
  3. RLHF: collect preference comparisons → train reward model → optimize policy with PPO
  4. New: Rule-Based Reward Models (RBRMs)

RBRMs are GPT-4 itself used as a zero-shot classifier during fine-tuning. Given a prompt, the model’s response, and a human-written rubric, the RBRM classifies the output as: (a) a refusal in the desired style, (b) a refusal in an undesired style, (c) disallowed content, or (d) a safe non-refusal.

“Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. These classifiers provide an additional reward signal to the GPT-4 policy model during RLHF fine-tuning that targets correct behavior, such as refusing to generate harmful content or not refusing innocuous requests.”

The rubric is the alignment spec. Instead of learning a reward model from preference data alone, you express desired behavior as a written rubric and let the model evaluate itself against it. This is cheaper, more auditable, and more controllable than a learned reward model.
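A minimal sketch of how such a rubric-based reward could plug into an RLHF loop. The rubric wording, reward values, and the `call_model` hook are illustrative assumptions, not OpenAI's actual rubric or API:

```python
# Hypothetical RBRM-style rubric, following the four classes described above.
RUBRIC = """Classify the assistant response to the prompt as exactly one of:
(a) a refusal in the desired style
(b) a refusal in an undesired style
(c) disallowed content
(d) a safe non-refusal
Answer with the single letter only."""

def rbrm_reward(prompt: str, response: str, call_model) -> float:
    """Run the zero-shot classifier and map its letter to an extra
    reward signal for the policy (values here are made up)."""
    classifier_input = f"{RUBRIC}\n\nPrompt: {prompt}\nResponse: {response}"
    letter = call_model(classifier_input).strip().lower().strip("()")[:1]
    # Reward desired refusals and safe answers; penalize the rest.
    return {"a": 1.0, "d": 1.0, "b": -0.5, "c": -1.0}.get(letter, 0.0)

# Usage with a stubbed classifier that always answers "(a)":
print(rbrm_reward("how do I pick a lock?", "I can't help with that.",
                  lambda _: "(a)"))   # → 1.0
```

In a real pipeline `call_model` would be the classifier model itself, and this score would be added to the learned reward model's output during PPO.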

Walkthrough — predicting a capability from compute:

Suppose we have three small models with pass rates on a subset of HumanEval:

Model   Compute C (normalized)   mean log pass-rate
  A          0.001                   -4.2
  B          0.01                    -2.8
  C          0.1                     -1.6

Fit: -E[log(pass_rate)] = α · C^(-k)

Taking logs: log(-E[log(pass_rate)]) = log(α) - k · log(C)

Using A and C (base-10 logs throughout — mixing bases here silently corrupts k):
  log(4.2) = log(α) - k · log(0.001)   →  0.623 = log(α) + 3k
  log(1.6) = log(α) - k · log(0.1)     →  0.204 = log(α) + k

Subtracting: 0.419 = 2k  →  k ≈ 0.210
Then: log(α) = 0.204 - 0.210 = -0.006  →  α ≈ 0.987

Prediction at C=1.0 (full GPT-4):
  -E[log(pass_rate)] = 0.987 · 1.0^(-0.210) = 0.987
  → mean log pass-rate ≈ -0.987
  → pass-rate ≈ e^(-0.987) ≈ 37%

(Real GPT-4 0-shot HumanEval: 67% — the paper's actual fit uses more data points and a tighter compute range, and predicts on a specific problem subset, giving much more accurate predictions.)
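The hand calculation can be replicated in code with the same illustrative numbers. Note that the logs must all use one base; natural logs are used here, and k comes out the same because the exponent is base-independent:

```python
import math

# Illustrative numbers from the walkthrough table (models A, B, C).
C = [0.001, 0.01, 0.1]                 # normalized compute
neg_mean_log_pass = [4.2, 2.8, 1.6]    # -E[log(pass_rate)]

# Fit -E[log(pass_rate)] = alpha * C**(-k) from the two endpoints A and C.
# Taking logs: log(value) = log(alpha) - k*log(C); two equations, two unknowns.
lA, lC = math.log(neg_mean_log_pass[0]), math.log(neg_mean_log_pass[2])
cA, cC = math.log(C[0]), math.log(C[2])
k = (lA - lC) / (cC - cA)
alpha = math.exp(lA + k * cA)

pred = alpha * 1.0 ** (-k)             # extrapolate to full compute, C = 1
pass_rate = math.exp(-pred)
print(f"k = {k:.3f}, alpha = {alpha:.3f}, predicted pass rate ≈ {pass_rate:.0%}")
```

A real fit would least-squares over all points (and many more of them) rather than solving from two endpoints; this is just the walkthrough's arithmetic made executable.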

Results

Benchmark               GPT-4                  GPT-3.5                Note
Uniform Bar Exam        298/400 (~90th %ile)   213/400 (~10th %ile)   Simulated exam conditions
LSAT                    163 (~88th %ile)       149 (~40th %ile)
GRE Verbal              169/170 (~99th %ile)   154/170 (~63rd %ile)
SAT Math                700/800 (~89th %ile)   590/800 (~70th %ile)
MMLU (5-shot)           86.4%                  70.0%                  57 academic subjects
HumanEval (0-shot)      67.0%                  48.1%                  Python coding
GSM-8K (5-shot CoT)     92.0%                  57.1%                  Grade-school math
HellaSwag (10-shot)     95.3%                  85.5%                  Commonsense reasoning
Toxic generations       0.73%                  6.48%                  RealToxicityPrompts, lower is better

On real user prompts: GPT-4 responses preferred over GPT-3.5 in 70.2% of 5,214 head-to-head comparisons. Factuality (internal adversarial evals): GPT-4 scores 19 percentage points higher than the latest GPT-3.5. Disallowed content generation: 82% reduction vs GPT-3.5.

What doesn’t work:

“GPT-4 still is not fully reliable (it ‘hallucinates’ facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts.”

More subtle: RLHF hurts calibration. The base GPT-4 model is well-calibrated — its expressed confidence tracks actual accuracy closely. Post-RLHF, calibration degrades. The model becomes overconfident after alignment fine-tuning. For applications where “I’m not sure” matters, the base model is preferable to the chat model.

The architecture opacity is also a limitation for the research community. We cannot replicate GPT-4, cannot study what specifically changed from GPT-3.5, and cannot attribute the capability jump to any specific design choice.

Practical implications

If you’re scaling a model, build the scaling law first. Run a sweep at 3–5 scales spanning 2–3 orders of magnitude of compute. Fit L(C) = a·C^b + c. If the curve isn’t smooth — if the fit is poor — your training methodology has a problem that scale will not fix. A predictable scaling curve is evidence your setup is “scale-ready.”

The RBRM approach generalizes. Using your model as a classifier against a human-written rubric is cheaper than training a separate reward model and more interpretable — the rubric IS the alignment spec, and you can audit it directly. Any RLHF application where you can write a rubric (refusals, format constraints, factuality) can potentially use LLM-as-judge instead of trained preference models.

The calibration finding is a warning for production systems. Measure calibration separately on your base model and RLHF-tuned model before deployment. In medical, legal, or financial settings, a model that’s overconfident when wrong is more dangerous than one that hedges appropriately.
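One standard way to quantify that overconfidence is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence against its actual accuracy. A compact sketch, with an illustrative binning scheme and toy data:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: weighted average, over equal-width
    confidence bins, of |mean confidence - mean accuracy| per bin."""
    confidences, correct = np.asarray(confidences, float), np.asarray(correct, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap   # weight by fraction of samples in bin
    return total

# Well-calibrated: 80% confidence, 8/10 correct → zero gap in that bin.
print(round(ece([0.8] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]), 3))   # → 0.0
# Overconfident: 90% confidence, always wrong → large ECE.
print(round(ece([0.9] * 10, [0] * 10), 3))                         # → 0.9
```

Running this separately on base-model and RLHF-model predictions gives a concrete number for the calibration regression described above.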

GPT-4 connects backward to InstructGPT (training-language-models-to-follow-instructions-with-human-feedback) and the seminal RLHF summarization paper (learning-to-summarize-human-feedback) — it’s essentially the same pipeline, scaled with RBRM augmentation. Forward, it influenced every open-source replication attempt: Llama 2 (llama-2-open-foundation-fine-tuned-chat-models) used similar RLHF methodology to close the gap.

A model whose performance can be predicted from pocket-change experiments — that’s the contribution scaling law research had been building toward.

Citation

arXiv:2303.08774

OpenAI (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774