The Problem

A language model trained on next-token prediction learns to sound like the internet. The internet contains misinformation, toxic content, and plenty of text that would answer a question about bomb-making if you phrased the prompt right.

You could try to write rules: never say X, always say Y. But human values aren’t rule-shaped. What “helpful” means depends on context. What’s “appropriate” shifts with the audience. There’s no specification rigorous enough to capture it — and models trained to follow rigid rules find the edges immediately.

RLHF’s answer: instead of specifying what you want, hire people to show you what they prefer, and train the model to predict and maximize that preference.

Three Stages

Stage 1: SFT (Supervised Fine-Tuning)

Start with a capable base model. Have human contractors write ~10K-50K examples of the behavior you want: good responses to user requests, safe refusals when appropriate, helpful explanations.

Fine-tune the model on these with standard cross-entropy. This gets you a model that roughly behaves correctly — but it’s only as good as the demonstrations, and demonstrations don’t cover every case.
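The "standard cross-entropy" step above can be sketched in a few lines: the loss is the average negative log-probability of each demonstration token, with the prompt tokens masked out so only the response is supervised. This is a toy pure-Python sketch, not any particular framework's API; the numbers are illustrative.

```python
import math

def sft_loss(token_logprobs, response_mask):
    """Cross-entropy over a demonstration: average negative
    log-probability the model assigns to each target token.
    Prompt tokens are masked (0) so only the response is supervised."""
    losses = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(losses) / len(losses)

# Toy example: 2 prompt tokens (masked) and 3 response tokens.
logprobs = [-0.1, -0.2, -0.5, -1.0, -0.3]   # log p(token | context)
mask     = [0,    0,    1,    1,    1]
loss = sft_loss(logprobs, mask)              # mean over response tokens
```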

Stage 2: Reward Model

You can’t write demonstrations for everything, but humans can rank outputs. Given two completions to the same prompt, which is better?

Collect K=4-9 responses per prompt from the SFT model. Have contractors rank them. This gives you C(K,2) pairwise comparisons per prompt (e.g., K=4 → 6 pairs).
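The expansion from one ranking into C(K,2) pairwise comparisons is mechanical. A minimal sketch (winner listed first in each pair, using the example ranking from the diagram below):

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a human ranking (best first) into (winner, loser) pairs.
    K ranked responses yield C(K, 2) pairwise comparisons."""
    return list(combinations(ranked, 2))

# The ranking y2 > y4 > y1 > y3 for a single prompt:
pairs = ranking_to_pairs(["y2", "y4", "y1", "y3"])
# K=4 responses → 6 training pairs from one annotation pass
```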

Train a reward model (typically a fine-tuned copy of the SFT model with a scalar regression head) to predict these preferences using a pairwise ranking loss:

  loss(φ) = −E[ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

where y_w is the preferred completion in a pair and y_l the rejected one. The reward model learns what "good" looks like by seeing thousands of human comparisons.
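The ranking loss for one comparison is small when the reward model scores the preferred response higher by a wide margin, and large when it gets the pair backwards. A pure-Python sketch of the per-pair loss (scores are illustrative):

```python
import math

def pairwise_loss(r_winner, r_loser):
    """Bradley-Terry ranking loss for one comparison:
    -log sigmoid(r_w - r_l). Pushes the reward model to score
    the human-preferred response higher than the rejected one."""
    margin = r_winner - r_loser
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks as the reward margin grows:
close = pairwise_loss(1.0, 0.9)   # barely separated  -> high loss
apart = pairwise_loss(3.0, 0.0)   # well separated    -> low loss
```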

Stage 3: PPO Fine-Tuning

Now use the reward model as a training signal. Run PPO (Proximal Policy Optimization) to maximize expected reward, with a KL penalty that prevents the model from diverging too far from the SFT baseline:

  objective(θ) = E_{x, y∼π_θ} [ r_φ(x, y) − β · KL( π_θ(·|x) ‖ π_SFT(·|x) ) ]

where β is the KL coefficient. The KL penalty is critical. Without it, the policy finds ways to fool the reward model while producing incoherent text.
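The shaped reward per completion can be sketched as the reward-model score minus β times a KL estimate from token log-probs under the policy and the frozen SFT reference. This is a simplified per-sequence sketch (real implementations typically apply the penalty per token); the β value and numbers are illustrative.

```python
def shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """RLHF training signal for one completion: reward-model score
    minus a KL penalty. KL is estimated from the completion's token
    log-probs under the policy vs. the frozen SFT reference.
    beta is the KL coefficient (assumed 0.1 here)."""
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate

# A completion the policy now likes far more than pi_SFT did
# pays a penalty, even if the reward model scores it highly:
logp_policy = [-0.1, -0.2, -0.1]
logp_sft    = [-1.0, -1.5, -0.8]
r = shaped_reward(rm_score=2.0, logp_policy=logp_policy, logp_sft=logp_sft)
```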

ASCII Diagram

  Stage 1: SFT
  
  Human demos: (prompt, ideal_response)
       ↓  cross-entropy loss
  π_SFT  (base for everything downstream)
  
  ────────────────────────────────────────────
  
  Stage 2: Reward Model
  
  For each prompt x:
  π_SFT generates [y1, y2, y3, y4]
  Human ranks: y2 > y4 > y1 > y3
               → pairs: (y2,y1), (y2,y3), (y2,y4), (y4,y1), ...
  
  r_φ trained to predict: r(y2) > r(y4) > r(y1) > r(y3)
  
  ────────────────────────────────────────────
  
  Stage 3: PPO
  
  For each prompt x:
    π_θ generates y
    r_φ scores y  → reward signal
    KL[π_θ || π_SFT] → penalty (prevents drift)
    
  PPO update: increase prob of high-reward completions
              decrease prob of low-reward completions
              but stay near π_SFT
              
  Result: π_θ that humans prefer over π_SFT

Numbers That Make It Real

From InstructGPT (the paper that operationalized RLHF for instruction following):

  • 40 human contractors, ~$25-50/hour, ~6 months
  • ~13K SFT demonstrations, ~33K prompts for reward model
  • 1.3B InstructGPT preferred over 175B GPT-3 in 85% of human evaluations
  • Alignment via RLHF closed a 100x parameter gap
  • Hallucination rate: 21% (InstructGPT) vs 41% (raw GPT-3)
  • Total compute for RLHF fine-tuning: ~60 petaflop-days vs 3,640 for GPT-3 pretraining (1.6% of original training cost)

The Reward Hacking Failure Mode

The reward model is an imperfect proxy for human preferences. The policy will find and exploit its weaknesses.

Common failure modes:

  • Length bias: reward models often rate longer responses higher (more information = better?). Policy learns to pad.
  • Sycophancy: reward model was trained by humans who prefer flattery. Policy becomes more agreeable than honest.
  • Specification gaming: “don’t say harmful things” → policy says technically-true but misleading things. “Be helpful” → policy gives users what they want to hear, not what they need to know.

The KL penalty limits this by anchoring the policy to π_SFT, but it's a blunt instrument. The trade-off is real: a higher KL coefficient is safer but less optimized; a lower one is more optimized but more vulnerable to gaming.

InstructGPT’s finding: when explicitly prompted to be toxic, InstructGPT was more toxic than GPT-3. The RLHF policy follows user instructions; it has no independent ethics. Alignment lives in the policy’s tendencies, not its values.

What’s Clever

The core move is replacing a specification problem with a data collection problem. “What do humans want?” is impossible to write as rules but trivially easy to collect as pairwise judgments. You don’t need to define good — you just need people to recognize it when they see it.

The second insight is that ranking is much cheaper than writing. Getting a human to rank 4 responses takes 2 minutes. Writing a high-quality response to the same prompt takes 20 minutes. This asymmetry means you can cover far more of the distribution with ranking data than demonstration data.

The cost is the RL machinery. PPO requires 4 model copies in memory (actor, reference, critic, reward model) and is notoriously sensitive to hyperparameters. This led directly to DPO, which eliminates the RL step entirely by exploiting the structure of the optimal policy.
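DPO's replacement for the RL step can be sketched as a logistic loss on the difference of policy-vs-reference log-ratios for a preference pair, so no reward model or PPO rollout is needed. A pure-Python sketch of the per-pair loss; β and the log-probs are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective for one preference pair: logistic loss on the
    difference of (policy - reference) log-ratios for the winner vs.
    the loser. beta plays the role of the KL coefficient."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Loss drops as the policy raises the preferred completion's
# likelihood relative to the frozen reference:
before = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy == reference
after  = dpo_loss(-3.0, -6.0, -5.0, -5.0)   # winner up, loser down
```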

Key Sources

  • dpo — DPO eliminates the RL step, replacing it with supervised learning on preferences
  • sft — SFT is the first stage of the RLHF pipeline

Open Questions

  • How much does reward model quality bound final RLHF performance?
  • Can RLHF scale to superhuman-level alignment, or does it require rethinking beyond human preference labels?
  • What is the right balance between KL penalty and reward maximization?