What It Is

A reward model (RM) is a neural network trained to predict human preference scores for model outputs. In RLHF, it serves as a proxy for the human evaluator — querying humans at RL training speed (millions of rollouts) is impossible, so you train a model to emulate their judgments. The RM transforms the slow, expensive signal of human preference into a fast, cheap scalar that the policy optimizer can query at every gradient step.

Why It Matters

The reward model is the linchpin of RLHF: it’s what makes human preferences legible to a gradient-based optimizer. Without it, the only way to align a language model to human preferences would be direct human-in-the-loop RL, which would require a human to evaluate every sampled token sequence across tens of thousands of optimization steps. The reward model compresses that signal into weights, making optimization tractable.

Its failure modes translate directly into alignment failures: a reward model that is gamed or miscalibrated produces a policy that looks aligned but is not. Understanding the RM’s limitations is a prerequisite to understanding why RLHF-aligned models are not fully aligned.

How It Works

Architecture

The reward model is typically initialized from the SFT model (same pretrained weights), with the final layer (the unembedding layer that maps to vocabulary) replaced by a single scalar head:

SFT model:
  [transformer layers] → hidden state (d_model) → linear (d_model → vocab_size) → token logits

Reward model:
  [transformer layers] → hidden state (d_model) → linear (d_model → 1) → scalar reward r

This initialization is deliberate: the SFT model already understands natural language and response quality. Training the RM from this initialization requires far less data than training from scratch, and the early training steps refine an already reasonable starting point.
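The head swap above can be sketched in a few lines of NumPy (toy dimensions, purely illustrative; real models use d_model in the thousands and vocabularies in the tens of thousands):

```python
import numpy as np

d_model, vocab_size = 16, 100  # toy sizes, not real model dimensions
rng = np.random.default_rng(0)

# Final hidden state for the last token of (prompt, response)
hidden = rng.standard_normal(d_model)

# SFT head: unembedding matrix, d_model -> vocab_size token logits
W_unembed = rng.standard_normal((vocab_size, d_model))
token_logits = W_unembed @ hidden          # shape (vocab_size,)

# RM head: same transformer trunk, unembedding replaced by a
# single scalar projection, d_model -> 1
w_reward = rng.standard_normal(d_model)
reward = float(w_reward @ hidden)          # one scalar r_phi(x, y)
```

Everything below the head is shared with the SFT initialization; only the final projection changes shape.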

InstructGPT uses a 6B parameter RM (separate from the 175B policy), balancing RM quality against inference cost during PPO rollouts.

Training: Pairwise Ranking Loss

Collect pairwise comparisons: for the same prompt, show a human two responses (y_w = preferred/winner, y_l = dispreferred/loser). Train the RM to assign higher scalar scores to preferred responses.

Training objective (Bradley-Terry model of preference):

  L(φ) = -E_{(x, y_w, y_l) ~ D} [ log σ( r_φ(x, y_w) - r_φ(x, y_l) ) ]

Where:

  • r_φ(x, y) — RM score for prompt x and response y
  • σ(·) — sigmoid function
  • D — dataset of human preference triples (prompt, winner, loser)

Intuition: the loss is minimized when r_φ(x, y_w) >> r_φ(x, y_l) — the RM assigns a much higher score to the winner. σ(a - b) is close to 1 when a >> b, so the loss -log σ(·) is close to 0. When the RM gets it wrong (the winner scores lower), σ is near 0 and -log σ(·) is large.
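The per-pair loss can be written out directly (pure Python; r_w and r_l stand for the RM’s scalar scores on the winner and loser):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_ranking_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry pairwise loss: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

# RM strongly prefers the winner: loss is near 0
print(pairwise_ranking_loss(5.0, -5.0))
# RM ranks the pair backwards: loss is large
print(pairwise_ranking_loss(-5.0, 5.0))
# RM is indifferent: loss is exactly log 2
print(pairwise_ranking_loss(1.0, 1.0))
```

Note the loss depends only on the score difference, so the RM’s absolute scale is unconstrained by this objective.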

Data Collection

InstructGPT’s RM training process:

  • For each prompt, the SFT model generates K = 4-9 responses
  • A human contractor ranks all K responses (yielding C(K,2) pairwise comparisons per prompt — up to 36 pairs for K=9)
  • ~33,000 prompts → up to ~1.2M comparison pairs (one ranking session per prompt, up to 36 pairs each)
  • Inter-labeler agreement: ~73% (not perfect, but sufficient for RM training)

Batch training: all C(K,2) comparisons from a single prompt are included in the same batch to reduce overfitting to specific comparison pairs.
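Expanding one labeler’s ranking into its C(K,2) training pairs is mechanical; a sketch (response names are illustrative):

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking of K responses into the C(K, 2)
    (winner, loser) pairs used for RM training."""
    return list(combinations(ranked, 2))  # order preserved: earlier item wins

pairs = ranking_to_pairs([f"resp_{i}" for i in range(9)])  # K = 9
print(len(pairs))  # 36, matching the up-to-36-pairs figure above
```

Because `combinations` preserves input order, each emitted pair is already (winner, loser) when the input is sorted best to worst.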

Using the RM in PPO

During RL fine-tuning:

  1. Policy generates a complete response to a prompt
  2. RM scores the full response: r = r_φ(prompt, response)
  3. KL penalty subtracted: R = r - β · KL(π_θ || π_sft)
  4. PPO uses R as the reward signal to update the policy

The RM is queried at the end of each response (sequence-level reward), not at each token. This is a design choice — token-level rewards would require the RM to score partial sequences, which is harder to train. The consequence is that PPO must attribute the sequence-level reward back to individual tokens via the critic network.
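The reward handed to PPO in steps 2–3 might be computed as follows (a sketch: the per-token log-prob KL estimator and the beta value are illustrative assumptions, not details from the source):

```python
import math

def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """R = r - beta * KL(pi_theta || pi_sft), with the KL estimated as the
    sum over response tokens of log pi_theta(token) - log pi_sft(token)."""
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Policy identical to SFT: zero penalty, R equals the raw RM score
same = kl_penalized_reward(2.0, [-1.0, -2.0], [-1.0, -2.0])
# Policy drifted toward tokens the SFT model finds unlikely: R is reduced
drifted = kl_penalized_reward(2.0, [-0.1, -0.2], [-3.0, -4.0])
print(same, drifted)
```

The penalty grows with divergence from the SFT model, which is exactly the tension discussed below: beta trades improvement headroom against hacking risk.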

The Failure Mode: Reward Hacking

The RM is a proxy. The policy is optimized to maximize the proxy, not the true objective. At some point, the policy finds outputs that score high on the RM without being genuinely high-quality — because the RM has finite capacity and can be fooled.

Concrete failure modes:

  • Length hacking: RM rates longer responses higher (humans often do too); policy generates unnecessarily verbose outputs
  • Format hacking: RM associates certain formatting (bullet points, headers) with quality; policy overuses structure regardless of appropriateness
  • Sycophancy: RM rates confident, agreeable responses highly; policy produces confident-sounding wrong answers
  • Adversarial gibberish: At extreme RL steps, policy outputs near-gibberish that somehow triggers high RM scores

The KL penalty mitigates this by keeping the policy close to the SFT model (which doesn’t yet know how to hack the RM). But it’s a tension: too much KL penalty → policy doesn’t improve; too little → reward hacking.

The “overoptimization” curve: as the number of PPO steps increases, RM score increases monotonically, but true human preference score peaks and then decreases. The gap between RM score and true quality is the reward hacking regime.

Quality
  │         ← RM score (always increases with PPO steps)
  │    ████████████████████████████
  │  ████████           ← True quality (peaks, then declines)
  │ ████ ████████████
  │███              ██████████████
  └──────────────────────────────────── PPO steps
                    ↑
             Reward hacking begins
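A toy numerical version of this curve (the functional forms are invented purely for illustration; only the qualitative shape — proxy rising monotonically while true quality peaks — is the point):

```python
import math

def rm_score(t):
    """Proxy reward: keeps rising with PPO steps t."""
    return math.log1p(t)

def true_quality(t):
    """True quality: tracks the proxy early, then the hacking gap dominates."""
    return math.log1p(t) - 0.01 * t

steps = list(range(0, 500, 10))
best = max(steps, key=true_quality)
print(best)  # interior peak: true quality declines past this point
print(rm_score(steps[-1]) > rm_score(best))  # proxy still rising: True
```

Past the peak, every additional PPO step widens the gap between what the RM reports and what humans would actually prefer.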

What’s Clever

Pairwise ranking is far easier for humans than absolute scoring. “Which response is better?” is a simpler judgment than “Score this response 1-10.” This makes data collection cheap and consistent — the Bradley-Terry model is well-studied and handles partial rankings efficiently.

The initialization from SFT is non-obvious: you might expect a randomly initialized RM to be cleaner. But initializing from the SFT model means the RM already knows what good language looks like; it only needs to learn the preference-specific signal on top.

Key Sources

  • rlhf — the pipeline in which the reward model plays the central role
  • ppo — the optimizer that uses reward model scores to update the policy
  • alignment — reward hacking is the core alignment failure mode the RM creates
  • sft — RM is initialized from the SFT model; SFT quality sets the RM’s ceiling
  • dpo — DPO eliminates the separate reward model; implicitly models the reward within the policy optimization loss

Open Questions

  • How do you measure reward model quality independent of downstream PPO results?
  • Can process-based reward models (scoring reasoning steps, not just final answers) reduce reward hacking for reasoning tasks?
  • What is the optimal RM size relative to policy size — does a larger RM always produce a better-aligned policy?
  • How to handle labeler disagreement: learn the average preference, model the distribution, or defer to specific annotators?