What It Is

A reward model (RM) is a neural network trained to predict human preference scores for model outputs. In RLHF, it serves as a proxy for the human evaluator — querying humans at RL training speed (millions of rollouts) is impossible, so you train a model to emulate their judgments. The RM transforms the slow, expensive signal of human preference into a fast, cheap scalar that the policy optimizer can query at every gradient step.

Why It Matters

The reward model is the linchpin of RLHF: it’s what makes human preferences legible to a gradient-based optimizer. Without it, the only way to align a language model to human preferences would be direct human-in-the-loop RL, which would require a human to evaluate every sampled token sequence across tens of thousands of optimization steps. The reward model compresses that signal into weights, making optimization tractable.

Its failure modes translate directly into alignment failures: a reward model that is gamed or miscalibrated produces a policy that looks aligned but is not. Understanding the RM’s limitations is a prerequisite to understanding why RLHF-aligned models are not fully aligned.

How It Works

Architecture

The reward model is typically initialized from the SFT model (same pretrained weights), with the final layer (the unembedding layer that maps to vocabulary) replaced by a single scalar head:

SFT model:
  [transformer layers] → hidden state (d_model) → linear (d_model → vocab_size) → token logits

Reward model:
  [transformer layers] → hidden state (d_model) → linear (d_model → 1) → scalar reward r

This initialization is deliberate: the SFT model already understands natural language and response quality. Training the RM from this initialization requires far less data than training from scratch, and the early training steps refine an already reasonable starting point.
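The head swap above can be sketched in a few lines of NumPy (toy dimensions, purely illustrative; real models use d_model in the thousands and vocabularies in the tens of thousands):

```python
import numpy as np

d_model, vocab_size = 16, 100  # toy sizes, not real model dimensions
rng = np.random.default_rng(0)

# Final hidden state for the last token of (prompt, response)
hidden = rng.standard_normal(d_model)

# SFT head: unembedding matrix, d_model -> vocab_size token logits
W_unembed = rng.standard_normal((vocab_size, d_model))
token_logits = W_unembed @ hidden          # shape (vocab_size,)

# RM head: same transformer trunk, unembedding replaced by a
# single scalar projection, d_model -> 1
w_reward = rng.standard_normal(d_model)
reward = float(w_reward @ hidden)          # one scalar r_phi(x, y)
```

Everything below the head is shared with the SFT initialization; only the final projection changes shape.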

InstructGPT uses a 6B parameter RM (separate from the 175B policy), balancing RM quality against inference cost during PPO rollouts.

Training: Pairwise Ranking Loss

Collect pairwise comparisons: for the same prompt, show a human two responses (y_w = preferred/winner, y_l = dispreferred/loser). Train the RM to assign higher scalar scores to preferred responses.

Training objective (Bradley-Terry model of preference):

  L(φ) = -E_{(x, y_w, y_l) ~ D} [ log σ( r_φ(x, y_w) - r_φ(x, y_l) ) ]

Where:

  • r_φ(x, y) — RM score for prompt x and response y
  • σ(·) — sigmoid function
  • D — dataset of human preference triples (prompt, winner, loser)

Intuition: the loss is minimized when r_φ(x, y_w) >> r_φ(x, y_l) — the RM assigns a much higher score to the winner. σ(a - b) is close to 1 when a >> b, so the loss -log σ(·) is close to 0. When the RM gets it wrong (the winner scores lower), σ is near 0 and -log σ(·) is large.
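The per-pair loss can be written out directly (pure Python; r_w and r_l stand for the RM’s scalar scores on the winner and loser):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_ranking_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry pairwise loss: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

# RM strongly prefers the winner: loss is near 0
print(pairwise_ranking_loss(5.0, -5.0))
# RM ranks the pair backwards: loss is large
print(pairwise_ranking_loss(-5.0, 5.0))
# RM is indifferent: loss is exactly log 2
print(pairwise_ranking_loss(1.0, 1.0))
```

Note the loss depends only on the score difference, so the RM’s absolute scale is unconstrained by this objective.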

Data Collection

InstructGPT’s RM training process:

  • For each prompt, the SFT model generates K = 4-9 responses
  • A human contractor ranks all K responses (yielding C(K,2) pairwise comparisons per prompt — up to 36 pairs for K=9)
  • ~33,000 prompts → up to ~1.2M comparison pairs (one ranking session per prompt, up to 36 pairs each)
  • Inter-labeler agreement: ~73% (not perfect, but sufficient for RM training)

Batch training: all C(K,2) comparisons from a single prompt are included in the same batch to reduce overfitting to specific comparison pairs.
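Expanding one labeler’s ranking into its C(K,2) training pairs is mechanical; a sketch (response names are illustrative):

```python
from itertools import combinations

def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking of K responses into the C(K, 2)
    (winner, loser) pairs used for RM training."""
    return list(combinations(ranked, 2))  # order preserved: earlier item wins

pairs = ranking_to_pairs([f"resp_{i}" for i in range(9)])  # K = 9
print(len(pairs))  # 36, matching the up-to-36-pairs figure above
```

Because `combinations` preserves input order, each emitted pair is already (winner, loser) when the input is sorted best to worst.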

Using the RM in PPO

During RL fine-tuning:

  1. Policy generates a complete response to a prompt
  2. RM scores the full response: r = r_φ(prompt, response)
  3. KL penalty subtracted: R = r - β · KL(π_θ || π_sft)
  4. PPO uses R as the reward signal to update the policy

The RM is queried at the end of each response (sequence-level reward), not at each token. This is a design choice — token-level rewards would require the RM to score partial sequences, which is harder to train. The consequence is that PPO must attribute the sequence-level reward back to individual tokens via the critic network.
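The reward handed to PPO in steps 2–3 might be computed as follows (a sketch: the per-token log-prob KL estimator and the beta value are illustrative assumptions, not details from the source):

```python
import math

def kl_penalized_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """R = r - beta * KL(pi_theta || pi_sft), with the KL estimated as the
    sum over response tokens of log pi_theta(token) - log pi_sft(token)."""
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Policy identical to SFT: zero penalty, R equals the raw RM score
same = kl_penalized_reward(2.0, [-1.0, -2.0], [-1.0, -2.0])
# Policy drifted toward tokens the SFT model finds unlikely: R is reduced
drifted = kl_penalized_reward(2.0, [-0.1, -0.2], [-3.0, -4.0])
print(same, drifted)
```

The penalty grows with divergence from the SFT model, which is exactly the tension discussed below: beta trades improvement headroom against hacking risk.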

The Failure Mode: Reward Hacking

The RM is a proxy. The policy is optimized to maximize the proxy, not the true objective. At some point, the policy finds outputs that score high on the RM without being genuinely high-quality — because the RM has finite capacity and can be fooled.

Concrete failure modes:

  • Length hacking: RM rates longer responses higher (humans often do too); policy generates unnecessarily verbose outputs
  • Format hacking: RM associates certain formatting (bullet points, headers) with quality; policy overuses structure regardless of appropriateness
  • Sycophancy: RM rates confident, agreeable responses highly; policy produces confident-sounding wrong answers
  • Adversarial gibberish: At extreme RL steps, policy outputs near-gibberish that somehow triggers high RM scores

The KL penalty mitigates this by keeping the policy close to the SFT model (which doesn’t yet know how to hack the RM). But it’s a tension: too much KL penalty → policy doesn’t improve; too little → reward hacking.

The “overoptimization” curve: as the number of PPO steps increases, RM score increases monotonically, but true human preference score peaks and then decreases. The gap between RM score and true quality is the reward hacking regime.

Quality
  │         ← RM score (always increases with PPO steps)
  │    ████████████████████████████
  │  ████████           ← True quality (peaks, then declines)
  │ ████ ████████████
  │███              ██████████████
  └──────────────────────────────────── PPO steps
                    ↑
             Reward hacking begins
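A toy numerical version of this curve (the functional forms are invented purely for illustration; only the qualitative shape — proxy rising monotonically while true quality peaks — is the point):

```python
import math

def rm_score(t):
    """Proxy reward: keeps rising with PPO steps t."""
    return math.log1p(t)

def true_quality(t):
    """True quality: tracks the proxy early, then the hacking gap dominates."""
    return math.log1p(t) - 0.01 * t

steps = list(range(0, 500, 10))
best = max(steps, key=true_quality)
print(best)  # interior peak: true quality declines past this point
print(rm_score(steps[-1]) > rm_score(best))  # proxy still rising: True
```

Past the peak, every additional PPO step widens the gap between what the RM reports and what humans would actually prefer.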

What’s Clever

Pairwise ranking is far easier for humans than absolute scoring. “Which response is better?” is a simpler judgment than “Score this response 1-10.” This makes data collection cheap and consistent — the Bradley-Terry model is well-studied and handles partial rankings efficiently.

The initialization from SFT is non-obvious: you might expect a randomly initialized RM to be cleaner. But initializing from the SFT model means the RM already knows what good language looks like; it only needs to learn the preference-specific signal on top.

Key Sources

  • rlhf — the pipeline in which the reward model plays the central role
  • ppo — the optimizer that uses reward model scores to update the policy
  • alignment — reward hacking is the core alignment failure mode the RM creates
  • sft — RM is initialized from the SFT model; SFT quality sets the RM’s ceiling
  • dpo — DPO eliminates the separate reward model; implicitly models the reward within the policy optimization loss

Open Questions

  • How do you measure reward model quality independent of downstream PPO results?
  • Can process-based reward models (scoring reasoning steps, not just final answers) reduce reward hacking for reasoning tasks?
  • What is the optimal RM size relative to policy size — does a larger RM always produce a better-aligned policy?
  • How to handle labeler disagreement: learn the average preference, model the distribution, or defer to specific annotators?