The Problem

RLHF works: InstructGPT showed you can use human preference labels to make models dramatically more useful. But the pipeline is painful.

You need to:

  1. Train a reward model (another large model, trained on pairwise preferences)
  2. Run PPO against that reward model (unstable, sensitive to hyperparameters, requires 4 models in memory simultaneously: actor, critic, reference, reward)
  3. Tune KL penalty, learning rate, clip ratio, value function weight…

PPO instability isn’t just inconvenient — it means teams at large labs spend weeks debugging RL runs that diverge or reward-hack. Smaller teams often can’t run RLHF at all.

DPO asks: what if we didn’t need the RL step?

The Key Insight

The RLHF objective has a known optimal solution. If you're running KL-constrained reward maximization:

  max_π  E[r(x, y)] − β·KL(π ‖ π_ref)

The optimal policy is:

  π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β)

where Z(x) is a normalizing constant (the partition function). This is a closed-form expression — you can solve for it analytically. DPO's authors looked at this and asked: if we know the optimal policy has this form, can we invert it to express the reward as a function of the policy?

Yes. Rearranging:

  r(x, y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)

The reward is just the log ratio of the optimal policy to the reference, plus a term that doesn't depend on y. Now substitute this into the Bradley-Terry preference model used to define human preferences. The β·log Z(x) terms cancel. What remains is a pure supervised loss.
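The inversion is easy to sanity-check numerically on a toy three-outcome distribution (the reference probabilities, rewards, and β below are arbitrary values of my own choosing):

```python
import math

# Toy check: pi*(y) ∝ pi_ref(y) · exp(r(y)/beta), then recover r
# from the log ratio. Three "responses", arbitrary numbers.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
r = [1.0, -0.5, 0.2]

unnorm = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
z = sum(unnorm)                      # partition function Z(x)
pi_star = [u / z for u in unnorm]    # optimal policy, normalized

# Rearranged reward: beta · log(pi*/pi_ref) + beta · log Z
recovered = [beta * math.log(ps / p) + beta * math.log(z)
             for ps, p in zip(pi_star, pi_ref)]
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, r))
```

The recovered rewards match the originals exactly (up to float error), which is all the rearrangement claims.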

No reward model. No RL. Just cross-entropy on preference pairs.

Mechanism in Plain English

  1. Start with a reference policy π_ref (the SFT model). Freeze it.
  2. For each preference pair (prompt x, preferred response y_w, rejected response y_l):
    • Compute the log probability of y_w under your current policy π_θ
    • Compute the log probability of y_w under the reference π_ref
    • Do the same for y_l
  3. The training signal: make y_w’s implicit reward (log ratio) higher than y_l’s implicit reward
  4. The β parameter controls how far you’re allowed to deviate from π_ref

This is just binary classification. One forward pass per training example.
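The four steps above can be sketched in a few lines of plain Python (function and variable names are my own; in practice each log-prob is the sum of per-token log-probs of the response under the model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs.

    logp_w / logp_l: log prob of preferred / rejected response under pi_theta.
    ref_logp_w / ref_logp_l: the same under the frozen reference pi_ref.
    """
    # Implicit rewards: beta-scaled log ratios against the reference
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # Binary cross-entropy on "winner beats loser": -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already agrees with the preference -> small loss
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=1.0), 3))  # → 0.127
```

Note there is no reward model anywhere: the two log ratios *are* the rewards.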

ASCII Diagram

  RLHF pipeline (what DPO replaces):
  
  Preference data
       ↓
  Train reward model (RM)  ← 6B separate model
       ↓
  PPO loop:
    [Policy] → generate → [RM] → score
    [Reference] → KL penalty
    [Critic] → value function
       ↓ (repeat 10K+ steps, 4 models in memory)
  Aligned policy
  
  ─────────────────────────────────────────────────
  
  DPO pipeline:
  
  Preference data: (x, y_w, y_l)
       ↓
  For each pair, compute:
  
  log π_θ(y_w|x)    log π_θ(y_l|x)
  log π_ref(y_w|x)  log π_ref(y_l|x)
  
  Implicit reward margin:
  Δ = β[log(π_θ/π_ref)(y_w) − log(π_θ/π_ref)(y_l)]
  
  Loss = −log σ(Δ)    ← just binary cross-entropy
       ↓
  Aligned policy  (2 models in memory: policy + frozen reference)

Math with Translation

The full loss, term by term:

  L = −log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)) )

  • π_θ(y_w|x) — probability of the preferred completion under the policy being trained
  • π_ref(y_w|x) — probability of the preferred completion under the reference (SFT) model
  • β·log(π_θ(y|x)/π_ref(y|x)) — the implicit reward: how much more (or less) the policy favors this completion vs the reference
  • β — KL penalty strength; low = aggressive optimization, high = stay close to reference
  • σ — sigmoid; makes this look like logistic regression
  • The outer −log σ(·) says: make the implicit reward of the winner higher than the loser's

Gradient intuition: the gradient up-weights π_θ(y_w|x) and down-weights π_θ(y_l|x), with larger updates where the policy currently gets the preference wrong (where the margin Δ is small or negative).
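That scaling can be made concrete: the gradient of −log σ(Δ) with respect to the log-ratio difference is −σ(−Δ), so the scalar weight on each update is σ(−Δ), the model's current probability of the *wrong* ranking. A tiny sketch (function name is my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_weight(margin):
    """Scalar multiplying the gradient of log pi(y_w) - log pi(y_l).

    margin = beta * (implicit reward of winner - loser). The weight
    sigma(-margin) is large when the policy ranks the pair wrongly.
    """
    return sigmoid(-margin)

print(round(grad_weight(-2.0), 3))  # wrong and confident: large update
print(round(grad_weight(3.0), 3))   # right and confident: tiny update
```

Examples that the policy already ranks correctly contribute almost nothing, which is exactly the "focus on mistakes" behavior you'd want.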

Concrete Walkthrough

Prompt x: “Explain what a neural network is.”

y_w (preferred): “A neural network is a computational model inspired by the brain, organized in layers that transform inputs into outputs through learned weights.”

y_l (rejected): “Neural networks are AI systems that are very powerful and used in many applications today.”

Training step:

  • log π_θ(y_w|x) = −12.3, log π_ref(y_w|x) = −12.8 (sums of log probs over all tokens in y_w; illustrative numbers)

  • log ratio for y_w: −12.3 − (−12.8) = +0.5 (policy slightly prefers y_w more than reference does)

  • log π_θ(y_l|x) = −9.6, log π_ref(y_l|x) = −9.2 (shorter = higher probability, naively)

  • log ratio for y_l: −9.6 − (−9.2) = −0.4 (policy slightly prefers y_l less than reference does)

With β = 0.1:

  Δ = 0.1 × (0.5 − (−0.4)) = 0.09    loss = −log σ(0.09) ≈ 0.65

Gradient will push to widen this margin: increase the implicit reward gap between y_w and y_l.
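The arithmetic of the training step is easy to verify directly. The log-probs below are illustrative values of my own, not measurements from a real model:

```python
import math

# Made-up summed log-probs for the example pair
logp_w, ref_logp_w = -12.3, -12.8   # preferred response y_w
logp_l, ref_logp_l = -9.6, -9.2     # rejected response y_l (shorter, so higher prob)

beta = 0.1
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(f"margin = {margin:.2f}, loss = {loss:.3f}")  # margin = 0.09, loss = 0.649
```

A loss near −log σ(0) ≈ 0.693 says the policy barely distinguishes the pair yet; training widens the margin and drives the loss down.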

What’s Clever

The non-obvious move is realizing the reward model and the policy are the same object under a reparameterization. In standard RLHF you train them separately. DPO collapses this into one model that is simultaneously the policy and the implicit reward model.

This is why the paper’s title is “Your Language Model is Secretly a Reward Model.”

The second clever thing: the partition function cancels. The hardest part of the reward-as-log-ratio expression is β log Z(x) — a normalizing constant that’s intractable to compute. But in the Bradley-Terry model, Z(x) appears identically in the numerator and denominator and drops out. This isn’t an approximation; it’s an exact cancellation. PPO’s instability partly comes from trying to estimate quantities analogous to Z(x) with a value function. DPO sidesteps this entirely.
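The cancellation is trivial to demonstrate numerically: shifting both implicit rewards by the same per-prompt constant β·log Z(x) leaves the Bradley-Terry probability untouched (a toy check with arbitrary numbers, not the paper's derivation verbatim):

```python
import math

def bt_prob(r_w, r_l):
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

beta, log_z = 0.1, 7.3           # log_z: arbitrary per-prompt constant
r_w, r_l = 0.5, -0.4             # log ratios for the two responses

# Shift both rewards by the same intractable beta·log Z(x) term...
p_with_z = bt_prob(beta * r_w + beta * log_z, beta * r_l + beta * log_z)
p_without = bt_prob(beta * r_w, beta * r_l)
# ...and the preference probability is identical: Z(x) cancels exactly
assert abs(p_with_z - p_without) < 1e-12
print(round(p_without, 4))
```

Because only the reward *difference* enters the sigmoid, any additive per-prompt term drops out — no estimation, no approximation.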

Key Sources

  • rlhf — DPO is a simplification of RLHF
  • sft — SFT model serves as the reference policy π_ref in DPO
  • alignment — DPO is one of the main approaches to aligning LLMs to human preferences
  • reward-model — DPO eliminates the need for an explicit reward model by making it implicit in the log-ratio

Open Questions

  • Does DPO scale reliably to very large models (>70B)?
  • Can iterative/online DPO (collecting preference data from the current policy) improve over static offline DPO?
  • How does DPO compare to RLHF on complex reasoning tasks?