The Problem

RLHF works: InstructGPT showed you can use human preference labels to make models dramatically more useful. But the pipeline is painful.

You need to:

  1. Train a reward model (another large model, trained on pairwise preferences)
  2. Run PPO against that reward model (unstable, sensitive to hyperparameters, requires 4 models in memory simultaneously: actor, critic, reference, reward)
  3. Tune KL penalty, learning rate, clip ratio, value function weight…

PPO instability isn’t just inconvenient — it means teams at large labs spend weeks debugging RL runs that diverge or reward-hack. Smaller teams often can’t run RLHF at all.

DPO asks: what if we didn’t need the RL step?

The Key Insight

The RLHF objective has a known optimal solution. If you're running KL-constrained reward maximization:

  max_π  E[r(x, y)] − β·KL(π ‖ π_ref)

The optimal policy is:

  π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x, y)/β)

where Z(x) is a normalizing constant (the partition function). This is a closed-form expression — you can solve for it analytically. DPO's authors looked at this and asked: if we know the optimal policy has this form, can we invert it to express the reward as a function of the policy?

Yes. Rearranging:

  r(x, y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)

The reward is just the log ratio of the optimal policy to the reference, plus a term that doesn't depend on y. Now substitute this into the Bradley-Terry preference model used to define human preferences. The β·log Z(x) terms cancel. What remains is a pure supervised loss.
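The inversion is easy to sanity-check numerically on a toy three-outcome distribution (the reference probabilities, rewards, and β below are arbitrary values of my own choosing):

```python
import math

# Toy check: pi*(y) ∝ pi_ref(y) · exp(r(y)/beta), then recover r
# from the log ratio. Three "responses", arbitrary numbers.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
r = [1.0, -0.5, 0.2]

unnorm = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
z = sum(unnorm)                      # partition function Z(x)
pi_star = [u / z for u in unnorm]    # optimal policy, normalized

# Rearranged reward: beta · log(pi*/pi_ref) + beta · log Z
recovered = [beta * math.log(ps / p) + beta * math.log(z)
             for ps, p in zip(pi_star, pi_ref)]
assert all(abs(a - b) < 1e-9 for a, b in zip(recovered, r))
```

The recovered rewards match the originals exactly (up to float error), which is all the rearrangement claims.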

No reward model. No RL. Just cross-entropy on preference pairs.

Mechanism in Plain English

  1. Start with a reference policy π_ref (the SFT model). Freeze it.
  2. For each preference pair (prompt x, preferred response y_w, rejected response y_l):
    • Compute the log probability of y_w under your current policy π_θ
    • Compute the log probability of y_w under the reference π_ref
    • Do the same for y_l
  3. The training signal: make y_w’s implicit reward (log ratio) higher than y_l’s implicit reward
  4. The β parameter controls how far you’re allowed to deviate from π_ref

This is just binary classification. One forward pass per training example.
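The four steps above can be sketched in a few lines of plain Python (function and variable names are my own; in practice each log-prob is the sum of per-token log-probs of the response under the model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs.

    logp_w / logp_l: log prob of preferred / rejected response under pi_theta.
    ref_logp_w / ref_logp_l: the same under the frozen reference pi_ref.
    """
    # Implicit rewards: beta-scaled log ratios against the reference
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # Binary cross-entropy on "winner beats loser": -log sigmoid(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already agrees with the preference -> small loss
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=1.0), 3))  # → 0.127
```

Note there is no reward model anywhere: the two log ratios *are* the rewards.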

ASCII Diagram

  RLHF pipeline (what DPO replaces):
  
  Preference data
       ↓
  Train reward model (RM)  ← 6B separate model
       ↓
  PPO loop:
    [Policy] → generate → [RM] → score
    [Reference] → KL penalty
    [Critic] → value function
       ↓ (repeat 10K+ steps, 4 models in memory)
  Aligned policy
  
  ─────────────────────────────────────────────────
  
  DPO pipeline:
  
  Preference data: (x, y_w, y_l)
       ↓
  For each pair, compute:
  
  log π_θ(y_w|x)    log π_θ(y_l|x)
  log π_ref(y_w|x)  log π_ref(y_l|x)
  
  Implicit reward margin:
  Δ = β[log(π_θ/π_ref)(y_w) − log(π_θ/π_ref)(y_l)]
  
  Loss = −log σ(Δ)    ← just binary cross-entropy
       ↓
  Aligned policy  (2 models in memory: policy + frozen reference)

Math with Translation

The full loss, term by term:

  L = −log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x)) − β·log(π_θ(y_l|x)/π_ref(y_l|x)) )

  • π_θ(y_w|x) — probability of the preferred completion under the policy being trained
  • π_ref(y_w|x) — probability of the preferred completion under the reference (SFT) model
  • β·log(π_θ(y|x)/π_ref(y|x)) — the implicit reward: how much more (or less) the policy favors this completion vs the reference
  • β — KL penalty strength; low = aggressive optimization, high = stay close to reference
  • σ — sigmoid; makes this look like logistic regression
  • The outer −log σ(·) says: make the implicit reward of the winner higher than the loser's

Gradient intuition: the gradient up-weights π_θ(y_w|x) and down-weights π_θ(y_l|x), with larger updates where the policy currently gets the preference wrong (where the margin Δ is small or negative).
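That scaling can be made concrete: the gradient of −log σ(Δ) with respect to the log-ratio difference is −σ(−Δ), so the scalar weight on each update is σ(−Δ), the model's current probability of the *wrong* ranking. A tiny sketch (function name is my own):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_weight(margin):
    """Scalar multiplying the gradient of log pi(y_w) - log pi(y_l).

    margin = beta * (implicit reward of winner - loser). The weight
    sigma(-margin) is large when the policy ranks the pair wrongly.
    """
    return sigmoid(-margin)

print(round(grad_weight(-2.0), 3))  # wrong and confident: large update
print(round(grad_weight(3.0), 3))   # right and confident: tiny update
```

Examples that the policy already ranks correctly contribute almost nothing, which is exactly the "focus on mistakes" behavior you'd want.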

Concrete Walkthrough

Prompt x: “Explain what a neural network is.”

y_w (preferred): “A neural network is a computational model inspired by the brain, organized in layers that transform inputs into outputs through learned weights.”

y_l (rejected): “Neural networks are AI systems that are very powerful and used in many applications today.”

Training step:

  • log π_θ(y_w|x) = −12.3, log π_ref(y_w|x) = −12.8 (sums of log probs over all tokens in y_w; illustrative numbers)

  • log ratio for y_w: −12.3 − (−12.8) = +0.5 (policy slightly prefers y_w more than reference does)

  • log π_θ(y_l|x) = −9.6, log π_ref(y_l|x) = −9.2 (shorter = higher probability, naively)

  • log ratio for y_l: −9.6 − (−9.2) = −0.4 (policy slightly prefers y_l less than reference does)

With β = 0.1:

  Δ = 0.1 × (0.5 − (−0.4)) = 0.09    loss = −log σ(0.09) ≈ 0.65

Gradient will push to widen this margin: increase the implicit reward gap between y_w and y_l.
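The arithmetic of the training step is easy to verify directly. The log-probs below are illustrative values of my own, not measurements from a real model:

```python
import math

# Made-up summed log-probs for the example pair
logp_w, ref_logp_w = -12.3, -12.8   # preferred response y_w
logp_l, ref_logp_l = -9.6, -9.2     # rejected response y_l (shorter, so higher prob)

beta = 0.1
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(f"margin = {margin:.2f}, loss = {loss:.3f}")  # margin = 0.09, loss = 0.649
```

A loss near −log σ(0) ≈ 0.693 says the policy barely distinguishes the pair yet; training widens the margin and drives the loss down.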

What’s Clever

The non-obvious move is realizing the reward model and the policy are the same object under a reparameterization. In standard RLHF you train them separately. DPO collapses this into one model that is simultaneously the policy and the implicit reward model.

This is why the paper’s title is “Your Language Model is Secretly a Reward Model.”

The second clever thing: the partition function cancels. The hardest part of the reward-as-log-ratio expression is β log Z(x) — a normalizing constant that’s intractable to compute. But in the Bradley-Terry model, Z(x) appears identically in the numerator and denominator and drops out. This isn’t an approximation; it’s an exact cancellation. PPO’s instability partly comes from trying to estimate quantities analogous to Z(x) with a value function. DPO sidesteps this entirely.
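The cancellation is trivial to demonstrate numerically: shifting both implicit rewards by the same per-prompt constant β·log Z(x) leaves the Bradley-Terry probability untouched (a toy check with arbitrary numbers, not the paper's derivation verbatim):

```python
import math

def bt_prob(r_w, r_l):
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

beta, log_z = 0.1, 7.3           # log_z: arbitrary per-prompt constant
r_w, r_l = 0.5, -0.4             # log ratios for the two responses

# Shift both rewards by the same intractable beta·log Z(x) term...
p_with_z = bt_prob(beta * r_w + beta * log_z, beta * r_l + beta * log_z)
p_without = bt_prob(beta * r_w, beta * r_l)
# ...and the preference probability is identical: Z(x) cancels exactly
assert abs(p_with_z - p_without) < 1e-12
print(round(p_without, 4))
```

Because only the reward *difference* enters the sigmoid, any additive per-prompt term drops out — no estimation, no approximation.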

Key Sources

  • rlhf — DPO is a simplification of RLHF
  • sft — SFT model serves as the reference policy π_ref in DPO
  • alignment — DPO is one of the main approaches to aligning LLMs to human preferences
  • reward-model — DPO eliminates the need for an explicit reward model by making it implicit in the log-ratio

Open Questions

  • Does DPO scale reliably to very large models (>70B)?
  • Can iterative/online DPO (collecting preference data from the current policy) improve over static offline DPO?
  • How does DPO compare to RLHF on complex reasoning tasks?