Concepts: alignment | dpo | rlhf | reward-model
Builds on: direct-preference-optimization-your-language-model-is-secretly-a-reward-model | training-language-models-to-follow-instructions-with-human-feedback
Leads to: downstream alignment research on binary feedback and HALOs
The problem
Aligning LLMs with human preferences has required one very specific and expensive type of data: pairs of responses where a human picks which one they prefer. To collect this data, you have to show someone two model outputs for the same prompt and ask “which is better?” — every single time. That’s slow, expensive, and bottlenecks how fast you can iterate on alignment.
Worse, there’s a deeper assumption buried in this whole approach: the only valid signal is ranked preferences. But most human feedback in the wild isn’t a ranked pair. It’s a thumbs up or thumbs down. “This response is good.” “That one is bad.” Simple binary judgments. Why can’t we just use that?
It turns out we can — and the reason why illuminates something surprising about why DPO and RLHF work in the first place.
The core idea
The analogy first. Imagine you’re learning to cook, and you have two ways to get feedback. The first: a food critic who always eats two dishes side by side and tells you “dish A was better than dish B.” The second: regular diners who just give you a thumbs up or thumbs down after each meal. The first critic is expensive to hire and you can only afford a few sessions. The diners are everywhere — you could get hundreds of reactions per night.
The obvious objection: surely ranked feedback is richer? The critic tells you more. But here’s the catch: if your cooking is inconsistent — some diners love spice, others hate it — the critic’s preference tells you about the food and about the critic’s taste that day. Binary feedback from many diners tells you what most people actually want. For noisy, messy real-world data, the simpler signal wins.
KTO (Kahneman-Tversky Optimization) is the alignment equivalent of switching from the food critic to the crowd of diners.
How the authors got there. Before deriving KTO, the paper makes a genuinely interesting theoretical observation. The authors noticed that DPO and PPO-Clip — the two dominant alignment methods — share a hidden structural property. Both implicitly model human decision-making as having two key features: a reference point (humans judge things relative to something, not in absolute terms) and loss aversion (humans feel losses more sharply than equivalent gains).
“We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases—the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call human-aware losses (HALOs).”
Translation: DPO doesn’t just happen to work. It works because it accidentally mirrors how humans psychologically process outcomes. The authors formalize this into a class called HALOs (Human-Aware Losses).
This comes from prospect theory — a framework from behavioral economics developed by Kahneman and Tversky to explain why humans make decisions that don’t maximize expected value. The key insight: humans don’t evaluate outcomes in absolute terms. They evaluate them relative to a reference point, and the pain of a loss outweighs the pleasure of an equivalent gain. Empirically, the median loss aversion coefficient is about 2.25: a $100 loss hurts roughly 2.25 times as much as a $100 gain feels good.
The mechanism in plain English.
In DPO, you need a pair of responses for the same input — a preferred winner y_w and a rejected loser y_l. The loss maximizes the margin between them:

    L_DPO = -E[ log σ( β·log π_θ(y_w|x)/π_ref(y_w|x) − β·log π_θ(y_l|x)/π_ref(y_l|x) ) ]
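For concreteness, a minimal sketch of this per-pair loss in plain Python, with made-up scalar log-probabilities (the function name and inputs are illustrative, not the paper's implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigma(beta * (reward_w - reward_l))."""
    reward_w = logp_w - ref_logp_w   # implied reward of the preferred response
    reward_l = logp_l - ref_logp_l   # implied reward of the rejected response
    return -math.log(sigmoid(beta * (reward_w - reward_l)))

# The policy already separates winner from loser here, so the loss sits
# below the chance value log(2):
assert dpo_loss(-1.0, -3.0, -2.0, -2.0) < math.log(2)
```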
KTO breaks this pairing requirement. You just need individual examples tagged as desirable or undesirable. The loss is:

    L_KTO = E_{x,y} [ λ_y − v(x, y) ]

where:

    r_θ(x, y) = log π_θ(y|x)/π_ref(y|x)
    z0        = max(0, mean log π_θ(y'|x)/π_ref(y'|x))    (y' mismatched within the batch)
    v(x, y)   = λ_D · σ(β · (r_θ(x, y) − z0))    if y is desirable
                λ_U · σ(β · (z0 − r_θ(x, y)))    if y is undesirable
Let’s unpack each piece:
- r_θ(x, y) — the implied reward: how much more likely is the current policy to generate y given x, compared to the reference model? Positive means the current policy has moved toward this output; negative means it has moved away.
- z0 — the reference point: the average implied reward over mismatched (input, output) pairs in the same batch. This is the Kahneman-Tversky prospect theory term — humans judge outputs relative to what they could have gotten, not in absolute terms.
- σ — the logistic function, standing in for the Kahneman-Tversky value function. Concave in gains (diminishing sensitivity to further improvement), convex in losses (more sensitive near the reference point).
- λ_D, λ_U — loss aversion hyperparameters for desirable and undesirable examples respectively.
For desirable outputs, KTO tries to increase r_θ(x, y) above z0: the model should rate this output higher than the reference point. For undesirable outputs, it tries to decrease r_θ(x, y) below the reference point. Each example is trained independently — no pairing needed.
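Putting the pieces together, here is a minimal plain-Python sketch of the per-microbatch KTO loss. The `kto_loss` name, the scalar implied rewards, and the list-based batch are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(implied_rewards, desirable, mismatched_log_ratios,
             beta=0.1, lam_d=1.0, lam_u=1.0):
    """Mean KTO loss over one microbatch.

    implied_rewards[i]    = log pi_theta(y_i|x_i) - log pi_ref(y_i|x_i)
    desirable[i]          = True for thumbs-up, False for thumbs-down
    mismatched_log_ratios = log-ratios of shuffled (x, y') pairs,
                            used to estimate the reference point z0
    """
    # Reference point: mean log-ratio over mismatched pairs, clamped at 0
    z0 = max(0.0, sum(mismatched_log_ratios) / len(mismatched_log_ratios))
    losses = []
    for r, good in zip(implied_rewards, desirable):
        if good:
            v = lam_d * sigmoid(beta * (r - z0))   # reward above z0 -> v near lam_d
            losses.append(lam_d - v)
        else:
            v = lam_u * sigmoid(beta * (z0 - r))   # reward below z0 -> v near lam_u
            losses.append(lam_u - v)
    return sum(losses) / len(losses)

# A desirable example the policy already rates above the reference point
# contributes a loss below 0.5 (i.e. v > 1/2):
assert kto_loss([0.5], [True], [0.0]) < 0.5
```

Note that each element of the batch enters the loss on its own; the only cross-example coupling is the shared baseline z0.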
ASCII diagram.
DPO (requires paired data):
Input x ──→ [Response yw (preferred)] ─┐
└→ [Response yl (rejected)] ─┴── margin loss: push yw up, yl down
One training example = 2 outputs, must be for same x
KTO (only needs binary labels):
Input x1 ──→ [Response y1 (desirable ✓)] → push r_θ(x1,y1) > z0
Input x2 ──→ [Response y2 (undesirable ✗)]→ push r_θ(x2,y2) < z0
Input x3 ──→ [Response y3 (desirable ✓)] → push r_θ(x3,y3) > z0
One training example = 1 output with a thumbs up/down
z0 = average log-ratio over MISMATCHED pairs in the same microbatch
(x1,y2), (x2,y3), (x3,y1) → estimates KL divergence from ref model
Numeric walkthrough.
Let’s trace through one KTO update step. We have a microbatch with 3 examples:
Microbatch:
(x1, y1, desirable) r_θ(x1,y1) = +0.4 [model has learned to rate this higher]
(x2, y2, undesirable) r_θ(x2,y2) = +0.1 [model hasn't moved away yet]
(x3, y3, desirable) r_θ(x3,y3) = -0.2 [model still underrates this good output]
Step 1: Compute z0 from mismatched pairs (x1,y2), (x2,y3), (x3,y1)
log π_θ(y2|x1)/π_ref(y2|x1) = -0.3 (y2 is a bad response to x1, model downgrades it)
log π_θ(y3|x2)/π_ref(y3|x2) = +0.2
log π_θ(y1|x3)/π_ref(y1|x3) = -0.1
z0 = max(0, mean(-0.3, +0.2, -0.1)) = max(0, -0.067) = 0.0
Step 2: Compute v(x,y) for each example (β=0.1, λ_D=λ_U=1)
v(x1,y1) = σ(0.1 × (0.4 - 0.0)) = σ(0.04) = 0.510 [desirable, above ref point]
v(x2,y2) = σ(0.1 × (0.0 - 0.1)) = σ(-0.01) = 0.498 [undesirable: want z0 - r_θ > 0]
v(x3,y3) = σ(0.1 × (-0.2 - 0.0)) = σ(-0.02) = 0.495 [desirable but below ref point]
Step 3: Loss = mean(λ_y - v(x,y))
Example 1: 1.0 - 0.510 = 0.490 → gradient pushes r_θ(x1,y1) higher ✓
Example 2: 1.0 - 0.498 = 0.502 → gradient pushes r_θ(x2,y2) lower ✓
Example 3: 1.0 - 0.495 = 0.505 → gradient pushes r_θ(x3,y3) higher ✓
Total loss: 0.499
The KL estimate z0 is clamped to zero here because the mismatched pairs already average below zero — the model isn’t systematically diverging from the reference yet. As training proceeds and the model gets more aligned, z0 will rise and act as a tighter baseline.
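The arithmetic above can be verified in a few lines of plain Python (numbers copied from the walkthrough):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
beta = 0.1

# Step 1: reference point from the mismatched-pair log-ratios, clamped at zero
z0 = max(0.0, (-0.3 + 0.2 + -0.1) / 3)   # = 0.0

# Step 2: value function per example (lam_D = lam_U = 1)
v1 = sigmoid(beta * (0.4 - z0))    # desirable,   ~0.510
v2 = sigmoid(beta * (z0 - 0.1))    # undesirable, ~0.498
v3 = sigmoid(beta * (-0.2 - z0))   # desirable,   ~0.495

# Step 3: mean loss
loss = ((1 - v1) + (1 - v2) + (1 - v3)) / 3
print(round(loss, 3))              # 0.499
```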
What’s clever about this.
The non-obvious insight is about what DPO is actually doing when it works. The paper proves (Theorem 4.2) that two reward functions in the same equivalence class — differing only by an input-specific constant — induce the same optimal policy under the RLHF objective and the same Bradley-Terry preference distribution, but different human utility distributions.
“Maximizing preference likelihood does not mean one is maximizing human utility.”
Translation: DPO can get the right behavior while being off on what it thinks the human actually values. KTO directly optimizes for human utility using the Kahneman-Tversky value function — closer to what we actually want to maximize.
The second clever bit: KTO has better worst-case behavior on noisy data. When two people give contradictory preferences over the same pair, DPO can actually end up preferring the minority-preferred output under certain conditions. KTO with a loss-neutral value function (λ_D = λ_U = 1) provably produces the majority-preferred output (Theorem 4.3). Real-world preference datasets are noisy — this matters.
“Since real-world feedback is very noisy, the reason a desirable example has a highly negative implied reward may be because it is mislabeled. By avoiding this hard-to-learn data, KTO avoids fitting to noise.”
Translation: when an example looks extremely easy or extremely hard relative to the current model, KTO’s logistic value function saturates and effectively ignores it. Built-in noise robustness.
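The saturation argument is easy to check numerically. For a desirable example, the gradient of v with respect to the implied reward is β·σ(z)·(1 − σ(z)) with z = β·(r_θ − z0): it peaks at the reference point and vanishes for extreme implied rewards. A plain-Python sketch with illustrative numbers:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_magnitude(delta, beta=0.1):
    """|dv/dr| for a desirable example, where delta = r_theta - z0.
    d/dr sigma(beta*delta) = beta * sigma(z) * (1 - sigma(z)), z = beta*delta."""
    z = beta * delta
    return beta * sigmoid(z) * (1.0 - sigmoid(z))

print(grad_magnitude(0.0))     # ~0.025  -> near the reference point: full update
print(grad_magnitude(-100.0))  # ~4.5e-06 -> extreme "hard" example: ignored
```

A desirable example with a hugely negative implied reward (likely mislabeled) gets an update several thousand times smaller than one near the reference point.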
Does it actually work? What breaks?
Core results (Zephyr-β-SFT on UltraFeedback, 1 epoch):
| Method | MMLU | GSM8K | HumanEval | BBH |
|---|---|---|---|---|
| SFT baseline | 57.2 | 39.0 | 30.1 | 46.3 |
| DPO | 58.2 | 40.0 | 30.1 | 44.1 |
| KTO (β=0.1, λ_D=1) | 58.6 | 53.5 | 30.9 | 52.6 |
The GSM8K jump — from 40.0 (DPO) to 53.5 (KTO) — is striking. A 13.5-point improvement on mathematical reasoning just from switching from DPO to KTO on the same data. On Mistral-7B aligned with one-y-per-x KTO (truly unpaired data), the method still beats DPO with 72% less training data:
| Method | Winrate vs SFT target |
|---|---|
| Mistral-7B + DPO | 0.600 ± 0.037 |
| Mistral-7B + KTO (all y per x) | 0.652 ± 0.036 |
| Mistral-7B + KTO (one y per x) | 0.631 ± 0.036 |
What doesn’t work.
Every structural element of KTO matters — ablations reveal fragility. Removing the reference point drops BBH by 4.0 points and GSM8K by 3.6. Making the value function concave everywhere (like DPO’s log-sigmoid) loses 9.4 points on BBH. Setting it to the identity function (risk-neutral) causes BBH to collapse entirely: 6.1 vs 52.6.
KTO is also more sensitive to learning rate than DPO. The optimal LR is typically 2-10× larger than DPO’s default of 5e-7 — you need 5e-6 as a starting point. Get this wrong and performance degrades significantly.
At very small model scales (Pythia 1.4B-2.8B), the advantage over DPO disappears. A minimum model capacity seems necessary for the difference to emerge. The paper uses Llama and Pythia families up to 30B — nothing at GPT-4 scale is tested.
The authors are also honest that the Kahneman-Tversky value function was calibrated for monetary gambles, not text quality assessments. The correspondence is motivated but not proven:
“KTO is based on the Kahneman-Tversky value function for monetary gambles, which is almost certainly different from how humans perceive the relative goodness of text.”
So what?
If you’re building ML systems that require alignment, KTO is worth reaching for first when your feedback data is naturally binary — user ratings, thumbs up/down from deployed systems, automated classifiers, or rule-based reward signals. You don’t need to force your feedback into a paired format. The paper shows you can use a 9:1 ratio of undesirable to desirable examples and still outperform DPO — scale λ_D and λ_U to compensate for the imbalance (the paper recommends keeping λ_D·n_D / (λ_U·n_U) roughly in [1, 4/3], where n_D and n_U count the desirable and undesirable examples).
When feedback is already in paired preference format, the choice is less obvious. If your data is clean and consistent (synthetic, carefully curated), DPO may be the safer bet — KTO can underfit on clean data where every example is informative. If your data is messy and real-world (crowd-sourced, mixed annotators, annotation contradictions), KTO’s noise robustness likely wins.
This paper connects directly to direct-preference-optimization-your-language-model-is-secretly-a-reward-model, which KTO is both improving on and explaining. The HALO framework reveals why DPO works: not just because it’s a clever reparameterization of RLHF, but because it accidentally encodes prospect-theoretic biases that match how humans actually perceive outcomes. Understanding training-language-models-to-follow-instructions-with-human-feedback (InstructGPT) helps here too — KTO is trying to solve the same objective as PPO-based RLHF but without a reward model or preference pairs.
You don’t need paired preferences to align a language model — a thumbs up or thumbs down on individual outputs, framed through prospect theory, is enough.
Connections
- alignment — KTO is a new alignment objective in the HALO family
- dpo — KTO extends and explains DPO; both are HALOs
- rlhf — KTO achieves the RLHF objective without a reward model or PPO
- reward-model — KTO eliminates the reward model, using binary labels directly
- sft — KTO can skip SFT entirely at sufficient scale; DPO cannot
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — the method KTO extends and supersedes for binary feedback settings
- training-language-models-to-follow-instructions-with-human-feedback — the PPO-RLHF pipeline that both DPO and KTO aim to replace
Citation
Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024. https://arxiv.org/abs/2402.01306