KTO: Model Alignment as Prospect Theoretic Optimization

Stub — full ingest pending.

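Ethayarajh et al. (2024) propose KTO (Kahneman-Tversky Optimization), an alignment objective derived from Kahneman and Tversky's prospect theory. Unlike DPO, which requires paired preference comparisons (output A is better than output B for the same prompt), KTO needs only a binary label, good or bad, on each individual output. The loss mirrors the asymmetry prospect theory describes between gains and losses: each output is scored relative to a reference point, and separate weights on good and bad examples control how strongly bad outputs are penalized relative to how much good outputs are rewarded. The authors report that KTO matches or exceeds DPO at model scales from 1B to 30B parameters while working with unpaired data, which makes it applicable to existing datasets that carry only per-output labels.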
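To make the shape of the loss concrete, here is a minimal per-example sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function and parameter names (kto_example_loss, z_ref, lambda_d, lambda_u) are hypothetical, the log-probabilities are assumed to be summed over the output tokens, and the reference point z_ref is passed in as a plain number, whereas in training it would be estimated from the policy and reference model and treated as a constant.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(
    policy_logp: float,
    ref_logp: float,
    desirable: bool,
    z_ref: float,
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    """Sketch of a KTO-style loss for one (prompt, output, label) example.

    All names and default values here are illustrative assumptions.
    """
    # Implied reward: log-ratio of policy to reference likelihood.
    reward = policy_logp - ref_logp
    if desirable:
        # Good output: value grows as the reward rises above the reference point.
        value = lambda_d * sigmoid(beta * (reward - z_ref))
        return lambda_d - value
    # Bad output: value grows as the reward falls below the reference point.
    value = lambda_u * sigmoid(beta * (z_ref - reward))
    return lambda_u - value


# The same output scored under each label: raising its likelihood relative to
# the reference model lowers the loss if it is good and raises it if it is bad.
good = kto_example_loss(-5.0, -10.0, desirable=True, z_ref=0.0)
bad = kto_example_loss(-5.0, -10.0, desirable=False, z_ref=0.0)
print(good < bad)
```

Setting lambda_u above lambda_d in this sketch weights bad examples more heavily than good ones, which is one way to express the asymmetry; the appropriate values depend on the dataset and are not specified in this stub.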

Key claim: Prospect-theoretic loss on binary good/bad labels matches DPO quality without requiring paired preference comparisons.
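The difference in data requirements can be illustrated with a small sketch that flattens paired preference records into the unpaired, binary-labeled form KTO consumes. The field names (prompt, chosen, rejected, completion, label) are hypothetical and chosen for illustration only; they are not a schema taken from the paper.

```python
from typing import Dict, List, Union

Example = Dict[str, Union[str, bool]]


def unpair(pairs: List[Dict[str, str]]) -> List[Example]:
    """Split each (chosen, rejected) pair into two independently labeled examples."""
    examples: List[Example] = []
    for pair in pairs:
        # The preferred output becomes a standalone "good" example.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        # The dispreferred output becomes a standalone "bad" example.
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


paired = [{"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"}]
unpaired = unpair(paired)
print(len(unpaired))
```

The reverse direction is not generally possible: a dataset with only per-output good/bad labels does not contain the same-prompt comparisons that DPO needs, which is why an objective that works on unpaired labels widens the pool of usable data.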
