Why this matters
If DPO matches PPO-RLHF at every scale, the alignment pipeline collapses to supervised learning — fewer models in memory, no critic, no reward model, no clipped policy gradient. If it doesn’t — if PPO-style RL outperforms at scale, or if DPO subtly over-optimizes the implicit reward — then frontier labs keep paying the RLHF tax. The answer reshapes which alignment stack is “default” for everyone below frontier compute.
Current best understanding
(2026-04-28) DPO is a common default for open-weight models up to ~70B (LLaMA 3, Mistral, Qwen); LLaMA 2 is the notable exception, since it used rejection sampling + PPO. Published evidence at frontier scale (GPT-4-class, Gemini Ultra, Claude 3 Opus) is sparse: frontier labs publish little about which alignment algorithm they use, and when they do (e.g. Anthropic’s Constitutional AI / RLAIF, OpenAI’s iterative RLHF), the recipes are hybrid. The DPO paper itself only validated up to 6B. Iterative / online DPO variants (rejection sampling + DPO loops, online DPO) appear to close any gap that exists, suggesting the limitation is static offline preference data, not the loss function. IPO (Identity Preference Optimization) is a related offline objective that regularizes the DPO loss, not an iterative variant.
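To make the "loss function vs. data" distinction concrete, here is a minimal sketch in plain Python, assuming summed per-sequence log-probabilities are already available as floats. The DPO and IPO per-pair losses follow the published formulas; `build_pairs`, `sample`, `score`, and the default values of `beta`, `tau`, and `k` are hypothetical names and settings chosen for illustration, not any lab's recipe.

```python
import math
from typing import Callable, List, Tuple


def log_ratio_margin(pol_w: float, ref_w: float, pol_l: float, ref_l: float) -> float:
    """h = [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)]."""
    return (pol_w - ref_w) - (pol_l - ref_l)


def dpo_loss(h: float, beta: float = 0.1) -> float:
    """DPO per-pair loss, -log sigmoid(beta * h), written as a stable softplus."""
    z = -beta * h
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(h: float, tau: float = 0.1) -> float:
    """IPO per-pair loss: squared distance of h from the target margin 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


def build_pairs(
    prompts: List[str],
    sample: Callable[[str], str],        # hypothetical: draws one response from the current policy
    score: Callable[[str, str], float],  # hypothetical: reward model or judge
    k: int = 4,
) -> List[Tuple[str, str, str]]:
    """One round of the data side of an iterative loop: sample k responses per
    prompt from the current policy and keep (prompt, best, worst) as a fresh pair."""
    pairs = []
    for x in prompts:
        scored = sorted(((score(x, y), y) for y in (sample(x) for _ in range(k))),
                        key=lambda t: t[0])
        (lo_score, lo_y), (hi_score, hi_y) = scored[0], scored[-1]
        if hi_score > lo_score:  # skip prompts where every sample ties
            pairs.append((x, hi_y, lo_y))
    return pairs
```

In this framing the loss is unchanged between offline and iterative DPO; only `build_pairs` differs, because it refreshes the pairs from the current policy each round instead of reusing a static dataset.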
Evidence
- [2023-05] direct-preference-optimization-your-language-model-is-secretly-a-reward-model: Original DPO paper (preprint May 2023). Experiments capped at GPT-J-6B / Pythia-2.8B. Claims a superior reward-KL frontier vs. PPO, shown on the controlled sentiment task. Open question: does the frontier hold at 70B and above?
- [2023] llama-2-open-foundation-fine-tuned-chat-models: LLaMA 2 used iterative RLHF with rejection sampling + PPO, not DPO. Weak evidence either way: LLaMA 2 shipped in July 2023, about two months after the DPO preprint (May 2023), so the choice may reflect timing as much as distrust of DPO at 70B.
- [2024] kto-model-alignment-prospect-theoretic-optimization: KTO learns from unpaired binary (desirable / undesirable) labels, where DPO requires preference pairs. Indirect evidence that the form of the preference data, not the optimizer, is often the bottleneck.
- [2024] orpo-monolithic-preference-optimization: ORPO (an odds-ratio penalty added to the SFT loss, with no reference model) is reported to beat SFT+DPO at 2.7B and 7B; suggests the reference model and the separate preference phase are the bottleneck, not the shape of the loss. No 70B+ comparison available.
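As a rough illustration of the ORPO entry, the sketch below combines an SFT negative log-likelihood with an odds-ratio penalty and uses no reference model. It assumes length-normalized (per-token average) log-probabilities as inputs; the function names and the default weight `lam` are illustrative assumptions rather than the paper's code.

```python
import math


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires p < 1."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative (p < 1)")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_w: float, avg_logp_l: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus lam * odds-ratio penalty.

    avg_logp_w / avg_logp_l: per-token average log-probs of the chosen and
    rejected responses under the model being trained (no reference model).
    lam: penalty weight; 0.1 is an assumed default for this sketch.
    """
    sft_nll = -avg_logp_w
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    penalty = _softplus(-log_odds_ratio)  # equals -log sigmoid(log_odds_ratio)
    return sft_nll + lam * penalty
```

Compared with the DPO loss, every term here depends only on the policy being trained, which is why the comparison isolates the reference model and the separate preference phase.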
What would settle it
- A frontier lab (Anthropic, OpenAI, Google DeepMind) publishing an apples-to-apples DPO vs. PPO comparison at 70B+.
- An open replication of GPT-4-class alignment using DPO end-to-end, with eval results on the same harnesses.
- Theoretical analysis showing whether DPO’s implicit reward over-optimizes monotonically as model scale grows (the “reward hacking” concern at scale).
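An empirical counterpart to the over-optimization question can be sketched as follows: track a Monte Carlo KL estimate and an independent "gold" score per checkpoint, and flag checkpoints where KL keeps rising while the gold score falls. This is only a diagnostic under stated assumptions; `flag_overoptimization`, the checkpoint tuple layout, and the gold score itself (a held-out judge or reward model) are hypothetical.

```python
from statistics import fmean
from typing import List, Tuple


def implicit_reward(pol_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO's implicit reward, up to a prompt-only term: beta * log(pi / pi_ref)."""
    return beta * (pol_logp - ref_logp)


def kl_estimate(pol_logps: List[float], ref_logps: List[float]) -> float:
    """Monte Carlo estimate of KL(pi || pi_ref) from sequences sampled from pi.

    Inputs are summed sequence log-probs under the policy and the reference
    for the same on-policy samples; the lists must be non-empty.
    """
    return fmean([p - r for p, r in zip(pol_logps, ref_logps)])


def flag_overoptimization(checkpoints: List[Tuple[float, float]]) -> List[int]:
    """checkpoints: (kl, gold_score) per checkpoint, in training order.

    Returns indices where KL rose but the gold score dropped relative to the
    previous checkpoint, a simple symptom of reward over-optimization.
    """
    return [
        i
        for i in range(1, len(checkpoints))
        if checkpoints[i][0] > checkpoints[i - 1][0]
        and checkpoints[i][1] < checkpoints[i - 1][1]
    ]
```

Running the same diagnostic at several model sizes with matched data would show whether the flagged region arrives earlier or later as scale grows, which is the empirical version of the monotonicity question.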