Why this matters

If DPO matches PPO-RLHF at every scale, the alignment pipeline collapses to something close to supervised learning: fewer models in memory (the policy plus a frozen reference), no critic, no separately trained reward model, no clipped policy gradient. If it doesn’t (if PPO-style RL outperforms at scale, or if DPO subtly over-optimizes the implicit reward), then frontier labs keep paying the RLHF tax. The answer reshapes which alignment stack is “default” for everyone below frontier compute.
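To make the "collapses to supervised learning" point concrete, here is a minimal sketch of the per-example DPO loss. It assumes the summed token log-probabilities of the chosen and rejected responses have already been computed under the policy and under a frozen reference model. The function names, argument names, and the default beta are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Both arguments are sequence log-probabilities (assumed precomputed).
    """
    return beta * (policy_logp - ref_logp)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; tune per setup
) -> float:
    """Per-example DPO loss: -log sigmoid(r_chosen - r_rejected).

    No reward model or critic appears here; the only extra model
    is the frozen reference that produced the ref_* log-probs.
    """
    margin = implicit_reward(policy_chosen_logp, ref_chosen_logp, beta) - implicit_reward(
        policy_rejected_logp, ref_rejected_logp, beta
    )
    return -log_sigmoid(margin)
```

The loss is an ordinary classification-style objective over preference pairs, which is why no sampling, critic, or clipping is involved. The over-optimization worry in the paragraph above corresponds to the implicit reward margin growing without the policy's actual outputs improving.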

Current best understanding

(2026-04-28) DPO is the default for open-weight models up to ~70B (Llama 3, Mistral, Qwen; note that Llama 2’s chat models used rejection sampling plus PPO, not DPO). Published evidence at frontier scale (GPT-4-class, Gemini Ultra, Claude 3 Opus) is sparse: frontier labs publish little about which alignment algorithm they use, and when they do (e.g. Anthropic’s Constitutional AI / RLAIF, OpenAI’s iterative RLHF), the recipes are hybrid. The DPO paper itself only validated up to 6B parameters. Iterative / online DPO variants (e.g. a rejection-sampling + DPO loop), together with related objectives such as IPO (Identity Preference Optimization), appear to close any gap that exists, suggesting the limitation is static offline preference data, not the loss function.

Evidence

What would settle it

  • A frontier lab (Anthropic, OpenAI, Google DeepMind) publishing an apples-to-apples DPO vs. PPO comparison at 70B+.
  • An open replication of GPT-4-class alignment using DPO end-to-end, with results reported on the same eval harnesses as the RLHF-aligned baseline.
  • Theoretical analysis showing whether DPO’s implicit reward over-optimizes monotonically as model scale grows (the “reward hacking” concern at scale).