Does DPO scale reliably past 70B?

Why this matters

If DPO matches PPO-RLHF at every scale, the alignment pipeline collapses to supervised learning — fewer models in memory, no critic, no reward model, no clipped policy gradient. If it doesn’t — if PPO-style RL outperforms at scale, or if DPO subtly over-optimizes the implicit reward — then frontier labs keep paying the RLHF tax. The answer reshapes which alignment stack is “default” for everyone below frontier compute.

Current best understanding

(2026-04-28) DPO is the default for open-weight models up to ~70B (LLaMA 3, Mistral, Qwen); LLaMA 2’s official chat models are the exception, having used rejection sampling + PPO. Published evidence at frontier scale (GPT-4-class, Gemini Ultra, Claude 3 Opus) is sparse: frontier labs publish little about which alignment algorithm they use, and when they do (e.g. Anthropic’s Constitutional AI / RLAIF, OpenAI’s iterative RLHF), the recipes are hybrid. The DPO paper itself only validated up to 6B parameters. Iterative / online DPO variants (rejection sampling + DPO loops), along with the related IPO loss (identity preference optimization), appear to close any gap that exists, suggesting the limitation is mainly static offline preference data rather than the loss function.
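The sense in which DPO is "just a loss function" can be made concrete with a minimal sketch of the per-pair loss described in the DPO paper. It assumes the four inputs are summed token log-probabilities of each completion under the trainable policy and under a frozen reference model. The function names, argument names, and the `beta=0.1` default are illustrative assumptions, not taken from this page or from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; beta scales the implicit reward
) -> tuple[float, float]:
    """Return (loss, reward margin) for one preference pair."""
    # Implicit reward of a completion: beta * log(pi_policy / pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same quantity as softplus(-margin).
    return softplus(-margin), margin


# Policy identical to the reference: zero margin, loss equals log(2).
loss_at_init, margin_at_init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability mass toward the chosen completion.
loss_later, margin_later = dpo_pair_loss(-10.0, -16.0, -12.0, -15.0)
```

Nothing in this computation samples from the policy or queries a separate reward model or critic; the extra cost over plain supervised fine-tuning is the forward pass through the frozen reference model. That is also why the loss sees only the completions already present in the offline dataset.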

Evidence

[2023-05] direct-preference-optimization-your-language-model-is-secretly-a-reward-model — Original DPO paper. Experiments capped at GPT-J-6B / Pythia-2.8B. Claims a superior reward-KL frontier vs. PPO. Open question: does that advantage hold above 70B?
[2023] llama-2-open-foundation-fine-tuned-chat-models — LLaMA 2 used iterative RLHF with rejection sampling + PPO, not DPO. Suggests the LLaMA team didn’t trust DPO at 70B at the time, but doesn’t rule it out.
[2024] kto-model-alignment-prospect-theoretic-optimization — KTO works on unpaired binary signals where DPO requires pairs. Indirect evidence that the form of preference data, not the optimization, is often the bottleneck.
[2024] orpo-monolithic-preference-optimization — ORPO (an odds-ratio penalty added to the SFT loss) is reported to beat SFT+DPO at 2.7B and 7B, which suggests the reference model and the separate alignment phase, rather than the shape of the loss, are the bottleneck. No 70B+ comparison is available.
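The pairing requirement mentioned in the KTO entry is a constraint on data format, which a small sketch can illustrate. The record types and the `unpair` helper below are hypothetical names chosen for illustration; they are not from the KTO paper or any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical DPO-style record: two completions for one prompt."""
    prompt: str
    chosen: str
    rejected: str


@dataclass(frozen=True)
class BinaryExample:
    """Hypothetical KTO-style record: one completion with a binary label."""
    prompt: str
    completion: str
    desirable: bool


def unpair(pairs: list[PreferencePair]) -> list[BinaryExample]:
    """Split each preference pair into two independently labelled examples."""
    examples: list[BinaryExample] = []
    for pair in pairs:
        examples.append(BinaryExample(pair.prompt, pair.chosen, True))
        examples.append(BinaryExample(pair.prompt, pair.rejected, False))
    return examples


pairs = [PreferencePair("Summarize the memo.", "A faithful summary.", "An off-topic reply.")]
binary = unpair(pairs)
```

The conversion only runs easily in this direction. Rebuilding pairs from binary feedback requires a desirable and an undesirable completion for the same prompt, which thumbs-up / thumbs-down logs often do not contain.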

What would settle it

A frontier lab (Anthropic, OpenAI, Google DeepMind) publishing an apples-to-apples DPO vs. PPO comparison at 70B+.
An open replication of GPT-4-class alignment using DPO end-to-end, with eval results on the same harnesses.
Theoretical analysis showing whether DPO’s implicit reward over-optimizes monotonically as model scale grows (the “reward hacking” concern at scale).
