Why this matters
The 2022-23 RLHF recipe (SFT → reward model → PPO) was the standard answer for one model generation. Since then, DPO folded reward modeling and PPO into a single supervised loss, KTO removed the need for paired preferences, Constitutional AI / RLAIF replaced human feedback with model feedback, and DeepSeek-R1 showed that RL on verifiable rewards can drive reasoning (its R1-Zero variant skips SFT entirely). The landscape is now a menu, not a pipeline, and which subset works best at frontier scale is unclear. Picking wrong wastes compute; picking right could make alignment 2-10× cheaper.
Current best understanding
(2026-05-05) There’s a rough consensus on the shape:
- SFT on high-quality demonstrations (LIMA showed quality > quantity here).
- Preference learning via DPO or KTO for general helpfulness and harmlessness, often iteratively.
- RLAIF / Constitutional AI to scale preference data beyond what humans can label.
- RL on verifiable rewards (math, code, logic) for reasoning capabilities — the DeepSeek-R1 path.
What’s not settled: whether the stages should run sequentially, be interleaved, or be jointly optimized; how much each contributes at frontier scale; and how the stages interact (does later RL erase what SFT taught? does RLAIF preserve the nuance of human labels?).
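To make "RL on verifiable rewards" concrete, here is a minimal sketch of a rule-based reward for math-style answers. It is an illustration, not the reward used by any particular lab: the `\boxed{...}` answer convention, the exact-string comparison, and the name `verifiable_reward` are all assumptions made for this example.

```python
import re

# Assumption: the model is prompted to put its final answer in \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference, else 0.0."""
    answers = BOXED.findall(completion)
    if not answers:
        # No parseable final answer, so nothing can be verified.
        return 0.0
    # Compare the last boxed answer, ignoring surrounding whitespace.
    return 1.0 if answers[-1].strip() == reference.strip() else 0.0
```

The point of the design is that the reward comes from a deterministic check instead of a learned reward model, so there is no reward model for the policy to exploit. A real checker would need answer normalization (for example, treating `1/2` and `0.5` as equal), which this sketch omits.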
Evidence
- training-language-models-to-follow-instructions-with-human-feedback — InstructGPT, the original SFT → RM → PPO recipe.
- learning-to-summarize-human-feedback — The pre-InstructGPT precedent showing reward-model + PPO works.
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO collapses RM + PPO into one supervised step.
- kto-model-alignment-prospect-theoretic-optimization — KTO removes paired-preference requirement.
- constitutional-ai-harmlessness-from-ai-feedback — Constitutional AI / RLAIF: scale preference data with model self-critique.
- lima-less-is-more-for-alignment — Quality dominates quantity in SFT data.
- deepseek-r1-reasoning-via-reinforcement-learning — RL on verifiable rewards as a reasoning-specific track.
- [2024] orpo-monolithic-preference-optimization — collapses SFT + preference alignment into one monolithic pass with an odds ratio penalty; no reference model, no separate DPO stage; beats Zephyr-β (SFT+DPO) at 7B.
- [2024] self-rewarding-language-models — removes the frozen reward model ceiling: the LLM scores its own outputs via LLM-as-a-Judge, and both instruction-following and reward-modeling ability improve across 3 iterative DPO rounds. LLaMA 2 70B M₃ reaches 20.44% AlpacaEval 2.0 win rate, beating Claude 2 and Gemini Pro using only ~3,200 seed examples.
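As a reference point for the DPO entry above, the per-example DPO loss can be sketched in a few lines. This is a scalar, standard-library illustration under stated assumptions: the inputs are summed token log-probabilities of each full response under the policy and the frozen reference model, and the names `dpo_loss` and `softplus` plus the default `beta=0.1` are chosen for this example only.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen - rejected log-ratio))."""
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one. No separate reward model or sampling loop appears anywhere, which is the sense in which DPO turns RM + PPO into one supervised step.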
What would settle it
- A controlled head-to-head ablation at 70B+: each stage on/off, measuring downstream eval lift per FLOP.
- A frontier lab publishing an end-to-end recipe with rationale (rare — most stop at “we used RLHF”).
- Long-running evals on whether iterative DPO or PPO produces less reward hacking over many alignment rounds.
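The stage on/off ablation in the first bullet can be laid out as a small grid. The sketch below only enumerates configurations and defines the lift-per-FLOP metric; the stage names in `STAGES` and the helper names are hypothetical, and no scores or FLOP counts are supplied because none exist in this note.

```python
from itertools import product

# Hypothetical stage labels mirroring the four-stage shape described earlier.
STAGES = ("sft", "preference", "rlaif", "rlvr")


def ablation_grid(stages=STAGES):
    """Enumerate every on/off combination of stages (2**len(stages) configs)."""
    return [
        dict(zip(stages, flags))
        for flags in product((False, True), repeat=len(stages))
    ]


def lift_per_flop(score: float, baseline_score: float, train_flops: float) -> float:
    """Eval improvement over the baseline, per training FLOP spent."""
    if train_flops <= 0:
        raise ValueError("train_flops must be positive")
    return (score - baseline_score) / train_flops
```

With four stages the full grid is 16 training runs, which at 70B+ is a large part of why this ablation has not been published; a fractional design that drops some combinations would be the likely compromise.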