What's the right alignment stack post-RLHF?

Why this matters

The 2022-23 RLHF recipe (SFT → reward model → PPO) was the answer for one model generation. Since then, DPO folded reward modeling and PPO into a single supervised objective, KTO dropped the need for paired preferences, Constitutional AI / RLAIF replaced much of the human feedback with model feedback, and DeepSeek-R1 showed that RL on verifiable rewards can drive reasoning (its R1-Zero variant used RL alone, with no SFT stage). The landscape is now a menu, not a pipeline, and which subset works best at frontier scale is unclear. Picking wrong wastes compute; picking right could make alignment an estimated 2-10× cheaper.

Current best understanding

(2026-05-05) There’s a rough consensus on the shape:

SFT on high-quality demonstrations (LIMA showed quality > quantity here).
Preference learning via DPO or KTO for general helpfulness and harmlessness, often iteratively.
RLAIF / Constitutional AI to scale preference data beyond what humans can label.
RL on verifiable rewards (math, code, logic) for reasoning capabilities — the DeepSeek-R1 path.
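
To make the preference-learning stage concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy and a frozen reference model; the function names and the default beta=0.1 are illustrative choices, not taken from any particular library.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, an assumption of this sketch
) -> float:
    """Per-example DPO loss from sequence log-probabilities."""
    # Implicit reward of each response: how far the policy has moved
    # away from the reference model on that response.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains more than the rejected one.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Policy identical to the reference: the loss is log(2).
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
# Policy now favors the chosen response: the loss drops below log(2).
print(dpo_loss(-8.0, -12.0, -10.0, -10.0) < math.log(2))  # True
```

No reward model or sampling loop appears anywhere in the sketch, which is the sense in which DPO turns the RM + PPO stages into a single supervised step.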

What’s not settled: whether the stages are sequential, interleaved, or jointly optimized; how much each contributes at frontier scale; how the stages interact (does later RL erase earlier SFT? does RLAIF preserve nuance from human labels?).

Evidence

training-language-models-to-follow-instructions-with-human-feedback — InstructGPT, the original SFT → RM → PPO recipe.
learning-to-summarize-human-feedback — The pre-InstructGPT precedent showing reward-model + PPO works.
direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO collapses RM + PPO into one supervised step.
kto-model-alignment-prospect-theoretic-optimization — KTO removes paired-preference requirement.
constitutional-ai-harmlessness-from-ai-feedback — Constitutional AI / RLAIF: scale preference data with model self-critique.
lima-less-is-more-for-alignment — Quality dominates quantity in SFT data.
deepseek-r1-reasoning-via-reinforcement-learning — RL on verifiable rewards as a reasoning-specific track.
[2024] orpo-monolithic-preference-optimization — collapses SFT + preference alignment into one monolithic pass with an odds ratio penalty; no reference model, no separate DPO stage; beats Zephyr-β (SFT+DPO) at 7B.
[2024] self-rewarding-language-models — removes the frozen reward model ceiling: the LLM scores its own outputs via LLM-as-a-Judge, and both instruction-following and reward-modeling ability improve across 3 iterative DPO rounds. LLaMA 2 70B M₃ reaches 20.44% AlpacaEval 2.0 win rate, beating Claude 2 and Gemini Pro using only ~3,200 seed examples.
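
The odds ratio penalty in the ORPO entry above can be sketched as follows. This is a simplified per-example version under stated assumptions: each response is summarized by its length-normalized (average per-token) log-probability under the model being trained, and the weight `lam` and all function names are hypothetical, chosen for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp)."""
    if avg_logp >= 0:
        raise ValueError("avg_logp must be negative so that p < 1")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp: float, rejected_avg_logp: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    # Ordinary SFT term: negative log-likelihood of the chosen response.
    nll = -chosen_avg_logp
    # Penalty term: push the odds of the chosen response above the rejected one.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    penalty = -log_sigmoid(log_odds_ratio)
    return nll + lam * penalty


# Same chosen likelihood, but the rejected response is less likely in the
# second call, so the penalty (and the total loss) is smaller.
tied = orpo_loss(math.log(0.6), math.log(0.6))
separated = orpo_loss(math.log(0.6), math.log(0.2))
print(separated < tied)  # True
```

Only the model being trained appears in the computation, which matches the entry's point that ORPO needs neither a reference model nor a separate DPO stage.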

What would settle it

A controlled head-to-head ablation at 70B+: each stage on/off, measuring downstream eval lift per FLOP.
A frontier lab publishing an end-to-end recipe with rationale (rare — most stop at “we used RLHF”).
Long-running evals on whether iterative DPO or PPO produces less reward-hacking over many alignment rounds.
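
As a rough sketch of what the first item would involve, the snippet below enumerates every on/off combination of the four stages from "Current best understanding" and defines the lift-per-FLOP metric. The stage labels, function names, and the numbers in the usage line are hypothetical placeholders, not results from any experiment.

```python
from itertools import product

# Hypothetical labels for the four stages listed earlier in this note.
STAGES = ("sft", "preference", "rlaif", "rlvr")


def ablation_grid(stages=STAGES):
    """Return every on/off combination of the stages as a list of dicts."""
    return [
        dict(zip(stages, flags))
        for flags in product((False, True), repeat=len(stages))
    ]


def lift_per_flop(score: float, baseline_score: float, flops: float) -> float:
    """Eval improvement over the baseline per FLOP of alignment compute."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return (score - baseline_score) / flops


grid = ablation_grid()
print(len(grid))  # 16 configurations for 4 stages
print(grid[0])    # every stage off: the base-model baseline

# Placeholder numbers only, to show how the metric would be computed.
example = lift_per_flop(score=0.62, baseline_score=0.50, flops=1e21)
```

Sixteen full training runs at 70B+ is a large part of why this ablation has not been published; a fractional design that drops some combinations would cut the cost at the price of confounding stage interactions.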
