Why this matters

The 2022-23 RLHF recipe (SFT → reward model → PPO) was the standard answer for one model generation. Since then, DPO folded the reward model into the policy loss, KTO dropped the need for paired preferences, Constitutional AI / RLAIF replaced human feedback with model feedback, and DeepSeek-R1 leaned on RL with verifiable rewards (its R1-Zero variant used RL alone, with no SFT stage). The landscape is now a menu, not a pipeline, and which subset works best at frontier scale is unclear. Picking wrong wastes compute; picking right could make alignment 2-10× cheaper.

Current best understanding

(2026-05-05) There’s a rough consensus on the shape:

  1. SFT on high-quality demonstrations (LIMA's results with 1,000 curated examples suggest quality matters more than quantity here).
  2. Preference learning via DPO or KTO for general helpfulness and harmlessness, often iteratively.
  3. RLAIF / Constitutional AI to scale preference data beyond what humans can label.
  4. RL on verifiable rewards (math, code, logic) for reasoning capabilities — the DeepSeek-R1 path.
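
For concreteness, stages 2 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not any lab's implementation: dpo_loss follows the published DPO objective for a single preference pair, given summed sequence log-probabilities under the policy and a frozen reference model. The verifiable_reward function, its exact-match normalization, and the default beta=0.1 are assumptions chosen for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response.
    beta=0.1 is an illustrative default, not a recommendation.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 on an exact match after light normalization.

    Hypothetical checker; real setups typically use math parsers or
    unit tests rather than string comparison.
    """
    def normalize(s: str) -> str:
        return s.strip().lower().replace(" ", "")

    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0
```

The contrast is the point: the DPO loss needs no separate reward model because the log-probability ratios act as an implicit reward, while the verifiable reward needs no preference data at all, only a checkable answer.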

What’s not settled: whether the stages should run sequentially, be interleaved, or be jointly optimized; how much each contributes at frontier scale; and how the stages interact (does later RL erase earlier SFT? does RLAIF preserve nuance from human labels?).

Evidence

What would settle it

  • A controlled head-to-head ablation at 70B+: each stage on/off, measuring downstream eval lift per FLOP.
  • A frontier lab publishing an end-to-end recipe with rationale (rare — most stop at “we used RLHF”).
  • Long-running evals on whether iterative DPO or PPO produces less reward-hacking over many alignment rounds.
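
The ablation in the first bullet amounts to a 2^4 grid over the four stages listed earlier. A rough sketch of that grid follows. The stage names are hypothetical labels, and lift_per_flop assumes one possible reading of the metric (eval gain over a baseline divided by training FLOPs), since the metric is not defined precisely here.

```python
from itertools import product

# Hypothetical labels for the four stages; not an established naming scheme.
STAGES = ("sft", "preference", "rlaif", "rlvr")


def ablation_grid(stages=STAGES):
    """Return every on/off combination of the stages as a list of dicts."""
    return [dict(zip(stages, flags))
            for flags in product((False, True), repeat=len(stages))]


def lift_per_flop(eval_score: float, baseline_score: float, flops: float) -> float:
    """Eval improvement over the baseline per training FLOP spent.

    Assumed definition: (score - baseline) / FLOPs.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    return (eval_score - baseline_score) / flops


if __name__ == "__main__":
    for config in ablation_grid():
        enabled = [name for name, on in config.items() if on]
        print(enabled or ["base model only"])
```

Sixteen runs at 70B+ is the reason this experiment is expensive. A fractional design that toggles one stage at a time needs fewer runs, at the cost of missing interaction effects between stages.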