Stub — full ingest pending.

Stiennon et al. (2020) train a reward model on human preference comparisons between summaries, then use PPO to optimize a language model against that reward model. This is the foundational paper for RLHF on language models: the pipeline it introduces (preference data collection, reward model training, PPO fine-tuning, a KL penalty against the initial policy) is directly inherited by InstructGPT and ChatGPT. The paper demonstrates that RLHF produces summaries humans strongly prefer over those from supervised fine-tuning baselines.
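The two core pieces of the pipeline can be sketched in a few lines: a Bradley-Terry style loss that trains the reward model to score the human-preferred summary higher, and a KL-shaped reward that penalizes the PPO policy for drifting from the initial (supervised) model. This is a minimal scalar sketch, not the paper's code; the function names and the `beta` coefficient are illustrative.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model training loss on one comparison pair:
    -log sigmoid(r_chosen - r_rejected), minimized when the
    human-preferred summary gets the higher reward score."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, beta: float = 0.1) -> float:
    """Per-sample reward used by PPO: the reward-model score minus a
    KL penalty term, beta * (log pi(y|x) - log pi_ref(y|x)), which
    keeps the optimized policy close to the supervised reference."""
    return reward - beta * (logp_policy - logp_ref)
```

The loss is small when the chosen summary already outscores the rejected one (`preference_loss(2.0, 0.0)` is about 0.13 versus about 2.13 for the reverse ordering), and the KL term vanishes when policy and reference assign the sample the same log-probability.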

Key claim: Training on human preference comparisons via a reward model and PPO produces outputs that humans prefer over supervised fine-tuning alone.