Stub — full ingest pending.

Stiennon et al. (2020) train a reward model on human preference comparisons between summaries, then use PPO to optimize a language model against that reward model. This is the foundational paper for RLHF on language models: the pipeline it introduces (preference data collection, reward model training, PPO fine-tuning, a KL penalty against the initial policy) is directly inherited by InstructGPT and ChatGPT. The paper demonstrates that RLHF produces summaries humans strongly prefer over those from supervised fine-tuning baselines.
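The two core pieces of the pipeline can be sketched in a few lines: a Bradley-Terry style loss that trains the reward model to score the human-preferred summary higher, and a KL-shaped reward that penalizes the PPO policy for drifting from the initial (supervised) model. This is a minimal scalar sketch, not the paper's code; the function names and the `beta` coefficient are illustrative.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model training loss on one comparison pair:
    -log sigmoid(r_chosen - r_rejected), minimized when the
    human-preferred summary gets the higher reward score."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, beta: float = 0.1) -> float:
    """Per-sample reward used by PPO: the reward-model score minus a
    KL penalty term, beta * (log pi(y|x) - log pi_ref(y|x)), which
    keeps the optimized policy close to the supervised reference."""
    return reward - beta * (logp_policy - logp_ref)
```

The loss is small when the chosen summary already outscores the rejected one (`preference_loss(2.0, 0.0)` is about 0.13 versus about 2.13 for the reverse ordering), and the KL term vanishes when policy and reference assign the sample the same log-probability.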

Key claim: Training on human preference comparisons via a reward model and PPO produces outputs that humans prefer over supervised fine-tuning alone.