Summary
InstructGPT fine-tunes GPT-3 with a three-stage pipeline: supervised fine-tuning on human demonstrations, reward-model training on pairwise human rankings, and PPO optimization against the learned reward signal. The resulting models are strongly preferred by humans over the raw pretrained baseline; a 1.3B InstructGPT is preferred to the 175B GPT-3 in human evaluations despite having over 100x fewer parameters, suggesting that alignment via human feedback buys far more per unit of compute than further scaling. The key innovation is treating “what humans want” as a learnable objective, operationalized through roughly 40 human contractors writing demonstrations and ranking model responses.
Key Claims
- Outputs from the 175B InstructGPT are preferred to GPT-3 175B outputs 85±3% of the time; preferred to few-shot GPT-3 71±4% of the time
- InstructGPT 1.3B is preferred over GPT-3 175B — alignment closes a 100x parameter gap
- Closed-domain hallucination rate: 21% (InstructGPT) vs 41% (GPT-3)
- Toxicity reduced ~25% when prompted to be respectful
- RLHF fine-tuning costs ~60 petaflop/s-days vs 3,640 for GPT-3 pretraining (1.6% of original compute)
- Inter-labeler agreement: ~73%; reward model achieves 69.6% on held-out labelers (near ceiling)
- Alignment tax is real: PPO alone degrades SQuAD, DROP, HellaSwag; PPO-ptx partially mitigates this
- When explicitly prompted to be toxic, InstructGPT is more toxic than GPT-3: it follows user instructions rather than applying an independent ethical standard
Methods
Three-stage RLHF pipeline:
Stage 1 — Supervised Fine-Tuning (SFT): ~40 human contractors write ideal responses to ~13,000 prompts, drawn largely from the OpenAI API. GPT-3 is fine-tuned on these (prompt, ideal response) pairs with standard cross-entropy loss. Training: 16 epochs, cosine LR decay, residual dropout 0.2.
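The SFT objective above is ordinary next-token cross-entropy restricted to the response span. A minimal sketch (the function name, argument layout, and masking convention are illustrative, not from the paper):

```python
import math

def sft_loss(target_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    target_logprobs: log-probability the model assigns to each target token.
    response_mask:   1 for response tokens, 0 for prompt tokens, so the
                     demonstration's prompt does not contribute to the loss.
    """
    terms = [-lp for lp, m in zip(target_logprobs, response_mask) if m]
    return sum(terms) / len(terms)

# Toy example: one prompt token, then probability 0.5 on each of two
# response tokens; the prompt token is masked out of the average.
loss = sft_loss([math.log(0.9), math.log(0.5), math.log(0.5)], [0, 1, 1])
```

Masking the prompt means the model is only penalized for how it continues the demonstration, not for reproducing the user's input.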
Stage 2 — Reward Model (RM): For ~33,000 prompts, the SFT model generates K=4–9 responses per prompt and contractors rank them best-to-worst. Each ranking yields C(K,2) pairwise comparisons (e.g. K=9 → 36 pairs). A 6B-parameter RM is trained on these with a pairwise ranking loss: maximize log σ(r(x,y_w) − r(x,y_l)) for each (winner, loser) pair, with all comparisons from a prompt processed in a single batch element to avoid overfitting.
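The Stage 2 ranking loss can be sketched as follows; `rm_pairwise_loss` and its inputs are illustrative stand-ins for the 6B RM's scalar scores, not the paper's implementation:

```python
import math
from itertools import combinations

def rm_pairwise_loss(scores, ranking):
    """Mean -log sigmoid(r(x, y_w) - r(x, y_l)) over all C(K, 2) pairs.

    scores:  maps each response to its scalar RM score r(x, y).
    ranking: the K responses ordered best-to-worst by the labeler, so in
             every pair from combinations() the winner comes first.
    """
    pairs = list(combinations(ranking, 2))
    loss = 0.0
    for y_w, y_l in pairs:
        diff = scores[y_w] - scores[y_l]
        loss += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid(diff)
    return loss / len(pairs)

# K = 3 responses; RM scores agree with the labeler's ranking a > b > c.
good = rm_pairwise_loss({"a": 2.0, "b": 1.0, "c": 0.0}, ["a", "b", "c"])
# Same scores, reversed ranking: every pair is misordered, so the loss rises.
bad = rm_pairwise_loss({"a": 2.0, "b": 1.0, "c": 0.0}, ["c", "b", "a"])
```

A loss of log 2 (≈0.693) corresponds to the RM being indifferent on every pair; scores that agree with the ranking drive it below that, disagreement drives it above.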
Stage 3 — PPO Fine-Tuning: The SFT model is further fine-tuned with PPO to maximize RM scores, with a per-token KL divergence penalty against the original SFT model to prevent reward hacking. The PPO-ptx variant additionally mixes in pretraining gradient updates (coefficient γ) to preserve general NLP performance.
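The reward PPO maximizes in Stage 3 combines the RM score with the KL penalty. A minimal sketch, assuming a simple log-ratio KL estimate summed over tokens; `rlhf_reward`, its arguments, and the `beta` default are illustrative, not the paper's exact estimator or coefficient:

```python
def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """RM score minus beta times the policy's KL drift from the SFT model.

    policy_logprobs / sft_logprobs: log-probability of each sampled response
    token under the PPO policy and the frozen SFT model. The penalty grows
    as the policy drifts from SFT, discouraging reward hacking.
    """
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl

# Policy identical to SFT: zero penalty, reward equals the RM score.
r_same = rlhf_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# Policy concentrates probability beyond SFT: positive KL shrinks the reward.
r_drift = rlhf_reward(1.0, [-0.1, -0.2], [-1.0, -2.0])
```

The penalty is what anchors the optimized policy: without it, PPO can find degenerate outputs the RM scores highly but humans would reject.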
Prompt distribution (RM dataset): Generation 45.6%, Open QA 12.4%, Brainstorming 11.2%, Chat 8.4%, Rewrite 6.6%, Summarization 4.2%, Classification 3.5%, other tasks.
Connections
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO (2023) eliminates the RM and PPO loop, replacing them with a supervised loss on preference data derived from the same insight
- attention-is-all-you-need — InstructGPT fine-tunes GPT-3, which is built on the Transformer architecture
- chain-of-thought-prompting — few-shot GPT-3 (one comparison baseline) uses chain-of-thought-style prompting
- rlhf — the three-stage pipeline (SFT → RM → PPO) this paper operationalizes and validates at scale
- sft — Stage 1: supervised fine-tuning on human-written demonstrations
- reward-model — Stage 2: 6B parameter RM trained on pairwise human rankings
- alignment — making model outputs match human intent is the central goal
- ppo — Stage 3: PPO optimizes the policy against the reward model with a KL penalty
- inference-efficiency — 1.3B InstructGPT preferred over 175B GPT-3, showing alignment is more efficient than scale
Citation
@article{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ryan},
journal={arXiv preprint arXiv:2203.02155},
year={2022}
}