Summary
InstructGPT fine-tunes GPT-3 with a three-stage pipeline: supervised fine-tuning on human demonstrations, reward-model training on pairwise human rankings, and PPO optimization against the learned reward signal. The resulting models are strongly preferred by humans over the raw pretrained baseline; a 1.3B InstructGPT is preferred to the 175B GPT-3 in human evaluations despite having over 100x fewer parameters, suggesting that alignment via human feedback buys far more per unit of compute than further scaling. The key innovation is treating “what humans want” as a learnable objective, operationalized through roughly 40 human contractors writing demonstrations and ranking model responses.
Key Claims
- Outputs from the 175B InstructGPT are preferred to GPT-3 175B outputs 85±3% of the time; preferred to few-shot GPT-3 71±4% of the time
- InstructGPT 1.3B is preferred over GPT-3 175B — alignment closes a 100x parameter gap
- Closed-domain hallucination rate: 21% (InstructGPT) vs 41% (GPT-3)
- Toxicity reduced ~25% when prompted to be respectful
- RLHF fine-tuning costs ~60 petaflop/s-days vs 3,640 for GPT-3 pretraining (1.6% of original compute)
- Inter-labeler agreement: ~73%; reward model achieves 69.6% on held-out labelers (near ceiling)
- Alignment tax is real: PPO alone degrades SQuAD, DROP, HellaSwag; PPO-ptx partially mitigates this
- When explicitly prompted to be toxic, InstructGPT is more toxic than GPT-3: it follows user instructions rather than applying an independent ethical standard
Methods
Three-stage RLHF pipeline:
Stage 1 — Supervised Fine-Tuning (SFT): ~40 human contractors write ideal responses to ~13,000 prompts, drawn largely from the OpenAI API. GPT-3 is fine-tuned on these (prompt, ideal response) pairs with standard cross-entropy loss. Training: 16 epochs, cosine LR decay, residual dropout 0.2.
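The SFT objective above is ordinary next-token cross-entropy restricted to the response span. A minimal sketch (the function name, argument layout, and masking convention are illustrative, not from the paper):

```python
import math

def sft_loss(target_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    target_logprobs: log-probability the model assigns to each target token.
    response_mask:   1 for response tokens, 0 for prompt tokens, so the
                     demonstration's prompt does not contribute to the loss.
    """
    terms = [-lp for lp, m in zip(target_logprobs, response_mask) if m]
    return sum(terms) / len(terms)

# Toy example: one prompt token, then probability 0.5 on each of two
# response tokens; the prompt token is masked out of the average.
loss = sft_loss([math.log(0.9), math.log(0.5), math.log(0.5)], [0, 1, 1])
```

Masking the prompt means the model is only penalized for how it continues the demonstration, not for reproducing the user's input.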
Stage 2 — Reward Model (RM): For ~33,000 prompts, the SFT model generates K=4–9 responses per prompt and contractors rank them best-to-worst. Each ranking yields C(K,2) pairwise comparisons (e.g. K=9 → 36 pairs). A 6B-parameter RM is trained on these with a pairwise ranking loss: maximize log σ(r(x,y_w) − r(x,y_l)) for each (winner, loser) pair, with all comparisons from a prompt processed in a single batch element to avoid overfitting.
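The Stage 2 ranking loss can be sketched as follows; `rm_pairwise_loss` and its inputs are illustrative stand-ins for the 6B RM's scalar scores, not the paper's implementation:

```python
import math
from itertools import combinations

def rm_pairwise_loss(scores, ranking):
    """Mean -log sigmoid(r(x, y_w) - r(x, y_l)) over all C(K, 2) pairs.

    scores:  maps each response to its scalar RM score r(x, y).
    ranking: the K responses ordered best-to-worst by the labeler, so in
             every pair from combinations() the winner comes first.
    """
    pairs = list(combinations(ranking, 2))
    loss = 0.0
    for y_w, y_l in pairs:
        diff = scores[y_w] - scores[y_l]
        loss += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid(diff)
    return loss / len(pairs)

# K = 3 responses; RM scores agree with the labeler's ranking a > b > c.
good = rm_pairwise_loss({"a": 2.0, "b": 1.0, "c": 0.0}, ["a", "b", "c"])
# Same scores, reversed ranking: every pair is misordered, so the loss rises.
bad = rm_pairwise_loss({"a": 2.0, "b": 1.0, "c": 0.0}, ["c", "b", "a"])
```

A loss of log 2 (≈0.693) corresponds to the RM being indifferent on every pair; scores that agree with the ranking drive it below that, disagreement drives it above.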
Stage 3 — PPO Fine-Tuning: The SFT model is further fine-tuned with PPO to maximize RM scores, with a per-token KL divergence penalty against the original SFT model to prevent reward hacking. The PPO-ptx variant additionally mixes in pretraining gradient updates (coefficient γ) to preserve general NLP performance.
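The reward PPO maximizes in Stage 3 combines the RM score with the KL penalty. A minimal sketch, assuming a simple log-ratio KL estimate summed over tokens; `rlhf_reward`, its arguments, and the `beta` default are illustrative, not the paper's exact estimator or coefficient:

```python
def rlhf_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """RM score minus beta times the policy's KL drift from the SFT model.

    policy_logprobs / sft_logprobs: log-probability of each sampled response
    token under the PPO policy and the frozen SFT model. The penalty grows
    as the policy drifts from SFT, discouraging reward hacking.
    """
    kl = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl

# Policy identical to SFT: zero penalty, reward equals the RM score.
r_same = rlhf_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
# Policy concentrates probability beyond SFT: positive KL shrinks the reward.
r_drift = rlhf_reward(1.0, [-0.1, -0.2], [-1.0, -2.0])
```

The penalty is what anchors the optimized policy: without it, PPO can find degenerate outputs the RM scores highly but humans would reject.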
Prompt distribution (RM dataset): Generation 45.6%, Open QA 12.4%, Brainstorming 11.2%, Chat 8.4%, Rewrite 6.6%, Summarization 4.2%, Classification 3.5%, other tasks.
Connections
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO (2023) eliminates the RM and PPO loop, replacing them with a supervised loss on preference data derived from the same insight
- attention-is-all-you-need — InstructGPT fine-tunes GPT-3, which is built on the Transformer architecture
- chain-of-thought-prompting — few-shot GPT-3 (one comparison baseline) uses chain-of-thought-style prompting
- rlhf — the three-stage pipeline (SFT → RM → PPO) this paper operationalizes and validates at scale
- sft — Stage 1: supervised fine-tuning on human-written demonstrations
- reward-model — Stage 2: 6B parameter RM trained on pairwise human rankings
- alignment — making model outputs match human intent is the central goal
- ppo — Stage 3: PPO optimizes the policy against the reward model with a KL penalty
- inference-efficiency — 1.3B InstructGPT preferred over 175B GPT-3, showing alignment is more efficient than scale
Citation
@article{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L. and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul and Leike, Jan and Lowe, Ryan},
journal={arXiv preprint arXiv:2203.02155},
year={2022}
}