Concepts: ppo | policy-gradient | reinforcement-learning | rlhf
Builds on: Trust Region Policy Optimization (TRPO, Schulman et al. 2015)
Leads to: learning-to-summarize-human-feedback | grpo-deepseekmath-group-relative-policy-optimization
Every RLHF system you’ve heard of — InstructGPT, ChatGPT, Claude — trains the final alignment step with PPO. Not because it’s the only option, but because it’s the algorithm that actually works at scale without requiring a PhD to tune. This paper, published by Schulman and colleagues at OpenAI in 2017, solved a problem that had blocked practical RL for years: how do you train a policy to improve without it catastrophically forgetting everything it learned?
## The core idea
The analogy: Imagine you’re coaching an athlete who performed a great move in practice. You want them to do that move more. The naive approach: “that worked, so do it 10x more often.” The problem: the move worked in a specific context, against a specific opponent, on a specific surface. Forcing them to repeat it obsessively destroys their overall game — they become one-dimensional, easy to counter, worse than before you started “optimizing.”
The right approach: encourage the move, but put a governor on how much their style can shift each session. Update. Observe. Update again. Small corrections compound into large improvements without ever taking a step so large you can’t recover.
PPO is that governor for neural networks.
Before PPO, the state-of-the-art was TRPO (Trust Region Policy Optimization). TRPO is theoretically elegant: it mathematically computes the maximum safe step size by solving a constrained optimization problem. But it requires computing the curvature of the KL divergence between old and new policy — second-order derivatives, conjugate gradient solvers, line searches. It can’t easily be applied to networks that share parameters between the policy and value function. In practice, implementing TRPO correctly is hard; getting it to train a large model is harder still.
“We propose a novel objective function that enables multiple epochs of minibatch updates… with sample complexity much better than basic policy gradient methods, and simpler to implement than TRPO.”
PPO’s insight: you don’t need the exact KL computation. You need the effect — preventing destructively large updates. And you can achieve that effect by clipping the probability ratio directly.
## The mechanism, step by step
Every policy gradient method starts from the same place: update the policy to take actions that led to high rewards more often, and low-reward actions less often. The basic policy gradient objective is:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\hat{A}_t$ is the estimated advantage — how much better (or worse) this action was relative to the average expected outcome from this state.
The problem: nothing in this objective prevents the policy from updating too aggressively. One good experience can drive the policy to change dramatically, collapsing into a degenerate solution.
TRPO adds a hard constraint: the KL divergence between old and new policy must stay below some threshold $\delta$:

$$\max_\theta \; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$
Computing this constraint requires second-order optimization. PPO drops the constraint and replaces it with a clipped surrogate objective.
Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When $r_t(\theta) > 1$, the new policy is doing this action more than the old policy. When $r_t(\theta) < 1$, it's doing it less. The unclipped objective $r_t(\theta)\hat{A}_t$ gives unbounded credit for moving in the advantageous direction.
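In practice, implementations typically store per-action log-probs at collection time and recover the ratio by exponentiating the difference, which is the numerically stable form. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def prob_ratio(logp_new, logp_old):
    """r_t = pi_new(a|s) / pi_old(a|s), computed from per-action log-probs.

    Working in log space avoids dividing two tiny probabilities directly.
    """
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

# Example: an action whose probability moved from 0.30 to 0.45 has ratio 1.5
r = prob_ratio(np.log(0.45), np.log(0.30))
```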
PPO-Clip limits that credit:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

The clip range is $[1-\epsilon,\, 1+\epsilon]$, typically with $\epsilon = 0.2$. The min makes this a pessimistic lower bound:
- New policy strongly favors a good action ($\hat{A}_t > 0$, $r_t > 1+\epsilon$): clip fires, objective stops growing. No extra credit for overconfidence.
- New policy makes a modest update ($1-\epsilon \le r_t \le 1+\epsilon$): clip doesn't engage, objective behaves like unclipped.
- New policy does a bad action more ($\hat{A}_t < 0$, $r_t > 1$): unclipped term is more negative, min preserves full penalty.
Full objective adds a value function loss and entropy bonus:

$$L_t(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$

where:

- $L_t^{VF}(\theta) = \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$ — value function loss, MSE against the TD target
- $S[\pi_\theta](s_t)$ — entropy bonus, preventing the policy from collapsing to deterministic too quickly
- $c_1$, $c_2$ — coefficients, from Table 4 in the paper
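The combined objective can be sketched in a few lines of NumPy. The coefficients below ($c_1 = 0.5$, $c_2 = 0.01$) are illustrative assumptions, not values quoted from the paper's table, and the sign is flipped so a gradient *minimizer* can be applied:

```python
import numpy as np

def ppo_loss(ratio, adv, values, returns, entropy, eps=0.2, c1=0.5, c2=0.01):
    """Sketch of the full PPO objective (negated for minimization)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    l_clip = np.minimum(unclipped, clipped).mean()   # pessimistic lower bound
    l_vf = np.mean((values - returns) ** 2)          # value-function MSE
    # Maximize L_CLIP and entropy, minimize value error:
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```

With ratios at 1 and a perfect value function, the loss reduces to minus the mean advantage — the clip and auxiliary terms only engage as the policy moves.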
```
TRPO:
  Collect data → gradient → compute KL curvature → solve for max safe step
  → conjugate gradient → line search → single update
  (One expensive update per batch. Can't share policy/value params easily.)

PPO:
  Collect data → clipped surrogate → SGD for K=10 epochs → update
  (Multiple cheap minibatch passes. Simple. Works with shared params.)
```
```
THE CLIP IN ACTION (ε = 0.2, clip range = [0.8, 1.2]):

Scenario A — big move toward a good action (Â = +1.5):
  π_old(a|s) = 0.30, π_new(a|s) = 0.45
  r_t = 0.45 / 0.30 = 1.50         (50% more likely)
  r_t × Â  = 1.50 × 1.5 = 2.25     (unclipped)
  clip(1.50, 0.8, 1.2) = 1.20
  clip × Â = 1.20 × 1.5 = 1.80     (clipped)
  min(2.25, 1.80) = 1.80           ← CLIP FIRES. Update capped.

Scenario B — small move (r_t within range):
  π_old = 0.30, π_new = 0.33
  r_t = 1.10                       (within [0.8, 1.2])
  r_t × Â  = 1.10 × 1.5 = 1.65
  clip × Â = 1.10 × 1.5 = 1.65     (no clipping)
  min(1.65, 1.65) = 1.65           ← PASSES THROUGH.

Scenario C — big move toward a bad action (Â = -0.8):
  r_t = 1.50                       (doing a bad action 50% more)
  r_t × Â  = 1.50 × (-0.8) = -1.20
  clip(1.50, 0.8, 1.2) = 1.20
  clip × Â = 1.20 × (-0.8) = -0.96
  min(-1.20, -0.96) = -1.20        ← UNCLIPPED. Full penalty paid.
  (Moving toward a bad action: no forgiveness)
```
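The three scenarios can be checked mechanically with the per-sample clipped term (a sketch; `clipped_term` is a hypothetical helper, not from the paper):

```python
import numpy as np

def clipped_term(r, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return min(r * adv, float(np.clip(r, 1 - eps, 1 + eps)) * adv)

a = clipped_term(1.50, 1.5)    # Scenario A: clip fires, capped at 1.80
b = clipped_term(1.10, 1.5)    # Scenario B: passes through at 1.65
c = clipped_term(1.50, -0.8)   # Scenario C: full -1.20 penalty kept
```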
“The clip removes the incentive for moving outside of the interval … the final surrogate objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective.”
This is the core guarantee. Maximizing a pessimistic lower bound means you never accidentally gain by taking a step that overshoots the trust region.
Advantage estimation (GAE): $\hat{A}_t$ is computed via Generalized Advantage Estimation:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\gamma = 0.99$ (discount factor), $\lambda = 0.95$ (bias-variance tradeoff). This smooths between Monte Carlo returns (high variance) and pure TD bootstrapping (high bias).
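The truncated sum above is usually computed as a backward recursion over the rollout. A sketch, assuming a bootstrap value estimate for the state after the final step (names are illustrative):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE over a rollout; `last_value` bootstraps the final step."""
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=1` recovers (discounted) Monte Carlo advantages; `lam=0` recovers pure one-step TD errors.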
The multiple-epoch trick: A key practical contribution is reusing each collected batch for $K$ epochs of minibatch updates. Basic policy gradient throws away data after one gradient step. PPO's clip prevents the policy from drifting far from the collection policy across those $K = 10$ passes — so you safely extract more signal from each rollout and improve sample efficiency significantly.
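The batch-reuse loop can be sketched schematically; `update_fn` below is a stand-in for the actual gradient step on one minibatch of indices (names are illustrative):

```python
import numpy as np

def ppo_epochs(batch_size, minibatch_size, update_fn, n_epochs=10, seed=0):
    """Run K epochs of shuffled minibatch updates over one collected rollout."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        idx = rng.permutation(batch_size)   # reshuffle each epoch
        for start in range(0, batch_size, minibatch_size):
            update_fn(idx[start:start + minibatch_size])
```

A rollout of 2048 steps with 64-sample minibatches thus yields 10 × (2048 / 64) = 320 gradient steps, versus one for basic policy gradient.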
Find the instinct: What is TRPO actually doing? Preventing the new policy from diverging too far from the old one, where divergence is measured by KL. The probability ratio $r_t(\theta)$ is a direct proxy for divergence: $r_t = 1$ means nothing changed, $r_t = 2$ means the action became twice as likely. Clipping is a first-order approximation of the KL constraint — simpler, cheaper, and empirically sufficient. The PPO authors bet that the theoretical guarantees of TRPO weren't necessary in practice; what mattered was the effect. They were right.
## Results
Benchmarked on 7 MuJoCo continuous-control tasks (1M timesteps) and 49 Atari games (40M frames).
| Environment | PPO | TRPO | A2C |
|---|---|---|---|
| HalfCheetah | 1668 | 1563 | 1096 |
| Walker2d | 3424 | 3259 | 672 |
| Swimmer | 111 | 118 | 46 |
| Hopper | 2496 | 3461* | 1003 |
*TRPO wins on Hopper individually but PPO wins the overall 7-environment benchmark average.
Ablation: Six variants tested — no clipping, fixed KL penalty, adaptive KL penalty, clip-only, clip+value loss, clip+value+entropy. The full PPO-Clip objective wins the benchmark suite. No clipping finishes last. Adaptive KL (PPO-Penalty) is competitive but requires tuning per environment.
On Atari (40M frames): PPO matches A3C on most games, outperforms A2C significantly, runs faster wall-clock by avoiding asynchronous workers.
What doesn’t work: PPO is sensitive to the clip range — too small and learning stalls, too large and you get instability. Value function coefficient and entropy bonus require per-domain tuning. On sparse-reward tasks, advantage estimates are noisy and learning can stall for long periods. The method also runs a value network alongside the policy — roughly doubling compute vs. pure policy gradient — which later motivated grpo-deepseekmath-group-relative-policy-optimization to eliminate it entirely.
## Practical implications
If you’re building ML systems where RL is involved — language model alignment, reward-model optimization, game-playing — PPO is the default starting point. It’s not theoretically optimal, but it’s robust, well-documented, and has community implementations across every major framework.
For RLHF specifically, PPO is used to optimize the language model policy $\pi_\theta$ against a learned reward model $r_\phi$. A KL penalty against the supervised fine-tuned reference policy $\pi^{\text{SFT}}$ is added to prevent reward hacking:

$$R(x, y) = r_\phi(x, y) - \beta \log\frac{\pi_\theta(y \mid x)}{\pi^{\text{SFT}}(y \mid x)}$$
Without the KL term, the model drifts toward degenerate high-reward outputs — repetitive, sycophantic, or incoherent text that scores well on the reward model but fails in practice.
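The penalized reward can be sketched as a per-sequence simplification (real systems typically apply the penalty per token; the log-ratio here is the usual sample-based KL estimate, and `beta` is an illustrative value):

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus a KL penalty against the SFT reference.

    logp_policy / logp_ref are the policies' log-probs of the sampled output.
    """
    kl_term = logp_policy - logp_ref   # sample-based estimate of log(pi/pi_ref)
    return rm_score - beta * kl_term

# A policy that hasn't moved from the reference pays no penalty;
# drift toward higher-likelihood (under itself) outputs is taxed.
```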
When memory is the bottleneck at scale, switch to GRPO: it drops the value model, cuts training memory by ~25%, and uses within-group advantage normalization instead of per-token estimates. For verifiable tasks (math, code), GRPO often matches PPO performance with less overhead. When you don’t have reward signals at all and only have preference data, look at DPO — it skips the RL loop entirely.
Clip the probability ratio to [0.8, 1.2], run 10 gradient epochs on each batch, and you get 90% of TRPO’s stability at 10% of the implementation cost.
## Connections
- ppo — the algorithm introduced in this paper
- policy-gradient — PPO is a policy gradient method with a stabilized surrogate objective
- reinforcement-learning — general RL framework PPO operates within
- rlhf — PPO is the standard RL optimizer in RLHF pipelines
- learning-to-summarize-human-feedback — first application of PPO to RLHF at scale for language models
- grpo-deepseekmath-group-relative-policy-optimization — eliminates the value model from PPO for memory efficiency
## Citation
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint. https://arxiv.org/abs/1707.06347