What It Is

PPO is a reinforcement learning algorithm that constrains each gradient update to stay close (“proximal”) to the current policy, preventing catastrophically large parameter changes. It’s the RL algorithm used in the RLHF stage of InstructGPT and most early aligned LLMs. Schulman et al. (2017) at OpenAI designed it as a practical, stable replacement for TRPO (Trust Region Policy Optimization), achieving similar stability using only first-order (gradient) methods.

Why It Matters

Vanilla policy gradient methods are unstable — a single bad update can destroy months of pretraining by shifting the model into a degenerate region. This is especially catastrophic for LLM fine-tuning where the starting policy (the SFT model) embeds enormous pretrained knowledge. PPO’s clipping constraint makes RL training stable enough to work at LLM scale. Without PPO (or its successors like DPO), applying RL to billion-parameter models would be practically infeasible.

The Trust Region Intuition

The core problem: policy gradient tells you which direction to update the policy, but says nothing about step size. A gradient step that’s too large can move the policy to a region where old trajectory samples are no longer representative — the policy that collected the data is no longer the policy being updated. This distribution shift invalidates the gradient estimate and can cause irreversible collapse.

TRPO enforces a hard constraint: update the policy only if the KL divergence from old to new policy stays below a threshold δ. This is theoretically clean but requires second-order machinery (conjugate-gradient steps involving the Fisher information matrix), which is expensive and complex.

PPO approximates the same constraint with a clipped objective that any standard optimizer can handle.

How It Works

At each step, compute the probability ratio of the new policy to the old policy for a given action:

r(θ) = π_θ(a|s) / π_θ_old(a|s)

Where:

  • π_θ(a|s) — probability of action a in state s under current parameters θ
  • π_θ_old(a|s) — same probability under the policy that collected the trajectory (old parameters)
  • r(θ) = 1 — policy hasn’t changed for this action
  • r(θ) > 1 — new policy assigns higher probability to this action than when the data was collected
  • r(θ) < 1 — new policy assigns lower probability
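
In LLM practice this ratio is computed from per-token log-probabilities rather than raw probabilities, for numerical stability. A minimal illustrative sketch (the function name is ours, not a library API):

```python
import math

def prob_ratio(logp_new: float, logp_old: float) -> float:
    """r(θ) = π_θ(a|s) / π_θ_old(a|s), computed from log-probs
    so the division never underflows or overflows."""
    return math.exp(logp_new - logp_old)

# Same log-prob under both policies: ratio is exactly 1.
assert prob_ratio(-2.0, -2.0) == 1.0
# New policy assigns higher probability than the old one: ratio > 1.
assert prob_ratio(-1.5, -2.0) > 1.0
```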

The PPO clipped objective:

L^CLIP(θ) = E_t[ min( r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t ) ]

Where:

  • A_t — advantage estimate: how much better this action was than the baseline expected return
  • ε — clip parameter, typically 0.2; limits updates to ±20% probability ratio change
  • min(...) — take the pessimistic (lower) bound between clipped and unclipped

Intuition for the min:

  • If advantage is positive (good action) and r > 1+ε (the new policy has already boosted this action aggressively): the clipped term is used and its gradient is zero, so there is no further reward for over-committing.
  • If advantage is positive and r ≤ 1+ε: unclipped; take the full gradient.
  • If advantage is negative (bad action) and r < 1-ε (the probability has already been pushed down past the threshold): the clipped term is used, so there is no further credit for suppressing the action.
  • If advantage is negative and r ≥ 1-ε: unclipped; penalize freely to move away from the bad action.

The asymmetry is key: the min makes the objective a pessimistic bound. It stops rewarding the policy for moving further in a direction it has already over-committed to, but it never blocks a corrective gradient, so mistakes are always penalized at full strength.
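
This clipping behavior can be checked numerically. A hypothetical per-sample version of the objective (real implementations vectorize this over a batch of token-level samples):

```python
def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min of the unclipped and clipped terms."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action already boosted past 1+eps: objective capped at (1+eps)*A.
assert ppo_clip_term(1.5, 2.0) == 1.2 * 2.0
# Bad action whose probability rose past 1+eps: the more negative,
# unclipped term is kept, so the mistake is penalized at full strength.
assert ppo_clip_term(1.5, -2.0) == 1.5 * -2.0
```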

Advantage Estimation

The advantage A_t measures how much better action a_t was than what the critic (value function) expected. In its simplest one-step form:

A_t = r_t + γ·V(s_{t+1}) - V(s_t)

Where:

  • r_t — reward received at step t
  • γ — discount factor (typically 0.99)
  • V(s_t) — critic’s estimate of expected future return from state s_t
  • The critic is a separate neural network trained to minimize the squared error between V(s_t) and the observed return

In practice, PPO uses Generalized Advantage Estimation (GAE), which trades off bias vs. variance across a λ-weighted multi-step horizon:

A_t^GAE = Σ_{l=0..∞} (γλ)^l · δ_{t+l},  with δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
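
A minimal sketch of the GAE recursion in plain Python (the function name and the bootstrap-value convention are illustrative, not taken from a specific library):

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.
    `values` has one more entry than `rewards` (a bootstrap value
    for the state after the last step)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With lam=0, GAE reduces to the one-step TD advantage.
adv = gae([1.0, 0.0], [0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
assert adv == [1.0, -0.5]
```

Setting λ=1 instead recovers the high-variance Monte Carlo advantage; typical LLM fine-tuning uses λ around 0.95.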

PPO in RLHF

For LLM fine-tuning, the “state” is the prompt + partial response, the “action” is the next token, and the “reward” comes from the reward model after the complete response is generated.

Prompt → LLM generates full response
             ↓
Reward model assigns scalar score: r_φ(prompt, response)
             ↓
KL penalty subtracted: β × KL(π_θ || π_sft)
             ↓
Net reward = r_φ - β × KL
             ↓
PPO updates LLM weights to increase net reward

The combined reward signal:

R(prompt, response) = r_φ(prompt, response) - β × KL(π_θ || π_sft)

Where:

  • r_φ(response) — reward model score (proxy for human preference)
  • β — KL penalty weight (typically 0.1-0.2)
  • KL(π_θ || π_sft) — divergence from the SFT reference policy

The KL term is critical: without it, the LLM would “reward hack” — find adversarial outputs that maximize r_φ while becoming increasingly incoherent. The SFT model defines a prior of reasonable language; the KL penalty keeps the RLHF model within that prior.
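
A toy sketch of the net reward computation, assuming the KL term is approximated per token by the log-ratio of policy to reference probabilities summed over the response (a common estimator; function and argument names are ours):

```python
def net_reward(rm_score, logps_policy, logps_ref, beta=0.1):
    """KL-penalized sequence reward: net = r_phi - beta * KL.
    KL is approximated by summing log pi_theta(token) - log pi_sft(token)
    over the generated tokens."""
    kl_estimate = sum(lp - lr for lp, lr in zip(logps_policy, logps_ref))
    return rm_score - beta * kl_estimate

# Policy identical to the SFT reference: no penalty.
assert net_reward(1.0, [-2.0, -1.0], [-2.0, -1.0]) == 1.0
# Policy drifting away from the reference (assigning its own tokens
# higher probability): net reward drops below the raw RM score.
assert net_reward(1.0, [-1.0, -0.5], [-2.0, -1.0]) < 1.0
```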

Four Networks Running Simultaneously

PPO-based RLHF requires maintaining four models in memory:

  1. Actor (policy): The LLM being trained (π_θ)
  2. Critic: A value function estimating expected future reward (often another LLM copy)
  3. Reward model: Fixed network scoring outputs (r_φ)
  4. Reference policy: Frozen SFT model for KL computation (π_sft)

This is why PPO-based RLHF needs roughly four times the model memory of SFT (though only the actor and critic carry gradients and optimizer states), and it is a primary motivation for DPO.
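
Back-of-envelope arithmetic behind the 4× figure, for a hypothetical 7B-parameter model stored in fp16 (weights only; optimizer states and activations for the actor and critic add considerably more):

```python
PARAMS = 7e9           # hypothetical 7B-parameter model, same size for all four
BYTES_PER_PARAM = 2    # fp16 weights

sft_gb = PARAMS * BYTES_PER_PARAM / 1e9        # one model resident
rlhf_gb = 4 * PARAMS * BYTES_PER_PARAM / 1e9   # actor + critic + RM + reference

assert sft_gb == 14.0 and rlhf_gb == 56.0
```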

Why PPO Over TRPO

Property            | TRPO                             | PPO
--------------------|----------------------------------|------------------------
Constraint          | Hard KL constraint               | Soft clipping
Optimizer           | Conjugate gradient + line search | Any first-order (Adam)
Implementation      | Complex                          | Simple
Compute per update  | High (Fisher matrix)             | Low
Stability           | Slightly better theoretically    | Empirically equivalent
LLM applicability   | Requires custom implementation   | Off-the-shelf

TRPO’s hard constraint is more theoretically principled but requires computing natural gradient steps — expensive and complex. PPO achieves similar empirical stability through the simpler clipping mechanism, which scales to the multi-billion-parameter regime.

Hyperparameter Sensitivity

PPO is notoriously sensitive:

  • Clip ε: Too small → slow learning (over-constrained); too large → instability
  • KL penalty β: Too small → reward hacking; too large → ignores reward signal
  • Number of PPO epochs per batch: More epochs → overfitting to old trajectories; fewer → inefficient
  • Critic learning rate: Must be tuned relative to actor LR; critic too slow → bad advantage estimates

InstructGPT used ε=0.2 (standard PPO), β varied across runs, and reported that hyperparameter sensitivity was one of the primary practical challenges.

Key Sources

  • rlhf — the pipeline in which PPO serves as the optimization step
  • alignment — PPO’s role in the broader alignment strategy
  • reward-model — the network PPO optimizes against
  • sft — SFT model is the reference policy for KL regularization
  • dpo — DPO eliminates the need for PPO entirely by reformulating RLHF as supervised learning

Open Questions

  • Can PPO be replaced by simpler online RL methods for RLHF without quality loss?
  • How does the KL penalty interact with reward model quality — is a weak RM more dangerous with a weak KL constraint?
  • Optimal PPO epoch count for LLM fine-tuning (empirically 1-4, but poorly understood)
  • Whether group relative policy optimization (GRPO, used in DeepSeek-R1) supersedes PPO for reasoning tasks