What It Is

A class of reinforcement learning algorithms that directly optimize a parameterized policy by computing the gradient of expected reward with respect to the policy parameters — updating toward actions that led to higher-than-expected outcomes.

Why It Matters

Policy gradient methods are the foundation of RLHF. Unlike value-based methods such as Q-learning, which must estimate and maximize over action values, they work directly in high-dimensional action spaces like token vocabularies, making them the natural choice for fine-tuning language models with reward signals.

How It Works

The core objective is to maximize expected reward: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. The policy gradient theorem gives the gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\Big]$$

where $\hat{A}_t$ is the advantage estimate: how much better this action was than expected. Actions with positive advantage get higher probability; actions with negative advantage get lower probability. The challenge is that naive policy gradient updates can be too large, destabilizing training, which motivates constrained variants like TRPO and clipped variants like PPO.
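The update rule above can be sketched concretely. The toy problem below (a stateless softmax policy over four actions, a made-up reward that pays only for action 2, and a batch-mean baseline) is purely illustrative and not from the source; it shows the mechanics of weighting log-probability gradients by advantages:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a softmax policy over 4 actions parameterized
# directly by logits theta (no state, for simplicity).
n_actions = 4
theta = np.zeros(n_actions)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Assumed toy reward: action 2 pays 1.0, every other action pays 0.0.
def reward(action):
    return 1.0 if action == 2 else 0.0

learning_rate = 0.5
for _ in range(200):
    probs = softmax(theta)
    # Sample a batch of actions and rewards from the current policy.
    actions = rng.choice(n_actions, size=32, p=probs)
    rewards = np.array([reward(a) for a in actions])
    # Advantage estimate: reward minus a batch-mean baseline.
    advantages = rewards - rewards.mean()
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs,
    # so each sample pushes probability toward (or away from) its action
    # in proportion to its advantage.
    grad = np.zeros(n_actions)
    for a, adv in zip(actions, advantages):
        one_hot = np.zeros(n_actions)
        one_hot[a] = 1.0
        grad += adv * (one_hot - probs)
    theta += learning_rate * grad / len(actions)

# After training, probability mass should concentrate on action 2.
print(softmax(theta))
```

Note how the baseline leaves the gradient unbiased but reduces variance: samples are compared against the batch average rather than raw reward.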
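The clipping idea behind PPO-style variants can also be sketched in a few lines. This is a minimal illustration of the clipped surrogate objective, assuming the common clip range epsilon = 0.2; the specific numbers are invented for the example:

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantage, epsilon=0.2):
    # Probability ratio between the new and old policies.
    ratio = np.exp(new_logp - old_logp)
    # Clip the ratio so a single update cannot move the policy too far.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (elementwise minimum) objective, then average.
    return np.minimum(ratio * advantage, clipped * advantage).mean()

# Example: a ratio of 0.6/0.4 = 1.5 with positive advantage is capped
# at 1 + epsilon = 1.2, limiting the size of the update.
obj = clipped_surrogate(np.log(np.array([0.6])),
                        np.log(np.array([0.4])),
                        np.array([1.0]))
print(obj)
```

With a positive advantage the clip caps how much the objective rewards increasing an action's probability; with a negative advantage it caps the penalty symmetrically.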

Key Sources