What It Is
A class of reinforcement learning algorithms that directly optimize a parameterized policy by computing the gradient of expected reward with respect to the policy parameters — updating toward actions that led to higher-than-expected outcomes.
Why It Matters
Policy gradient methods are the foundation of RLHF. Unlike value-based methods such as Q-learning, they work directly in high-dimensional action spaces like token vocabularies — making them the natural choice for fine-tuning language models with reward signals.
How It Works
The core objective is to maximize expected reward: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. The policy gradient theorem gives the gradient:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$

where $\hat{A}_t$ is the advantage estimate — how much better this action was than expected. Actions with positive advantage get higher probability; actions with negative advantage get lower probability. The challenge is that naive policy gradient updates can be too large, destabilizing training — motivating constrained variants like TRPO and clipped variants like PPO.
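The update rule above can be sketched on a toy problem. This is a minimal, illustrative REINFORCE-style implementation on a 3-armed bandit (all names and constants here are our own choices, not from any particular library): the softmax policy's log-probability gradient is `one_hot(a) - pi`, and each update scales that gradient by the advantage, a running-mean baseline subtracted from the observed reward. A small helper at the end sketches PPO's clipped surrogate term, which caps the update when the policy ratio drifts outside `[1 - eps, 1 + eps]`.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hidden per-arm reward means (toy environment)
theta = np.zeros(3)                     # policy parameters: one logit per action
baseline, lr = 0.0, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)                         # sample an action from the policy
    r = true_means[a] + 0.1 * rng.standard_normal() # noisy reward
    advantage = r - baseline                        # crude baseline: running mean reward
    grad_log_pi = -pi                               # d/dtheta log pi(a) for a softmax policy
    grad_log_pi[a] += 1.0
    theta += lr * advantage * grad_log_pi           # gradient ascent on expected reward
    baseline += 0.01 * (r - baseline)

# The policy should concentrate probability on the best arm.
print(softmax(theta).round(3))

def ppo_clip_term(ratio, adv, eps=0.2):
    # PPO's clipped surrogate (sketch): limits the step when the probability
    # ratio pi_new/pi_old moves outside [1 - eps, 1 + eps].
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
```

The per-step update is exactly the policy gradient theorem with a Monte Carlo sample of one action: positive advantage pushes `theta[a]` up, negative pushes it down.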
Key Sources
- proximal-policy-optimization
- grpo-deepseekmath-group-relative-policy-optimization
- learning-to-summarize-human-feedback