What It Is
A class of reinforcement learning algorithms that directly optimize a parameterized policy by computing the gradient of expected reward with respect to the policy parameters — updating toward actions that led to higher-than-expected outcomes.
Why It Matters
Policy gradient methods are the foundation of RLHF. Unlike value-based methods such as Q-learning, they work directly in high-dimensional action spaces like token vocabularies — making them the natural choice for fine-tuning language models with reward signals.
How It Works
The core objective is to maximize expected reward: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. The policy gradient theorem gives the gradient:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$

where $\hat{A}_t$ is the advantage estimate — how much better this action was than expected. Actions with positive advantage get higher probability; actions with negative advantage get lower probability. The challenge is that naive policy gradient updates can be too large, destabilizing training — motivating constrained variants like TRPO and clipped variants like PPO.
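The update rule above can be sketched on a toy problem. This is a minimal, illustrative REINFORCE-style implementation on a 3-armed bandit (all names and constants here are our own choices, not from any particular library): the softmax policy's log-probability gradient is `one_hot(a) - pi`, and each update scales that gradient by the advantage, a running-mean baseline subtracted from the observed reward. A small helper at the end sketches PPO's clipped surrogate term, which caps the update when the policy ratio drifts outside `[1 - eps, 1 + eps]`.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hidden per-arm reward means (toy environment)
theta = np.zeros(3)                     # policy parameters: one logit per action
baseline, lr = 0.0, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)                         # sample an action from the policy
    r = true_means[a] + 0.1 * rng.standard_normal() # noisy reward
    advantage = r - baseline                        # crude baseline: running mean reward
    grad_log_pi = -pi                               # d/dtheta log pi(a) for a softmax policy
    grad_log_pi[a] += 1.0
    theta += lr * advantage * grad_log_pi           # gradient ascent on expected reward
    baseline += 0.01 * (r - baseline)

# The policy should concentrate probability on the best arm.
print(softmax(theta).round(3))

def ppo_clip_term(ratio, adv, eps=0.2):
    # PPO's clipped surrogate (sketch): limits the step when the probability
    # ratio pi_new/pi_old moves outside [1 - eps, 1 + eps].
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
```

The per-step update is exactly the policy gradient theorem with a Monte Carlo sample of one action: positive advantage pushes `theta[a]` up, negative pushes it down.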
Key Sources
- proximal-policy-optimization
- grpo-deepseekmath-group-relative-policy-optimization
- learning-to-summarize-human-feedback