What It Is

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for training language models that eliminates the value model (critic) required by PPO. Instead of estimating advantage with a trained value function, GRPO generates a group of responses for each prompt, then computes relative advantages by normalizing each response’s reward against the group’s mean and standard deviation. It was introduced in DeepSeekMath (Shao et al., 2024) and served as the primary RL algorithm in DeepSeek-R1.

Why It Matters

PPO-based RLHF requires four large models in memory simultaneously: the policy, the critic (value model, typically same size as policy), the reference policy, and the reward model. For a 7B model, this is effectively 28B+ parameters. GRPO eliminates the critic, reducing to three models. For 70B models, this difference is decisive. Beyond memory savings, GRPO turns out to achieve comparable or better performance on mathematical reasoning tasks — the group-relative advantage estimate is often more accurate than the biased estimates from an early-stage value model.

The Mechanism

Standard PPO advantage estimation requires a value model V(s) to compute:

A_t = r_t + γV(s_{t+1}) - V(s_t)

This one-step TD error estimates how much better a given action was than the critic’s expected baseline (full PPO typically aggregates these terms via GAE). The value model must be trained simultaneously, adding memory, compute, and a second optimization problem.
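As a minimal sketch with made-up values (not from the source), the TD-style advantage above is just:

```python
# One-step TD advantage, the building block of PPO's advantage estimation.
def td_advantage(reward: float, v_s: float, v_next: float, gamma: float = 0.99) -> float:
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_s

# If the realized reward plus discounted next-state value exceeds the
# critic's estimate of the current state, the advantage is positive.
a = td_advantage(reward=1.0, v_s=0.5, v_next=0.6)  # 1.0 + 0.99*0.6 - 0.5 = 1.094
```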

GRPO advantage estimation replaces V(s) with group statistics:

For each prompt q, sample G responses {o_1, …, o_G} from the current policy. Score each with a reward model or rule-based check to get {r_1, …, r_G}. Compute advantages by group standardization:

Â_i = (r_i - mean(r_1,…,r_G)) / std(r_1,…,r_G)

The GRPO objective (per-token, averaged over the group):

L_GRPO = E[min(r_i(θ) · Â_i, clip(r_i(θ), 1-ε, 1+ε) · Â_i)] - β · KL(π_θ || π_ref)

where r_i(θ) = π_θ(o_i|q) / π_θ_old(o_i|q) is the probability ratio (same as PPO), and β is the KL penalty weight.
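A compact sketch of this objective, treating each response’s ratio at the sequence level for brevity (real implementations work per token and over log-prob tensors); the KL term here uses the nonnegative “k3” estimator from the DeepSeekMath paper, and the hyperparameter defaults are illustrative assumptions:

```python
import math

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty to the reference policy.

    Each list holds per-response sequence log-probs under the current,
    old, and reference policies; advantages are group-standardized rewards.
    Returns a loss to minimize (the negated objective).
    """
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)                 # r_i(theta)
        clipped = max(min(ratio, 1 + eps), 1 - eps)       # clip(r_i, 1-eps, 1+eps)
        surrogate = min(ratio * adv, clipped * adv)
        # "k3" KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, always >= 0
        r = math.exp(lp_ref - lp_new)
        kl = r - (lp_ref - lp_new) - 1
        total += surrogate - beta * kl
    return -total / len(advantages)
```

With identical policies the ratios are 1 and the KL term vanishes, so the loss reduces to the negative mean advantage.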

GRPO ADVANTAGE COMPUTATION (G=4 responses):
  Response 1: reward = 0.9
  Response 2: reward = 0.3
  Response 3: reward = 0.8
  Response 4: reward = 0.2

  mean = 0.55, std ≈ 0.304 (population std)

  Â_1 = (0.9 - 0.55) / 0.304 ≈ +1.15  ← reinforce this response
  Â_2 = (0.3 - 0.55) / 0.304 ≈ -0.82  ← discourage this response
  Â_3 = (0.8 - 0.55) / 0.304 ≈ +0.82  ← reinforce
  Â_4 = (0.2 - 0.55) / 0.304 ≈ -1.15  ← strongly discourage

The policy is updated to increase probability of above-average responses and decrease probability of below-average responses within each group.
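The worked example above can be reproduced in a few lines. This uses the population standard deviation plus a small epsilon for stability; both choices are common but implementation-specific assumptions:

```python
def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group: (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g   # population variance
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

advs = group_advantages([0.9, 0.3, 0.8, 0.2])
# roughly [+1.15, -0.82, +0.82, -1.15]; the advantages always sum to zero
```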

Why Group Normalization Works

The group mean is a simple empirical baseline: it represents what this specific model at this training step actually produces for this type of problem. This sidesteps the early-training problem with value models: a newly initialized critic is inaccurate and produces noisy, biased advantage estimates. GRPO’s empirical baseline is always calibrated to the current policy’s actual behavior.

Group normalization also naturally adapts to reward scale. If a batch of easy problems all get reward ≈ 1.0, the standardized advantages will be near zero — the policy receives little gradient signal, which is correct (the model already solves these problems). Hard problems where rewards vary from 0 to 1 produce large advantage differences and strong learning signal.
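This adaptive behavior is easy to demonstrate (the epsilon guard for the zero-variance case is an implementation assumption, not part of the formula as stated):

```python
def standardized(rewards, eps=1e-8):
    mu = sum(rewards) / len(rewards)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sd + eps) for r in rewards]  # eps guards zero variance

# Easy batch: every response already correct -> zero advantages, no gradient signal.
easy = standardized([1.0, 1.0, 1.0, 1.0])
# Hard batch: mixed outcomes -> large +/- advantages, strong learning signal.
hard = standardized([1.0, 0.0, 0.0, 1.0])
```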

Reward Design for GRPO

GRPO works best with verifiable rewards:

  • Math: check final answer against ground truth (+1 correct, 0 incorrect)
  • Code: execute against unit tests (fraction of tests passing)
  • Format: small reward for following required output format (e.g., using <think> tags)

Rule-based verifiable rewards are more stable than learned reward models during RL training. They don’t shift during training, don’t require a separate model, and can’t be “hacked” in the same way a learned RM can.
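A minimal sketch of such rule-based rewards; the “Answer:” marker and the scoring values are conventions chosen for this example, not DeepSeek’s actual checkers:

```python
import re

def math_reward(response: str, ground_truth: str) -> float:
    """+1 if the text after a literal 'Answer:' marker matches ground truth, else 0."""
    m = re.search(r"Answer:\s*(\S+)", response)
    return 1.0 if m and m.group(1) == ground_truth else 0.0

def format_reward(response: str) -> float:
    """Small bonus for wrapping reasoning in <think> ... </think> tags."""
    return 0.1 if re.search(r"<think>.*?</think>", response, re.S) else 0.0

resp = "<think>2+2=4</think> Answer: 4"
total = math_reward(resp, "4") + format_reward(resp)  # 1.1
```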

For tasks without verifiable rewards (creative writing, general helpfulness), a learned reward model is still needed; GRPO then simply uses the RM’s scores as the group rewards.

Comparison to PPO

  Property              PPO                                 GRPO
  --------------------  ----------------------------------  --------------------------------------
  Value model           Required (same size as policy)      Not needed
  Memory                ~4 models                           ~3 models
  Advantage estimation  V(s) from trained critic            Group reward statistics
  Bias                  High early on (untrained critic)    Low (empirical baseline)
  Credit assignment     Per-token                           Per-response (treated as unit)
  Best for              General RL tasks                    LLM generation with verifiable rewards

Usage in DeepSeek-R1

GRPO was the RL algorithm used in:

  1. DeepSeekMath (the paper that introduced GRPO): trained a 7B model to 51.7% on MATH — approaching GPT-4 performance
  2. DeepSeek-R1-Zero: pure RL from base model using GRPO with outcome rewards, producing emergent reasoning behaviors
  3. DeepSeek-R1 full pipeline: GRPO applied at scale to DeepSeek-V3-Base, enabling the reasoning capabilities that matched OpenAI o1

Key Sources

  • ppo — GRPO is a memory-efficient alternative; eliminates PPO’s value model
  • rlhf — GRPO is used in the RL stage of alignment pipelines
  • reasoning-rl — GRPO is the key algorithm enabling RL-based reasoning
  • reward-model — verifiable reward functions replace learned reward models for math/code

Open Questions

  • Does GRPO underperform PPO on tasks with non-verifiable rewards, where group reward variance is noisier?
  • What is the optimal group size G for different task types?
  • Can per-token credit assignment (PPO’s strength) be incorporated into a value-model-free framework?
  • Does GRPO’s advantage normalization cause any instability when all group responses get the same reward (zero variance)?