Training language models with reinforcement learning has historically required two separate models: the policy (the LLM being improved) and a critic or value model that estimates how good a given state is. In PPO (Proximal Policy Optimization), the value model is usually the same size as the policy — meaning RL training for a 7B parameter model requires running two 7B models simultaneously, plus the reference policy and reward model. That’s 4× the memory of just running the model. Group Relative Policy Optimization (GRPO), introduced in DeepSeekMath (Shao et al., 2024), eliminates the value model entirely, cutting memory roughly in half while achieving better or comparable performance on mathematical reasoning tasks. This is the training algorithm that powered DeepSeek-R1.

The core idea

The analogy: In competitive exam preparation, one way to evaluate whether a student is improving is to compare their performance to a baseline (are you doing better than last week?). A more efficient method: give the student the same problem multiple times with different approaches, and use relative performance across those attempts to assess which strategies are working. You don’t need a separate “student quality estimator” — you estimate quality directly from the group of attempts.

GRPO does exactly this for LLM training. Instead of maintaining a separate value model that estimates “how good is this response?”, GRPO:

  1. Generates a group of G outputs for the same input
  2. Scores each output with a reward model
  3. Uses the group’s reward distribution to compute relative advantages (how much better/worse than the group average is each output?)
  4. Trains the policy to increase probability of above-average outputs and decrease probability of below-average ones

No separate value model. The “critic” is replaced by within-group comparison.
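The four steps above can be sketched end to end with toy stand-ins. Note that `generate` and `reward_fn` here are placeholders for sampling from the LLM and scoring responses; they are assumptions for illustration, not a real API:

```python
import random
import statistics

random.seed(0)

# Toy stand-ins: in a real setup, generate() samples a completion from the
# LLM and reward_fn() scores it with a reward model or rule-based checker.
def generate(question):
    return random.random()        # pretend each "response" is a quality score

def reward_fn(output):
    return output                 # pretend reward equals that quality

def grpo_advantages(question, G=8):
    outputs = [generate(question) for _ in range(G)]   # 1. group of G outputs
    rewards = [reward_fn(o) for o in outputs]          # 2. score each output
    mu = statistics.mean(rewards)                      # 3. group-relative
    sigma = statistics.pstdev(rewards)                 #    standardization
    return [(r - mu) / sigma for r in rewards]         # 4. feed into the update

advantages = grpo_advantages("some math question")
print([round(a, 2) for a in advantages])
```

Standardized advantages always sum to approximately zero within a group, so some responses in every group are pushed up and the rest pushed down.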

The mechanism, step by step

Standard PPO recap:

PPO trains a policy to maximize reward by optimizing:

$$J_{PPO}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_t\right)\right] - \beta\,\mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]$$

where:

  • $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) \,/\, \pi_{\theta_{old}}(o_t \mid q, o_{<t})$ — probability ratio (how much the new policy differs from the old)
  • $\hat{A}_t$ — advantage estimate (requires a value model to compute)
  • KL term — penalty for deviating too far from the reference policy

The advantage is $\hat{A}_t \approx R_t - V(s_t)$, where $V(s_t)$ is the value model's estimate of expected return from state $s_t$. Training the value model requires a separate forward/backward pass, adds parameters, and is a separate training problem.
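The clipped surrogate can be written out directly in a few lines. A minimal numeric sketch with illustrative values (in practice the advantage would come from the learned value model):

```python
import numpy as np

# One-token sketch of PPO's clipped surrogate (illustrative values; in
# practice the advantage comes from a learned value model, e.g. via GAE).
def ppo_surrogate(logp_new, logp_old, advantage, eps=0.2):
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    # pessimistic min(): large policy moves get no extra credit
    return np.minimum(ratio * advantage, clipped * advantage)

# A positive advantage with an inflated ratio is clipped at 1 + eps:
print(ppo_surrogate(logp_new=0.5, logp_old=0.0, advantage=1.0))  # 1.2, not e^0.5
```

The `min` makes the objective pessimistic: when the ratio grows past $1+\varepsilon$ on a positive advantage, the gradient incentive is cut off, which keeps updates close to the old policy.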

GRPO:

For each question $q$, sample $G$ outputs $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_{\theta_{old}}$.

Compute rewards $\{r_1, r_2, \ldots, r_G\}$ for each output (from a reward model or rule-based check).

Compute normalized advantages by group standardization:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_1, \ldots, r_G\})}{\mathrm{std}(\{r_1, \ldots, r_G\})}$$

Then optimize:

$$J_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}\,\hat{A}_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]$$
The advantage is computed from the within-group reward distribution — no value model needed.
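Putting the pieces together, here is a hedged, sequence-level sketch of the objective (the paper's version averages the clipped term over tokens; the `eps` and `beta` values are illustrative, not prescribed):

```python
import numpy as np

# Sequence-level sketch of the GRPO objective for one group of G outputs.
def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / r.std()                      # group-relative advantages
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    surrogate = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - eps, 1 + eps) * adv)
    # k3 KL estimator from the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    ref_ratio = np.exp(np.asarray(logp_ref) - np.asarray(logp_new))
    kl = ref_ratio - np.log(ref_ratio) - 1
    return float((surrogate - beta * kl).mean())        # quantity to maximize

# With identical policies, advantages average out and the KL term vanishes:
obj = grpo_objective(np.zeros(4), np.zeros(4), np.zeros(4), [0.9, 0.3, 0.8, 0.2])
print(obj)  # ~0 (up to float error)
```

Shifting probability mass toward the above-average outputs makes the objective positive, which is exactly the update direction GRPO rewards.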

STANDARD PPO:
  One question → [Policy generates 1 response] → reward → advantage via VALUE MODEL
  Value model must be trained simultaneously
  Memory: 4 models (policy, value, reference, reward)

GRPO:
  One question → [Policy generates G=8 responses] → G rewards
                → advantage = (reward - group_mean) / group_std
  No value model
  Memory: 3 models (policy, reference, reward)

ADVANTAGE COMPUTATION EXAMPLE (G=4 responses):
  Response 1: reward = 0.9 → above average → policy should increase its probability
  Response 2: reward = 0.3 → below average → policy should decrease its probability
  Response 3: reward = 0.8 → above average
  Response 4: reward = 0.2 → below average

  mean = 0.55, std ≈ 0.304 (population std)
  A_hat_1 = (0.9 - 0.55) / 0.304 = +1.15
  A_hat_2 = (0.3 - 0.55) / 0.304 = -0.82
  A_hat_3 = (0.8 - 0.55) / 0.304 = +0.82
  A_hat_4 = (0.2 - 0.55) / 0.304 = -1.15
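The worked example is easy to reproduce with NumPy (note that `np.std` defaults to the population standard deviation):

```python
import numpy as np

rewards = np.array([0.9, 0.3, 0.8, 0.2])
mean, std = rewards.mean(), rewards.std()            # np.std: population std
adv = (rewards - mean) / std
print(round(float(mean), 2), round(float(std), 3))   # 0.55 0.304
print(np.round(adv, 2))                              # [ 1.15 -0.82  0.82 -1.15]
```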

Why normalize within the group?

The group mean and std are natural baselines: they represent what this model at this training stage typically achieves on this type of problem. An advantage of $\hat{A}_i = +1$ means "this response was 1 standard deviation above what the model usually produces" — a meaningful relative comparison.

This avoids a key PPO problem: the value model can be badly biased in the early stages of training (it is still learning what "good" means even as the policy shifts underneath it). GRPO's within-group baseline requires no learning at all: it uses the current model's actual outputs as the reference, so it automatically tracks the policy as it improves.

Reward design:

For mathematical reasoning in DeepSeekMath, rewards are rule-based:

  • Accuracy reward: +1 if the final answer is correct (verified against ground truth), 0 otherwise
  • Format reward: small reward for following the expected output format (showing work, using LaTeX for math)

No learned reward model is needed for math — you can verify answers definitively. This is a crucial design choice: rule-based rewards are more stable than learned reward models during RL training.
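A minimal sketch of such a rule-based reward. The regex, the `\boxed{}` answer convention, and the reward weights are assumptions for illustration, not DeepSeekMath's exact implementation:

```python
import re

def extract_final_answer(response: str):
    # Assumed convention: the final answer appears in \boxed{...}
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return m.group(1).strip() if m else None

def math_reward(response: str, ground_truth: str) -> float:
    answer = extract_final_answer(response)
    accuracy = 1.0 if answer == ground_truth else 0.0   # verifiable correctness
    fmt = 0.1 if answer is not None else 0.0            # small format bonus
    return accuracy + fmt

print(math_reward(r"... so the answer is \boxed{42}", "42"))  # 1.1
print(math_reward("the answer is 42", "42"))                  # 0.0
```

Because correctness is checked programmatically, the reward cannot drift or be gamed the way a learned reward model can during RL training.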

Find the instinct

Why PPO is expensive for LLMs:

PPO was designed for RL in continuous action spaces (robotics, game-playing). Adapting it to LLMs is expensive because:

  1. The “state” is the entire conversation so far (long context)
  2. The “action” is the next token (vocabulary of 32K+ tokens)
  3. The value model must process the entire sequence to estimate value at each token position
  4. Training LLMs requires large batches for stability

The value model alone roughly doubles training memory. For a 7B model RL run, you need ~7B (policy) + ~7B (value) + ~7B (reference) + reward model ≈ 28B+ parameters in memory.

GRPO reduces this to ~21B (no value model). For 70B models, the savings are decisive.

The deeper insight: LLMs don’t need per-token value estimates:

PPO estimates advantages at every token position, reflecting the insight that in RL, each action (token) affects future rewards. But for language generation, the reward is typically assigned to the complete response, not individual tokens. The token-level credit assignment problem (which token “caused” the good/bad outcome?) is artificial and hard to solve accurately.

GRPO sidesteps credit assignment entirely by treating each response as a unit and asking “was this response better or worse than what else could have been generated?” This is a coarser signal but easier to estimate accurately.

Results

DeepSeekMath 7B (trained with GRPO):

On MATH (competition math benchmark):

  • DeepSeekMath 7B: 51.7% — approaching GPT-4 level performance from a 7B model
  • Previous SOTA open-source 7B math model: ~34%
  • GPT-4 baseline: ~52%

On GSM8K (grade school math):

  • DeepSeekMath 7B: 88.2%
  • Supervised fine-tuning (no RL): ~74%

The RL step (GRPO) adds roughly 8-15 points on math benchmarks over SFT alone. The improvement comes from the model learning to verify, retry, and reason more carefully — not just from memorizing solution patterns.

Ablations comparing GRPO to PPO:

The paper shows GRPO achieves similar or better benchmark performance vs. PPO while using significantly less memory and training time. The value model’s bias on complex math problems actually hurts PPO in some settings — GRPO’s simpler advantage estimation turns out to be more effective here.

GRPO in DeepSeek-R1:

DeepSeek-R1 (2025) uses GRPO as its primary RL training algorithm, scaled to the DeepSeek-V3-Base model. The same core algorithm that was validated on 7B math models turns out to work at 670B+ parameter scale, producing the emergent reasoning behaviors described in R1.

Practical implications

GRPO is now the preferred RL algorithm for LLM training when:

  • You have verifiable rewards (math, code execution, logic puzzles)
  • You want to minimize memory overhead vs. PPO
  • You’re training models where the response is evaluated as a unit (not per-token)

The combination of GRPO + rule-based rewards + a strong base model is the recipe that produced DeepSeek-R1’s reasoning capabilities. Researchers have reported success applying GRPO to other verifiable tasks: code generation (run the code, check if tests pass), formal verification, and structured prediction tasks.

For tasks without verifiable rewards (creative writing, general helpfulness), you still need a learned reward model, and the group-based advantage estimation may be noisier. In those settings, DPO or PPO with a good reward model may still be preferable.

Citation

arXiv:2402.03300

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint. https://arxiv.org/abs/2402.03300