January 2025. DeepSeek releases a model that matches OpenAI’s o1 on math and coding benchmarks — and open-sources it. The ML community goes into shock, partly because of the quality, and partly because of what the training recipe revealed: you don’t need carefully curated chains-of-thought written by humans to teach a model to reason. You just need a reward signal and reinforcement learning. The model will figure out how to think.
The core idea
The analogy: You want to teach someone to solve chess puzzles. One approach: show them thousands of annotated solutions with expert commentary on each move (“I chose Nf3 here because it controls the center and threatens…”). Another approach: put them in a room with a chessboard and tell them “you get a point when you solve the puzzle correctly.” With enough time and practice, they discover the reasoning strategies themselves.
DeepSeek-R1 argues for the second approach, and shows it works at LLM scale. Previous reasoning efforts (reportedly including OpenAI’s o1) relied on process supervision: human-annotated step-by-step solutions that showed the model how to think, not just whether the final answer was right. DeepSeek-R1 shows you can skip that, using pure RL with outcome rewards (correct/incorrect) plus a format reward (thinking in a structured way) to elicit sophisticated reasoning behaviors.
The training pipeline
DeepSeek-R1 has two major variants: R1-Zero (pure RL from base model, no SFT) and R1 (the full pipeline).
DeepSeek-R1-Zero (the experiment that changed everything):
Start with DeepSeek-V3-Base (a strong pretrained LLM). Apply RL with two rewards:
- Accuracy reward: is the final answer correct? (checked against ground truth for math, unit tests for code)
- Format reward: does the response use `<think>...</think>` tags for reasoning before giving the final answer?
No human-labeled reasoning traces. No step-by-step guidance. Just: “here’s a problem, think, answer, and we’ll tell you if you got it right.”
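The two rewards can be sketched as a reward function. This is a minimal illustration, not DeepSeek’s implementation: the function names and the exact-match answer check are assumptions.

```python
import re

# A response earns the format reward if it reasons inside <think>...</think>
# before emitting a final answer.
THINK_PATTERN = re.compile(r"^<think>.+</think>\s*\S", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response thinks in <think>...</think> tags before answering."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the final answer (the text after </think>) matches ground truth."""
    answer = response.split("</think>")[-1].strip()
    return 1.0 if answer == ground_truth else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Outcome reward + format reward: the entire training signal for R1-Zero.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```

In practice the accuracy check is a math-expression verifier or a unit-test runner rather than string equality, but the point stands: no part of the signal looks inside the reasoning trace.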
What emerged:
“We observe that DeepSeek-R1-Zero naturally acquires the ability to allocate more thinking time by reevaluating and reflecting upon its thinking process.”
The model spontaneously developed behaviors that other labs had reportedly engineered by hand:
- Self-verification: the model starts double-checking its own work
- Backtracking: when a reasoning path fails, it backs up and tries another approach
- Dynamic effort allocation: harder problems get more reasoning tokens
The “aha moment” is documented in the paper: at a certain point in training, R1-Zero’s responses suddenly become much longer, and it starts using phrases like “Wait, let me reconsider” and “I made an error earlier.” These behaviors were not taught — they emerged from the RL signal.
The full DeepSeek-R1 pipeline:
R1-Zero works but has problems: the reasoning is sometimes in mixed languages (Chinese/English randomly switching), and it can be hard to read. The full R1 pipeline adds structure:
1. Cold-start SFT: collect a small set (a few thousand) of “long chain-of-thought” examples written in the desired format, and fine-tune the base model on them
2. Reasoning-focused RL: run GRPO (see grpo-deepseekmath-group-relative-policy-optimization) with accuracy + format rewards
3. Rejection sampling + SFT: generate many solutions per problem, keep only the correct ones, and fine-tune again
4. Multi-task RL: continue RL on both reasoning tasks and general helpfulness tasks, to avoid an “alignment tax”
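The rejection-sampling step reduces to a simple filter over sampled solutions. A sketch, where `sample_solutions` and `is_correct` are hypothetical stand-ins for the policy’s sampler and the answer checker:

```python
# Rejection sampling: keep only verified-correct generations as new SFT data.
# `sample_solutions` and `is_correct` are assumed interfaces, not DeepSeek's code.
def build_sft_dataset(problems, sample_solutions, is_correct, k=16):
    """Sample k candidate solutions per problem; keep the verified ones."""
    dataset = []
    for problem in problems:
        for solution in sample_solutions(problem, n=k):
            if is_correct(problem, solution):
                dataset.append({"prompt": problem, "completion": solution})
    return dataset
```

The filter is what makes this different from ordinary SFT: the model is fine-tuned only on reasoning it produced itself and that was externally verified.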
The GRPO algorithm (Group Relative Policy Optimization) is central to steps 2 and 4. It’s described in the DeepSeekMath paper and computes policy gradients from groups of sampled responses without needing a trained value model. The total reward is $r = r_{\text{accuracy}} + r_{\text{format}}$, and GRPO optimizes via the objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,A_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon\right)A_i\right) - \beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right)\right]$$

where $A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$ is the group-normalized advantage and $\beta$ controls the KL penalty toward the reference policy.
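The group-relative advantage is the piece that replaces the value model, and it is just per-group reward normalization. A numerical sketch (the real trainer applies these advantages to per-token log-probabilities, which is omitted here):

```python
# GRPO's group-normalized advantage: each sampled response is scored relative
# to the other responses for the same prompt. No learned critic needed.
def group_advantages(rewards, eps=1e-8):
    """Normalize rewards against their sampling group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that beat their group average get positive advantage and are reinforced; the rest are pushed down. If every response in the group gets the same reward, every advantage is (near) zero and the group contributes no gradient.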
Find the instinct
Why does RL elicit reasoning when SFT doesn’t?
Supervised fine-tuning teaches the model to imitate the reasoning in the training data. If you show it 1000 solved math problems, it learns to generate text that looks like how those problems were solved. But it’s pattern-matching the form, not necessarily learning to reason.
RL with outcome rewards creates a different pressure: the model must actually solve the problem to get the reward. It can’t pattern-match its way to a correct answer if the problem is genuinely hard and the answer is checked. So the model is forced to develop genuine problem-solving strategies.
The key enabling factor: scale. Smaller models subjected to the same RL pressure don’t develop these emergent reasoning behaviors — they just memorize or fail. But at the scale of DeepSeek-V3-Base (a MoE model with 671B total parameters), the model has enough representational capacity to actually develop new cognitive strategies under RL pressure.
“The reasoning abilities of LLMs can be incentivized through pure reinforcement learning, obviating the need for human-labeled reasoning trajectories.”
The format reward matters more than it looks:
The <think>...</think> format reward might seem like a cosmetic constraint. It’s not. By requiring the model to write out its reasoning before giving the final answer, the format reward:
- Forces the model to dedicate tokens to thinking (more compute at inference time → better answers)
- Creates a structure the reward model can parse
- Enables the reasoning trace to be used for distillation into smaller models
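Because the structure is machine-parseable, separating the trace from the answer is trivial, which is exactly what the reward check and the distillation pipeline need. A hypothetical parser (the tag names follow the paper’s format; the function itself is an illustration):

```python
import re

def split_trace(response: str):
    """Split a <think>-formatted response into (reasoning, final_answer)."""
    m = re.search(r"<think>(.*?)</think>(.*)", response, re.DOTALL)
    if m is None:
        # Malformed response: no trace, treat everything as the answer.
        return None, response.strip()
    return m.group(1).strip(), m.group(2).strip()
```

The same split serves both consumers: the reward function only grades the second element, and distillation keeps both so the student learns to produce the trace too.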
Performance
On mathematical reasoning (AIME 2024, competition math):
- DeepSeek-R1: 79.8% pass@1
- OpenAI o1-1217: 79.2%
On coding (Codeforces, competitive programming):
- R1 reaches the 96.3rd percentile of human Codeforces ratings
On MATH-500 (standard math benchmark):
- R1: 97.3% (vs o1: 96.4%)
On general capability (MMLU, GPQA):
- R1 is competitive with Claude 3.5 Sonnet and GPT-4o on general benchmarks
Crucially: open weights, openly described training pipeline. This enabled the community to fine-tune R1 variants and distill reasoning into smaller models (1.5B, 7B, 14B, 32B versions were released).
Practical implications
Distillation works astonishingly well. The long reasoning traces R1 generates can be used as training data for smaller models. DeepSeek-R1-Distill-Qwen-7B, trained on R1’s outputs, outperforms GPT-4o on math benchmarks despite being a far smaller model. The reasoning patterns, once discovered by the large model, transfer efficiently via imitation.
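Mechanically, distillation here is plain SFT on teacher outputs: each R1 trace becomes one supervised example for the student, reasoning included. A sketch with an assumed chat-style schema (the field names are illustrative, not DeepSeek’s actual format):

```python
# Turn a teacher (R1) response into one SFT example for a small student model.
# The student imitates the full trace, <think> tags and all.
def to_sft_example(problem: str, teacher_response: str) -> dict:
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": teacher_response},
        ]
    }
```

The notable design choice is keeping the `<think>` trace in the target: the student learns the reasoning behavior by imitation, without any RL of its own.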
Test-time compute scaling. R1 demonstrates that inference-time reasoning (longer thinking traces) trades compute for quality. This opened a new scaling axis: instead of only scaling training compute, you can scale inference compute, and performance improves predictably.
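R1 spends extra inference compute by thinking longer, but the simplest way to see the compute-for-quality trade is self-consistency-style majority voting over sampled answers — a related technique used for the paper’s cons@64 numbers, shown here as an illustration. `sample_answer` is a hypothetical stand-in for one model call:

```python
from collections import Counter

def majority_vote(sample_answer, problem, n=8):
    """Sample n answers and return the most common one (more n, more compute)."""
    votes = Counter(sample_answer(problem) for _ in range(n))
    return votes.most_common(1)[0][0]
```

If the model is right more often than any single wrong answer is produced, accuracy rises with `n`: a knob that converts inference compute into quality with no retraining.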
When to use reasoning models: Problems with verifiable correct answers (math, code, formal logic). Problems that benefit from multi-step reasoning. Not recommended for tasks where “thinking out loud” adds no value (simple lookups, classification).
Connections
- reasoning-rl — the approach this paper introduces and validates at scale
- grpo-deepseekmath-group-relative-policy-optimization — the RL algorithm (GRPO) used in DeepSeek-R1 training
- chain-of-thought-prompting — CoT prompting was the precursor; R1 internalizes CoT via RL
- rlhf — R1’s RL pipeline extends RLHF ideas with outcome-based rewards
Citation
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature 645, 633–638. https://arxiv.org/abs/2501.12948