What It Is

RL for reasoning is the use of reinforcement learning with outcome-based rewards to train language models to develop extended, structured reasoning capabilities. Rather than teaching a model how to reason via supervised imitation of labeled chains-of-thought, RL-based reasoning training provides only a reward signal (correct/incorrect final answer) and lets the model discover effective reasoning strategies through trial and error. The result: models that allocate more inference-time compute to harder problems, spontaneously develop self-verification and backtracking behaviors, and generalize better than SFT-trained reasoning models.

Test-time compute scaling refers to the observation that with such models, spending more tokens on “thinking” before answering predictably improves accuracy — creating a new scaling axis orthogonal to training compute.

Why It Matters

Before DeepSeek-R1 (January 2025), the dominant view was that reasoning improvements required careful human annotation of reasoning traces (process supervision). R1 demonstrated that pure RL from outcome rewards, applied to a capable base model, is sufficient to elicit complex reasoning behaviors. This opens a path to improving reasoning without expensive human labeling and suggests that test-time compute is a viable complement to training-time compute.

The Core Insight: RL Forces Real Reasoning

Supervised fine-tuning teaches a model to imitate the format of reasoning in training data. If the training set shows “Step 1: compute X… Step 2: conclude Y…”, the model learns to generate text that looks like this pattern, regardless of whether the internal computation is actually solving the problem.

RL with outcome rewards creates a different pressure: the final answer is checked against ground truth. Pattern-matching to look like reasoning doesn’t work when answers are verified. The model must actually solve the problem. Given enough optimization pressure at sufficient model scale, models develop:

  • Self-verification: double-checking intermediate results
  • Backtracking: abandoning failed reasoning paths and trying alternatives
  • Dynamic effort allocation: producing longer reasoning traces for harder problems
  • Error detection: recognizing when an intermediate conclusion conflicts with something computed earlier

These behaviors were not in the prompt or training format — they emerged from the optimization pressure.

Why Scale Matters

The emergent reasoning behaviors only appear in sufficiently large models. At 1.5B parameters, applying the same RL procedure produces no meaningful improvement. At 7B, there is modest gain. At the scale of DeepSeek-V3-Base (a 671B-parameter MoE, ~37B active per token), the full suite of reasoning behaviors emerges.

The intuition: the model needs sufficient representational capacity to actually develop new cognitive strategies under RL pressure, rather than simply memorizing or collapsing. This is analogous to phase transitions in scaling: capability suddenly appears past a threshold.

Test-Time Compute Scaling

A critical empirical finding from OpenAI o1 and DeepSeek-R1: more reasoning tokens → better answers, and this relationship is smooth and predictable.

EASY QUESTION: "What is 15% of 80?"
  Short reasoning (few tokens): correct
  Long reasoning: also correct, no benefit

HARD PROBLEM: Olympiad-level math problem (accuracy figures illustrative)
  Short reasoning (100 tokens): ~20% correct
  Medium reasoning (500 tokens): ~50% correct
  Long reasoning (2000+ tokens): ~80% correct

This creates a new resource tradeoff: at inference time, you can spend more compute on harder problems and less on easier ones. The model can allocate its “thinking budget” dynamically.
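
One simple way to exploit this tradeoff at inference time is to escalate the token budget only when cheap attempts disagree. The sketch below uses majority voting across samples as the stopping rule and a stubbed model call — both the stopping rule and the `fake_generate` stub are illustrative assumptions, not a method from the R1 paper:

```python
from collections import Counter

def solve_with_budget(problem, generate, budgets=(100, 500, 2000), k=3):
    """Sample k answers at each budget; stop as soon as a majority of
    the k samples agree, otherwise escalate to a larger budget."""
    for budget in budgets:
        answers = [generate(problem, budget) for _ in range(k)]
        answer, count = Counter(answers).most_common(1)[0]
        if count > k // 2:  # majority agreement: confident enough to stop
            return answer, budget
    return answer, budgets[-1]  # no consensus: return the last attempt

# Stub standing in for a model call: easy problems converge at any
# budget, hard ones only stabilize once the budget is large.
def fake_generate(problem, budget):
    if problem == "easy":
        return "12"
    if budget >= 2000:
        return "42"
    fake_generate.calls = getattr(fake_generate, "calls", 0) + 1
    return f"guess-{fake_generate.calls}"  # low budget: unstable answers

easy_answer, easy_budget = solve_with_budget("easy", fake_generate)
hard_answer, hard_budget = solve_with_budget("hard", fake_generate)
```

The easy problem stops at the smallest budget; the hard one escalates to the largest, mirroring the accuracy-vs-tokens pattern above.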

Scaling test-time compute is distinct from scaling training compute:

  • Training compute scaling: bigger model or more data → better general capability (Chinchilla scaling laws)
  • Test-time compute scaling: longer reasoning chain for a specific problem → better answer to that problem

Both axes are valuable. The combined insight: you can improve performance on hard problems at inference time without retraining.

Training Pipeline

The standard RL reasoning pipeline (from DeepSeek-R1):

  1. Cold start SFT (optional): a small set of human-written long-chain-of-thought examples sets the format. Without this, reasoning may be inconsistent or in mixed languages.

  2. Reasoning-focused RL: run GRPO (or PPO) with:

    • Accuracy reward: +1 for correct final answer, 0 otherwise (verified against ground truth for math; executed against unit tests for code)
    • Format reward: small reward for using the required reasoning format (e.g., <think>...</think> tags)
    • No human-labeled intermediate steps required

  3. Rejection sampling + SFT: generate many solutions per problem, keep only correct ones, fine-tune again. This stabilizes behavior after RL.

  4. Multi-task RL: continue RL with mixed reasoning + helpfulness tasks, to prevent over-specialization.
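
The reward in step 2 combines an accuracy term and a format term, and GRPO computes advantages relative to a group of sampled completions rather than from a learned value model. A minimal sketch — the tag-based verifier, the 0.1 format weight, and the toy completions are illustrative assumptions, not the R1 recipe:

```python
import re
import statistics

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """+1 if the stated final answer matches ground truth, else 0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == ground_truth else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for using the required <think>...</think> structure."""
    ok = re.search(r"<think>.*?</think>", completion, re.DOTALL) is not None
    return 0.1 if ok else 0.0  # weight is an illustrative choice

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage: normalize each reward by the group's
    mean and standard deviation (no learned value function needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# One prompt, a group of sampled completions (toy data):
group = [
    "<think>15% of 80 is 0.15 * 80 = 12</think><answer>12</answer>",
    "<think>maybe 15?</think><answer>15</answer>",
    "<answer>12</answer>",  # correct answer but no format bonus
]
rewards = [accuracy_reward(c, "12") + format_reward(c) for c in group]
advs = grpo_advantages(rewards)
```

Note that the correct-and-well-formatted completion gets the highest advantage, the wrong one a negative advantage, and no intermediate steps were ever labeled.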

The Format Reward’s Role

Requiring the model to write out reasoning in <think> tags before answering is not merely cosmetic:

  1. Forces the model to dedicate tokens to thinking (longer reasoning → better answers)
  2. Creates a structured trace usable for distillation into smaller models
  3. Makes the reasoning process auditable (you can see what the model “considered”)
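
Separating the auditable trace from the visible answer is a simple parsing step. A sketch, assuming the reasoning is wrapped in <think> tags and whatever follows the closing tag is the answer:

```python
import re

def split_trace(output: str) -> tuple[str, str]:
    """Split a model output into (reasoning_trace, final_answer),
    assuming a <think>...</think> block precedes the answer."""
    match = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if match is None:
        return "", output.strip()  # no trace: treat it all as the answer
    return match.group(1).strip(), match.group(2).strip()

trace, answer = split_trace(
    "<think>0.15 * 80 = 12. Check: 10% is 8, 5% is 4, 8 + 4 = 12.</think> 12"
)
```

The same split is what makes the traces reusable downstream: the trace feeds distillation, the answer feeds the verifier.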

Distillation from Reasoning Models

Reasoning capabilities discovered by large RL-trained models can be distilled into much smaller models via imitation learning:

  • Generate reasoning traces from the large model on a training set
  • Fine-tune a smaller model on those traces (SFT on the large model’s “thinking”)
  • The small model inherits reasoning patterns without needing RL training itself
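
The data-preparation step above reduces to filtering the teacher's sampled traces down to verified ones and formatting them as SFT pairs. A sketch with illustrative function and field names (not an actual pipeline API):

```python
def build_distill_set(samples, verify):
    """Keep only teacher traces whose final answer verifies, and format
    them as prompt/completion pairs for supervised fine-tuning."""
    pairs = []
    for s in samples:
        if verify(s["answer"], s["ground_truth"]):
            # The completion keeps the full trace so the student
            # imitates the reasoning, not just the final answer.
            completion = f"<think>{s['trace']}</think>{s['answer']}"
            pairs.append({"prompt": s["prompt"], "completion": completion})
    return pairs

# Toy teacher samples: two attempts at the same problem, one wrong.
samples = [
    {"prompt": "What is 15% of 80?", "trace": "0.15 * 80 = 12",
     "answer": "12", "ground_truth": "12"},
    {"prompt": "What is 15% of 80?", "trace": "15 / 80 = 0.1875",
     "answer": "0.1875", "ground_truth": "12"},
]
sft_data = build_distill_set(samples, lambda a, gt: a.strip() == gt.strip())
```

Only the verified attempt survives, so the student never trains on a failed reasoning path.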

DeepSeek-R1-Distill-Qwen-7B (trained on R1 outputs) outperforms much larger non-reasoning models such as GPT-4o on math benchmarks. The patterns, once discovered, transfer efficiently.

Key Sources

  • grpo — the RL algorithm commonly used in reasoning training pipelines
  • rlhf — RL reasoning training extends RLHF ideas to outcome-based reasoning rewards
  • chain-of-thought — CoT prompting is the inference-time precursor; RL reasoning internalizes it via training
  • ppo — the baseline RL algorithm; GRPO is preferred for reasoning tasks
  • sft — cold-start SFT and rejection-sampling SFT are part of the full R1 pipeline
  • scaling-laws — test-time compute scaling is a new axis complementary to training-time scaling

Open Questions

  • Does RL-based reasoning generalize beyond math and code to domains without verifiable rewards?
  • What is the optimal ratio of test-time compute to training compute for a given quality target?
  • Can reasoning capabilities be reliably elicited from models smaller than 7B?
  • Do the emergent reasoning behaviors (self-verification, backtracking) represent genuine cognitive strategies or sophisticated pattern matching?