Why this matters
DeepSeek-R1 (most directly its R1-Zero variant) showed that pure RL on verifiable correctness (math, code, logic) can produce frontier-grade reasoning, with chain-of-thought emerging from the optimization without explicit CoT supervision. If that recipe transfers to non-verifiable domains (open-ended writing, judgment, novel science), it becomes the natural successor to RLHF as the default training paradigm. If it does not, and the gains stay locked to domains with clean reward signals, then RL on verifiable rewards is a powerful but bounded tool, not the universal recipe.
Current best understanding
(2026-04-28) The recipe demonstrably works wherever rewards can be checked automatically: math (final answer), code (unit tests pass), logic puzzles (formal verification), competitive programming (judge accept). It produces long, structured chains-of-thought as a side effect of the optimization.
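To make "checked automatically" concrete, here is a minimal sketch of two binary reward functions, one for math final answers and one for unit-tested code. The `Answer:` marker, the function names, and the test-case format are all assumptions made for illustration; they are not taken from DeepSeek-R1 or any specific training stack.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the token after the last 'Answer:' marker (hypothetical format)."""
    matches = re.findall(r"Answer:\s*(\S+)", completion)
    return matches[-1].rstrip(".,") if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer equals the reference as an exact rational, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward.
        return 0.0


def code_reward(source: str, func_name: str, tests: list) -> float:
    """1.0 if the generated function passes every (args, expected) test, else 0.0.

    WARNING: exec on model output is unsafe; a real pipeline would need a sandbox.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failure.
        return 0.0
```

Both functions look only at the outcome. Nothing in them inspects the reasoning that led to the answer, which is why long chains-of-thought under this setup are a side effect and not a directly rewarded behavior.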
Open question: how far does this transfer? Three positions in the field:
- Optimist: long CoT trained on verifiable rewards generalizes to non-verifiable domains because reasoning skill is fungible. The model learned how to think.
- Pessimist: the model learned to game specific reward signals. Off-distribution, it confabulates structured-looking but vacuous reasoning.
- Realist (current best guess): partial transfer. Math/code RL improves general step-by-step reasoning modestly. Domain-specific reward design is still required for each new domain.
The strongest evidence right now is benchmark numbers, and most benchmarks cover verifiable domains, so the transfer question is hard to read directly off public results.
Evidence
- deepseek-r1-reasoning-via-reinforcement-learning — Demonstrates the pure-RL recipe; CoT emerges without explicit supervision.
- reasoning-rl — Aggregates the broader trend.
- chain-of-thought — CoT was originally a prompting technique; the RL recipe internalizes it.
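- grpo — GRPO (used in DeepSeek-R1) is a PPO variant that drops the learned value function and instead scores each sampled response against the other responses drawn for the same prompt, which suits cheap, automatically checkable rewards.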
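The core of GRPO is its group-relative advantage: sample several responses to one prompt, score each, and normalize the rewards within that group. A minimal sketch follows. The function name is hypothetical, and the choice of population standard deviation and the `eps` guard are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one prompt's group of sampled responses.

    Each advantage is (r_i - mean) / (std + eps). eps avoids division by zero
    when every response in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a binary verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group gets the same reward (all correct or all wrong), every advantage is zero and that prompt contributes no gradient signal. This is one practical reason the approach depends on prompts the model solves some of the time but not always.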
What would settle it
- Held-out evals on non-verifiable domains (long-form writing quality, judgment calls, scientific reasoning), comparing R1-style models to instruction-tuned baselines, judged by humans rather than autograders.
- Process-supervision vs. outcome-supervision ablations: does verifiable-reward RL produce correct reasoning steps, or just steps that look correct and arrive at the right answer?
- Transfer studies: train RL on math + code only, then test on legal reasoning, medical diagnosis, etc. Does the reasoning skill transfer, or only the surface form?
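The process-vs-outcome question in the second item can be illustrated with a toy checker. The sketch below assumes a reasoning trace has already been parsed into arithmetic steps; the `Step` structure and both function names are hypothetical. Parsing real chains-of-thought into checkable steps is the hard part and is not attempted here.

```python
import operator
from typing import NamedTuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


class Step(NamedTuple):
    left: int
    op: str
    right: int
    claimed: int  # the result the model asserted for this step


def outcome_correct(steps: list, reference: int) -> bool:
    """Outcome supervision: only the final claimed value is compared."""
    return bool(steps) and steps[-1].claimed == reference


def process_correct(steps: list) -> bool:
    """Process supervision: every individual step must be arithmetically valid."""
    return all(OPS[s.op](s.left, s.right) == s.claimed for s in steps)


# Target: (3 + 4) * 2 = 14.
sound = [Step(3, "+", 4, 7), Step(7, "*", 2, 14)]
# Two wrong steps whose errors cancel: 3 + 4 is not 8, and 8 * 2 is not 14.
lucky = [Step(3, "+", 4, 8), Step(8, "*", 2, 14)]
```

Here `lucky` passes `outcome_correct(lucky, 14)` but fails `process_correct(lucky)`, while `sound` passes both. An ablation along these lines would measure how often outcome-rewarded traces fall into the `lucky` category.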