Concepts: chain-of-thought | in-context-learning | sampling | reasoning-rl
Builds on: chain-of-thought-prompting
Leads to: tree-of-thoughts-deliberate-problem-solving
Chain-of-thought prompting was a breakthrough: showing a model a few worked reasoning examples gets it to reason step by step, dramatically improving performance on hard problems. But there was an assumption nobody questioned — that you’d use the model’s first attempt. Wang et al. asked: what if you sampled the reasoning process many times and took a vote? No extra model, no training, no verifier. The answer turned out to be +17.9% on GSM8K, +11% on SVAMP, consistent improvements across every benchmark they tested.
The core idea
The analogy: Imagine a math class of 20 students taking an exam. Ask them all the same problem. Some will make arithmetic errors. Some will take longer routes. But the most common final answer across the class is very likely correct — because there are many independent ways to arrive at 11, but only one idiosyncratic way to arrive at 7 by mistake. The crowd’s consensus outperforms any individual student.
Self-consistency does exactly this with LLMs. The “students” are the same model sampled repeatedly at a nonzero temperature. Each sample is an independent reasoning chain. The final answer is determined by majority vote across all chains.
The paper’s central claim: “a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.” Correct answers cluster. Wrong answers scatter. Vote on the cluster.
The mechanism, step by step
Standard chain-of-thought prompting:
1. Build a few-shot prompt with worked examples (question → reasoning → answer)
2. Append the test question
3. Run greedy decoding: pick the most probable token at each step
4. Return the single output answer
Self-consistency replaces steps 3–4:
1. Same few-shot prompt
2. Sample k reasoning chains at temperature T > 0 (typically T ≈ 0.5–0.7)
3. Extract the final answer from each chain
4. Majority vote: return the answer that the most chains agree on
No learned verifier. No re-ranker. No extra parameters. Just count answers.
```
GREEDY CoT:                     SELF-CONSISTENCY (k=5):

Prompt → [LLM]                  Prompt → [LLM, T=0.7] → "...= 11" ─┐
  ↓                                    → [LLM, T=0.7] → "...= 11" ─┤
"5+6=11. Answer: 11"                   → [LLM, T=0.7] → "...= 11" ─┼→ Vote → 11 ✓
                                       → [LLM, T=0.7] → "...= 7"  ─┤
(if this run erred:                    → [LLM, T=0.7] → "...= 11" ─┘
 wrong forever)
                                diverse paths, consistent answer wins
```
Why this works — the Bayesian view:
Self-consistency approximates marginalizing over reasoning paths:

$$P(a \mid \text{prompt}) = \sum_{r} P(a, r \mid \text{prompt}) \;\approx\; \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}\!\left[A(r_i) = a\right]$$

where $r_i$ is the $i$-th sampled reasoning chain and $A(\cdot)$ extracts its final answer. You’re estimating the posterior probability of each answer by averaging across many independent reasoning paths. The most probable answer wins.
Numeric walkthrough — GSM8K-style example:
Question: “Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have now?”
Sample k=5 chains:
Chain 1: "5 balls. 2 cans × 3 = 6 new. 5 + 6 = 11." → 11
Chain 2: "5 original + (2 × 3) = 5 + 6 = 11 balls." → 11
Chain 3: "Roger starts with 5. Buys 6 more. 5 + 6 = 11." → 11
Chain 4: "5 + 2 = 7 balls." (error: counted cans as balls) → 7
Chain 5: "Each can: 3. Two cans: 3+3=6. Total: 5+6=11." → 11
Vote tally: 11 → 4 votes | 7 → 1 vote
Self-consistency selects: 11 ✓
Greedy decoding (single run): might return 7 if that error path had higher local token probability.
The correct answer (11) appears in 80% of chains. The wrong answer (7) is idiosyncratic — one particular way to misread “cans” — while there are many valid arithmetic paths to 11.
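The tally above in code, treating each answer’s vote share as its estimated probability:

```python
from collections import Counter

# Final answers extracted from the five sampled chains above
final_answers = [11, 11, 11, 7, 11]
tally = Counter(final_answers)

print(tally.most_common())             # [(11, 4), (7, 1)]
print(tally[11] / len(final_answers))  # 0.8 — empirical vote share of 11
```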
Find the instinct
Why does greedy decoding fail on reasoning tasks at all?
Reasoning chains are long and sequential. Each step is conditioned on everything before it. A locally plausible wrong token early in the chain (“5 + 2 = 7”) gets committed to, and every subsequent token now builds on that mistake. Greedy decoding optimizes locally at each step — but local optima compound into global failures.
The key insight: for tasks with a unique correct answer, diverse random samples of the reasoning process are weakly correlated in their errors but strongly correlated in their correct paths. Wrong paths are idiosyncratic; right paths converge. This is exactly the condition under which majority voting is reliable.
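That condition can be quantified with a simple, admittedly idealized model: if each sampled chain is independently correct with probability p and wrong chains scatter across distinct answers, the probability that a strict majority of k chains is correct is a binomial tail. A quick check at p = 0.8 (odd k to avoid ties):

```python
from math import comb

def majority_correct(p: float, k: int) -> float:
    """P(a strict majority of k independent chains is correct), assuming each
    chain is right with probability p (idealized independence model)."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))

for k in (1, 5, 11, 41):
    print(k, round(majority_correct(0.8, k), 4))
```

Because wrong answers scatter, a plurality usually suffices in practice, so the strict-majority figure is conservative. The curve also flattens quickly in k, which matches the diminishing returns reported below.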
Why didn’t people try this immediately after chain-of-thought? Two reasons. First, CoT itself was so new that the immediate reaction was “this works!” rather than “what if we ran it 40 times?” Second, it only works cleanly when final answers can be compared as discrete symbols — numbers, letters, multiple-choice labels. Open-ended generation has no natural “most consistent” answer to vote on.
Results
PaLM 540B + 8-shot CoT, self-consistency with k=40 samples:
| Benchmark | Type | Greedy CoT | Self-Consistency | Δ |
|---|---|---|---|---|
| GSM8K | Math (grade school) | 56.9% | 74.4% | +17.5% |
| SVAMP | Math (robust) | 79.0% | 89.5% | +10.5% |
| AQuA | Algebra MCQ | 35.9% | 48.0% | +12.1% |
| StrategyQA | Commonsense | 65.4% | 71.8% | +6.4% |
| ARC-challenge | Science QA | 81.0% | 84.9% | +3.9% |
Gains are largest on the hardest tasks — the ones where greedy CoT most often makes compounding errors. ARC-challenge is easy enough that greedy rarely fails catastrophically, so there’s less room to recover.
How many samples do you need?
Returns are steep from k=1 to k=10, then flatten. With PaLM on GSM8K:
- k=1 (greedy): 56.9%
- k=5: ~68%
- k=10: ~72%
- k=40: 74.4%
Most of the gain is captured by k=10–20. Going to k=40 adds ~2 more points at 4× the cost of k=10.
What breaks it:
- Open-ended tasks: you can’t majority-vote free text
- Systematic model bias: if the model always makes the same mistake, sampling more won’t help
- Temperature too high: chains become incoherent, and voting on noise is useless
- Tasks where final answers aren’t extractable as discrete symbols
Practical implications
If you’re doing chain-of-thought prompting on any structured reasoning task — math, code correctness, multi-step logic, classification — self-consistency is the first thing to try before reaching for a more expensive solution. The implementation is simple: sample k completions at T≈0.7, parse the final answer from each, take the mode. No new models, no training data, no infrastructure.
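One practical wrinkle: the mode is only meaningful if equivalent surface forms vote together. A small normalization pass before tallying helps; the specific rules here are illustrative assumptions, to be adapted per task:

```python
from collections import Counter

def normalize(ans: str) -> str:
    """Map equivalent surface forms ('$11', '11.0', '1,000') to one canonical
    string so they count as the same vote."""
    cleaned = ans.strip().lstrip("$").replace(",", "")
    try:
        x = float(cleaned)
        return str(int(x)) if x == int(x) else str(x)
    except ValueError:
        return cleaned.lower()  # non-numeric answers: case-fold only

raw = ["$11", "11.0", "11", "7", "11"]
votes = Counter(normalize(a) for a in raw)
print(votes.most_common(1))  # [('11', 4)]
```

Without normalization, "$11" and "11.0" would split the vote three ways and the erroneous "7" would look more competitive than it is.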
The cost tradeoff is linear: k=20 costs 20× a single greedy run. For high-stakes tasks (financial calculation, code verification, medical reasoning), this is almost always worth it. For interactive applications where latency matters, k=5 captures most of the gain at 5× cost.
One underrated benefit: self-consistency gives calibration for free. If 39/40 chains agree, you can trust the answer. If 12/40 agree, the model is uncertain — treat the output accordingly. Greedy decoding gives no signal about confidence.
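The agreement fraction doubles as that confidence signal. A sketch of using it to gate low-confidence outputs (the 0.6 threshold is an arbitrary illustration):

```python
from collections import Counter

def vote_with_confidence(answers: list[str], threshold: float = 0.6):
    """Return (answer, confidence, trusted): confidence is the mode's vote
    share; trusted is False when agreement falls below the threshold."""
    answer, votes = Counter(answers).most_common(1)[0]
    confidence = votes / len(answers)
    return answer, confidence, confidence >= threshold

print(vote_with_confidence(["11"] * 39 + ["7"]))                      # near-unanimous
print(vote_with_confidence(["11"] * 12 + [f"w{i}" for i in range(28)]))  # scattered
```

In the scattered case the caller can fall back to a human, a stronger model, or more samples rather than silently returning a low-agreement answer.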
The line of work this paper opened: chain-of-thought-prompting showed that reasoning paths help. Self-consistency showed that sampling reasoning paths helps more than committing to one. tree-of-thoughts-deliberate-problem-solving took the next step — instead of sampling paths randomly, search over them deliberately with lookahead and backtracking. The progression: single path → random ensemble → guided search.
Connections
- chain-of-thought — self-consistency is a decoding strategy that layers on top of CoT prompting
- in-context-learning — works purely via prompting, no weights updated
- sampling — relies on stochastic sampling at temperature > 0 for path diversity
- reasoning-rl — RL-for-reasoning work (GRPO, DeepSeek-R1) trains models to internalize what self-consistency achieves at inference time
- chain-of-thought-prompting — the foundation this paper extends
- tree-of-thoughts-deliberate-problem-solving — successor: makes the path search deliberate rather than random
Citation
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. https://arxiv.org/abs/2203.11171