Concepts: chain-of-thought | in-context-learning | sampling | reasoning-rl Builds on: chain-of-thought-prompting Leads to: tree-of-thoughts-deliberate-problem-solving

Chain-of-thought prompting was a breakthrough: showing a model a few worked reasoning examples gets it to reason step by step, dramatically improving performance on hard problems. But there was an assumption nobody questioned: that you'd use the model's first attempt. Wang et al. asked: what if you sampled the reasoning process many times and took a vote? No extra model, no training, no verifier. The answer turned out to be +17.5% on GSM8K, +10.5% on SVAMP, and consistent improvements across every benchmark they tested.

The core idea

The analogy: Imagine a math class of 20 students taking an exam. Ask them all the same problem. Some will make arithmetic errors. Some will take longer routes. But the most common final answer across the class is very likely correct — because there are many independent ways to arrive at 11, but only one idiosyncratic way to arrive at 7 by mistake. The crowd’s consensus outperforms any individual student.

Self-consistency does exactly this with LLMs. The “students” are the same model sampled repeatedly at a nonzero temperature. Each sample is an independent reasoning chain. The final answer is determined by majority vote across all chains.

The paper’s central claim: “a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.” Correct answers cluster. Wrong answers scatter. Vote on the cluster.

The mechanism, step by step

Standard chain-of-thought prompting:

  1. Build a few-shot prompt with worked examples (question → reasoning → answer)
  2. Append the test question
  3. Run greedy decoding: pick the most probable token at each step
  4. Return the single output answer

Self-consistency replaces steps 3–4:

  1. Same few-shot prompt
  2. Sample k reasoning chains at temperature T > 0 (typically T ≈ 0.7)
  3. Extract the final answer from each chain
  4. Majority vote: â = argmax_a Σᵢ 𝟙[aᵢ = a], i.e. the most frequent final answer

No learned verifier. No re-ranker. No extra parameters. Just count answers.

GREEDY CoT:                              SELF-CONSISTENCY (k=5):

Prompt → [LLM]                           Prompt → [LLM, T=0.7] → "...= 11"  ─┐
            ↓                                    → [LLM, T=0.7] → "...= 11"  ─┤
    "5+6=11. Answer: 11"                         → [LLM, T=0.7] → "...= 11"  ─┼→ Vote → 11 ✓
            ↓                                    → [LLM, T=0.7] → "...= 7"   ─┤
   (if this run erred: wrong forever)            → [LLM, T=0.7] → "...= 11"  ─┘
                                           diverse paths, consistent answer wins

Why this works — the Bayesian view:

Self-consistency approximates marginalizing over reasoning paths:

    P(a | x) = Σ_r P(r | x) · 𝟙[answer(r) = a] ≈ (1/k) Σᵢ₌₁ᵏ 𝟙[answer(rᵢ) = a]

where rᵢ is the i-th sampled reasoning chain and answer(·) extracts its final answer. You're estimating the posterior probability of each answer by averaging across many independent reasoning paths. The most probable answer wins.

Numeric walkthrough — GSM8K-style example:

Question: “Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have now?”

Sample k=5 chains:

Chain 1: "5 balls. 2 cans × 3 = 6 new. 5 + 6 = 11."              → 11
Chain 2: "5 original + (2 × 3) = 5 + 6 = 11 balls."              → 11
Chain 3: "Roger starts with 5. Buys 6 more. 5 + 6 = 11."         → 11
Chain 4: "5 + 2 = 7 balls."  (error: counted cans as balls)       →  7
Chain 5: "Each can: 3. Two cans: 3+3=6. Total: 5+6=11."          → 11

Vote tally:   11 → 4 votes   |   7 → 1 vote
Self-consistency selects: 11 ✓
Greedy decoding (single run): might return 7 if that error path
                               had higher local token probability

The correct answer (11) appears in 80% of chains. The wrong answer (7) is idiosyncratic — one particular way to misread “cans” — while there are many valid arithmetic paths to 11.
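
The tally is mechanical. Replaying the five chains above through a last-number extractor (a simplification; real chains need a more careful parser):

```python
import re
from collections import Counter

chains = [
    "5 balls. 2 cans × 3 = 6 new. 5 + 6 = 11.",
    "5 original + (2 × 3) = 5 + 6 = 11 balls.",
    "Roger starts with 5. Buys 6 more. 5 + 6 = 11.",
    "5 + 2 = 7 balls.",
    "Each can: 3. Two cans: 3+3=6. Total: 5+6=11.",
]

# Take the last number in each chain as its final answer, then vote.
answers = [re.findall(r"\d+", c)[-1] for c in chains]
tally = Counter(answers)
print(tally.most_common())   # [('11', 4), ('7', 1)]
```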

Find the instinct

Why does greedy decoding fail on reasoning tasks at all?

Reasoning chains are long and sequential. Each step is conditioned on everything before it. A locally plausible wrong token early in the chain (“5 + 2 = 7”) gets committed to, and every subsequent token now builds on that mistake. Greedy decoding optimizes locally at each step — but local optima compound into global failures.

The key insight: for tasks with a unique correct answer, diverse random samples of the reasoning process are weakly correlated in their errors but strongly correlated in their correct paths. Wrong paths are idiosyncratic; right paths converge. This is exactly the condition under which majority voting is reliable.

Why didn’t people try this immediately after chain-of-thought? Two reasons. First, CoT itself was so new that the immediate reaction was “this works!” rather than “what if we ran it 40 times?” Second, it only works cleanly when final answers can be compared as discrete symbols — numbers, letters, multiple-choice labels. Open-ended generation has no natural “most consistent” answer to vote on.

Results

PaLM 540B + 8-shot CoT, self-consistency with k=40 samples:

Benchmark       Type                  Greedy CoT   Self-Consistency   Δ
GSM8K           Math (grade school)   56.9%        74.4%              +17.5%
SVAMP           Math (robust)         79.0%        89.5%              +10.5%
AQuA            Algebra MCQ           35.9%        48.0%              +12.1%
StrategyQA      Commonsense           65.4%        71.8%              +6.4%
ARC-challenge   Science QA            81.0%        84.9%              +3.9%

Gains are largest on the hardest tasks — the ones where greedy CoT most often makes compounding errors. ARC-challenge is easy enough that greedy rarely fails catastrophically, so there’s less room to recover.

How many samples do you need?

Returns are steep from k=1 to k=10, then flatten. With PaLM on GSM8K:

  • k=1 (greedy): 56.9%
  • k=5: ~68%
  • k=10: ~72%
  • k=40: 74.4%

Most of the gain is captured by k=10–20. Going to k=40 adds ~2 more points at 4× the cost of k=10.
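
Why returns flatten can be seen in a toy model (illustrative only, not the paper's data): assume each chain independently hits the correct answer with probability p and otherwise lands on one of n scattered wrong answers, then simulate the vote:

```python
import random
from collections import Counter

def majority_accuracy(p_correct: float, n_wrong: int, k: int,
                      trials: int = 20000) -> float:
    """Toy model: each chain is correct with prob. p_correct, else lands
    uniformly on one of n_wrong distinct wrong answers. Returns the
    fraction of trials where majority vote over k chains is correct."""
    rng = random.Random(0)
    wins = 0
    for _ in range(trials):
        votes = Counter()
        for _ in range(k):
            if rng.random() < p_correct:
                votes["right"] += 1
            else:
                votes[f"wrong{rng.randrange(n_wrong)}"] += 1
        wins += votes.most_common(1)[0][0] == "right"
    return wins / trials

for k in (1, 5, 10, 40):
    # accuracy rises steeply at small k, then saturates
    print(k, round(majority_accuracy(0.6, 10, k), 3))
```

Because wrong answers split their votes across many labels, the correct cluster wins long before the per-chain accuracy would suggest; once it wins almost every time, extra samples buy nothing.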

What breaks it:

  • Open-ended tasks: you can’t majority-vote free text
  • Systematic model bias: if the model always makes the same mistake, sampling more won’t help
  • Temperature too high: chains become incoherent, and voting on noise is useless
  • Tasks where final answers aren’t extractable as discrete symbols

Practical implications

If you’re doing chain-of-thought prompting on any structured reasoning task — math, code correctness, multi-step logic, classification — self-consistency is the first thing to try before reaching for a more expensive solution. The implementation is simple: sample k completions at T≈0.7, parse the final answer from each, take the mode. No new models, no training data, no infrastructure.

The cost tradeoff is linear: k=20 costs 20× a single greedy run. For high-stakes tasks (financial calculation, code verification, medical reasoning), this is almost always worth it. For interactive applications where latency matters, k=5 captures most of the gain at 5× cost.

One underrated benefit: self-consistency gives calibration for free. If 39/40 chains agree, you can trust the answer. If 12/40 agree, the model is uncertain — treat the output accordingly. Greedy decoding gives no signal about confidence.
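
A sketch of that confidence readout, computed from the vote counts you already have:

```python
from collections import Counter

def vote_with_confidence(answers: list[str]) -> tuple[str, float]:
    """Majority answer plus the fraction of chains that agree with it."""
    tally = Counter(answers)
    answer, count = tally.most_common(1)[0]
    return answer, count / len(answers)

print(vote_with_confidence(["11"] * 39 + ["7"]))
# ('11', 0.975): near-unanimous, trust it

print(vote_with_confidence(["11"] * 12 + [str(n) for n in range(20, 48)]))
# ('11', 0.3): low agreement, treat the output as uncertain
```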

The line of work this paper opened: chain-of-thought-prompting showed that reasoning paths help. Self-consistency showed that sampling reasoning paths helps more than committing to one. tree-of-thoughts-deliberate-problem-solving took the next step — instead of sampling paths randomly, search over them deliberately with lookahead and backtracking. The progression: single path → random ensemble → guided search.

Citation

arXiv:2203.11171

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. https://arxiv.org/abs/2203.11171