Concepts: chain-of-thought | in-context-learning | sampling | reasoning-rl Builds on: chain-of-thought-prompting Leads to: tree-of-thoughts-deliberate-problem-solving

Chain-of-thought prompting was a breakthrough: showing a model a few worked reasoning examples gets it to reason step by step, dramatically improving performance on hard problems. But there was an assumption nobody questioned: that you'd use the model's first attempt. Wang et al. asked: what if you sampled the reasoning process many times and took a vote? No extra model, no training, no verifier. The answer turned out to be +17.5% on GSM8K, +10.5% on SVAMP, and consistent improvements across every benchmark they tested.

The core idea

The analogy: Imagine a math class of 20 students taking an exam. Ask them all the same problem. Some will make arithmetic errors. Some will take longer routes. But the most common final answer across the class is very likely correct — because there are many independent ways to arrive at 11, but only one idiosyncratic way to arrive at 7 by mistake. The crowd’s consensus outperforms any individual student.

Self-consistency does exactly this with LLMs. The “students” are the same model sampled repeatedly at a nonzero temperature. Each sample is an independent reasoning chain. The final answer is determined by majority vote across all chains.

The paper’s central claim: “a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.” Correct answers cluster. Wrong answers scatter. Vote on the cluster.

The mechanism, step by step

Standard chain-of-thought prompting:

  1. Build a few-shot prompt with worked examples (question → reasoning → answer)
  2. Append the test question
  3. Run greedy decoding: pick the most probable token at each step
  4. Return the single output answer

Self-consistency replaces steps 3–4:

  1. Same few-shot prompt
  2. Sample k reasoning chains at temperature T > 0 (typically T ≈ 0.7)
  3. Extract the final answer from each chain
  4. Majority vote: â = argmax_a Σᵢ 𝟙[aᵢ = a], i.e. the most frequent final answer

No learned verifier. No re-ranker. No extra parameters. Just count answers.

GREEDY CoT:                              SELF-CONSISTENCY (k=5):

Prompt → [LLM]                           Prompt → [LLM, T=0.7] → "...= 11"  ─┐
            ↓                                    → [LLM, T=0.7] → "...= 11"  ─┤
    "5+6=11. Answer: 11"                         → [LLM, T=0.7] → "...= 11"  ─┼→ Vote → 11 ✓
            ↓                                    → [LLM, T=0.7] → "...= 7"   ─┤
   (if this run erred: wrong forever)            → [LLM, T=0.7] → "...= 11"  ─┘
                                           diverse paths, consistent answer wins

Why this works — the Bayesian view:

Self-consistency approximates marginalizing over reasoning paths:

    P(a | x) = Σ_r P(r | x) · 𝟙[answer(r) = a] ≈ (1/k) Σᵢ₌₁ᵏ 𝟙[answer(rᵢ) = a]

where rᵢ is the i-th sampled reasoning chain and answer(·) extracts its final answer. You're estimating the posterior probability of each answer by averaging across many independent reasoning paths. The most probable answer wins.

Numeric walkthrough — GSM8K-style example:

Question: “Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have now?”

Sample k=5 chains:

Chain 1: "5 balls. 2 cans × 3 = 6 new. 5 + 6 = 11."              → 11
Chain 2: "5 original + (2 × 3) = 5 + 6 = 11 balls."              → 11
Chain 3: "Roger starts with 5. Buys 6 more. 5 + 6 = 11."         → 11
Chain 4: "5 + 2 = 7 balls."  (error: counted cans as balls)       →  7
Chain 5: "Each can: 3. Two cans: 3+3=6. Total: 5+6=11."          → 11

Vote tally:   11 → 4 votes   |   7 → 1 vote
Self-consistency selects: 11 ✓
Greedy decoding (single run): might return 7 if that error path
                               had higher local token probability

The correct answer (11) appears in 80% of chains. The wrong answer (7) is idiosyncratic — one particular way to misread “cans” — while there are many valid arithmetic paths to 11.
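
The tally is mechanical. Replaying the five chains above through a last-number extractor (a simplification; real chains need a more careful parser):

```python
import re
from collections import Counter

chains = [
    "5 balls. 2 cans × 3 = 6 new. 5 + 6 = 11.",
    "5 original + (2 × 3) = 5 + 6 = 11 balls.",
    "Roger starts with 5. Buys 6 more. 5 + 6 = 11.",
    "5 + 2 = 7 balls.",
    "Each can: 3. Two cans: 3+3=6. Total: 5+6=11.",
]

# Take the last number in each chain as its final answer, then vote.
answers = [re.findall(r"\d+", c)[-1] for c in chains]
tally = Counter(answers)
print(tally.most_common())   # [('11', 4), ('7', 1)]
```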

Find the instinct

Why does greedy decoding fail on reasoning tasks at all?

Reasoning chains are long and sequential. Each step is conditioned on everything before it. A locally plausible wrong token early in the chain (“5 + 2 = 7”) gets committed to, and every subsequent token now builds on that mistake. Greedy decoding optimizes locally at each step — but local optima compound into global failures.

The key insight: for tasks with a unique correct answer, diverse random samples of the reasoning process are weakly correlated in their errors but strongly correlated in their correct paths. Wrong paths are idiosyncratic; right paths converge. This is exactly the condition under which majority voting is reliable.

Why didn’t people try this immediately after chain-of-thought? Two reasons. First, CoT itself was so new that the immediate reaction was “this works!” rather than “what if we ran it 40 times?” Second, it only works cleanly when final answers can be compared as discrete symbols — numbers, letters, multiple-choice labels. Open-ended generation has no natural “most consistent” answer to vote on.

Results

PaLM 540B + 8-shot CoT, self-consistency with k=40 samples:

Benchmark       Type                  Greedy CoT   Self-Consistency   Δ
GSM8K           Math (grade school)   56.9%        74.4%              +17.5%
SVAMP           Math (robust)         79.0%        89.5%              +10.5%
AQuA            Algebra MCQ           35.9%        48.0%              +12.1%
StrategyQA      Commonsense           65.4%        71.8%              +6.4%
ARC-challenge   Science QA            81.0%        84.9%              +3.9%

Gains are largest on the hardest tasks — the ones where greedy CoT most often makes compounding errors. ARC-challenge is easy enough that greedy rarely fails catastrophically, so there’s less room to recover.

How many samples do you need?

Returns are steep from k=1 to k=10, then flatten. With PaLM on GSM8K:

  • k=1 (greedy): 56.9%
  • k=5: ~68%
  • k=10: ~72%
  • k=40: 74.4%

Most of the gain is captured by k=10–20. Going to k=40 adds ~2 more points at 4× the cost of k=10.
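
Why returns flatten can be seen in a toy model (illustrative only, not the paper's data): assume each chain independently hits the correct answer with probability p and otherwise lands on one of n scattered wrong answers, then simulate the vote:

```python
import random
from collections import Counter

def majority_accuracy(p_correct: float, n_wrong: int, k: int,
                      trials: int = 20000) -> float:
    """Toy model: each chain is correct with prob. p_correct, else lands
    uniformly on one of n_wrong distinct wrong answers. Returns the
    fraction of trials where majority vote over k chains is correct."""
    rng = random.Random(0)
    wins = 0
    for _ in range(trials):
        votes = Counter()
        for _ in range(k):
            if rng.random() < p_correct:
                votes["right"] += 1
            else:
                votes[f"wrong{rng.randrange(n_wrong)}"] += 1
        wins += votes.most_common(1)[0][0] == "right"
    return wins / trials

for k in (1, 5, 10, 40):
    # accuracy rises steeply at small k, then saturates
    print(k, round(majority_accuracy(0.6, 10, k), 3))
```

Because wrong answers split their votes across many labels, the correct cluster wins long before the per-chain accuracy would suggest; once it wins almost every time, extra samples buy nothing.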

What breaks it:

  • Open-ended tasks: you can’t majority-vote free text
  • Systematic model bias: if the model always makes the same mistake, sampling more won’t help
  • Temperature too high: chains become incoherent, and voting on noise is useless
  • Tasks where final answers aren’t extractable as discrete symbols

Practical implications

If you’re doing chain-of-thought prompting on any structured reasoning task — math, code correctness, multi-step logic, classification — self-consistency is the first thing to try before reaching for a more expensive solution. The implementation is simple: sample k completions at T≈0.7, parse the final answer from each, take the mode. No new models, no training data, no infrastructure.

The cost tradeoff is linear: k=20 costs 20× a single greedy run. For high-stakes tasks (financial calculation, code verification, medical reasoning), this is almost always worth it. For interactive applications where latency matters, k=5 captures most of the gain at 5× cost.

One underrated benefit: self-consistency gives calibration for free. If 39/40 chains agree, you can trust the answer. If 12/40 agree, the model is uncertain — treat the output accordingly. Greedy decoding gives no signal about confidence.
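
A sketch of that confidence readout, computed from the vote counts you already have:

```python
from collections import Counter

def vote_with_confidence(answers: list[str]) -> tuple[str, float]:
    """Majority answer plus the fraction of chains that agree with it."""
    tally = Counter(answers)
    answer, count = tally.most_common(1)[0]
    return answer, count / len(answers)

print(vote_with_confidence(["11"] * 39 + ["7"]))
# ('11', 0.975): near-unanimous, trust it

print(vote_with_confidence(["11"] * 12 + [str(n) for n in range(20, 48)]))
# ('11', 0.3): low agreement, treat the output as uncertain
```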

The line of work this paper opened: chain-of-thought-prompting showed that reasoning paths help. Self-consistency showed that sampling reasoning paths helps more than committing to one. tree-of-thoughts-deliberate-problem-solving took the next step — instead of sampling paths randomly, search over them deliberately with lookahead and backtracking. The progression: single path → random ensemble → guided search.

Citation

arXiv:2203.11171

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. https://arxiv.org/abs/2203.11171