The Problem

Language models trained on next-token prediction are optimized to produce fluent text. On tasks that require multi-step reasoning — arithmetic, logic, strategy — they often get the right answer by accident (pattern-matching the answer format) or fail entirely.

Before CoT, a typical LLM prompt looked like:

Q: Roger has 5 tennis balls. He buys 2 more cans, each has 3 balls. How many does he have?
A: 11

The model tries to map directly from question to answer in a single step. For anything non-trivial this fails: the entire multi-step computation has to happen implicitly, with no intermediate results ever written into the context.

The Core Insight

Humans solve hard problems by thinking out loud. A student solving a math problem writes scratch work. An engineer debugging traces through states. A detective narrates possibilities.

What if you asked the model to do the same?

Chain-of-thought prompting includes intermediate reasoning steps in the few-shot examples. The model then generates its own steps before giving a final answer.

The question isn’t “can the model solve this?” It’s “can the model solve this if we give it room to think?”

Mechanism in Plain English

Few-shot CoT (the original formulation):

  1. Write 8 exemplars where each shows: problem → step-by-step reasoning → final answer
  2. Prepend these to the actual question
  3. The model follows the format: generates reasoning, then gives an answer
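The recipe above can be sketched in a few lines of Python. The exemplar list and `build_cot_prompt` helper are illustrative names, not a real library API, and only one of the eight exemplars is shown:

```python
# Minimal sketch of few-shot CoT prompt assembly.
# EXEMPLARS and build_cot_prompt are hypothetical names.

EXEMPLARS = [
    {
        "q": ("Roger has 5 tennis balls. He buys 2 more cans, "
              "each has 3 balls. How many does he have?"),
        "a": ("Roger starts with 5 balls. 2 cans of 3 balls each is "
              "6 balls. 5 + 6 = 11. The answer is 11."),
    },
    # ...the original formulation uses 8 such worked exemplars
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked exemplars so the model imitates their
    reasoning-then-answer format, then leave the final 'A:' open."""
    blocks = [f"Q: {ex['q']}\nA: {ex['a']}" for ex in EXEMPLARS]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)
```

The model's completion continues after the trailing `A:`, producing its own reasoning steps before the answer.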

Zero-shot CoT (“Let’s think step by step”):

  1. Just append “Let’s think step by step” to the prompt
  2. No examples needed — the model generates its own reasoning
  3. Discovered by Kojima et al. (2022), works on most large models
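In code, zero-shot CoT is mostly string surgery. Kojima et al. actually run two stages: the trigger phrase elicits reasoning, then a second prompt extracts a terse final answer. The function names below are illustrative:

```python
REASONING_TRIGGER = "Let's think step by step."
EXTRACTION_TRIGGER = "Therefore, the answer is"  # Kojima et al.'s 2nd-stage prompt

def stage1_prompt(question: str) -> str:
    # Stage 1: elicit free-form reasoning with the trigger phrase.
    return f"Q: {question}\nA: {REASONING_TRIGGER}"

def stage2_prompt(question: str, reasoning: str) -> str:
    # Stage 2: feed the model's reasoning back and ask for the answer.
    return f"{stage1_prompt(question)} {reasoning}\n{EXTRACTION_TRIGGER}"
```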

Self-consistency (best results):

  1. Sample many reasoning paths (e.g., 40 different CoT completions)
  2. Take majority vote on the final answer
  3. Outperforms greedy CoT decoding by double-digit percentage points on math benchmarks

ASCII Diagram

  Standard prompting (fails on multi-step):
  
  Q: If a train travels 60mph for 2.5 hours, how far? → [one step] → 150
  
  Works for simple arithmetic. For multi-step chained problems:
  
  Q: John has twice as many marbles as Sara. Sara has 3 more than Tom.
     Tom has 7. How many does John have? → [one step] → ??? (often wrong)
  
  ─────────────────────────────────────────────────────────────────────
  
  Chain-of-thought prompting:
  
  [Exemplar in prompt]:
  Q: John has twice as many marbles as Sara. Sara has 3 more than Tom.
     Tom has 7. How many does John have?
  A: Tom has 7 marbles.           ← step 1: anchor
     Sara has 7 + 3 = 10 marbles. ← step 2: first chain
     John has 2 × 10 = 20 marbles. ← step 3: second chain
     The answer is 20.            ← final answer
  
  [New question]:
  Q: Alice has 3 times as many books as Bob. Bob has 5 more than Carol.
     Carol has 4. How many does Alice have?
  A: Carol has 4 books.           ← model generates this...
     Bob has 4 + 5 = 9 books.
     Alice has 3 × 9 = 27 books.
     The answer is 27.            ← ...and this

Math Problem Walkthrough

Problem (from GSM8K benchmark): A store sells apples for $0.50 each. If you buy 6 or more, you get a 20% discount. Alice buys 8 apples. How much does she pay?

Without CoT (direct question-to-answer mapping): the model often outputs “$4.00”, applying the per-apple price but missing the discount.

With CoT:

Step 1: Count apples Alice buys: 8 apples.
Step 2: Is 8 ≥ 6? Yes, so the discount applies.
Step 3: Base price for 8 apples: 8 × $0.50 = $4.00.
Step 4: 20% discount on $4.00: $4.00 × 0.20 = $0.80.
Step 5: Final price: $4.00 - $0.80 = $3.20.
The answer is $3.20.
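The five steps reduce to a few lines of arithmetic; this snippet simply mirrors the walkthrough:

```python
# Mirrors the CoT walkthrough above, step by step.
price_each = 0.50
count = 8

base = count * price_each                      # steps 1 & 3: 8 x $0.50 = $4.00
discount = base * 0.20 if count >= 6 else 0.0  # steps 2 & 4: 20% of $4.00 = $0.80
total = base - discount                        # step 5
print(f"${total:.2f}")  # → $3.20
```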

PaLM 540B on GSM8K:

  • Standard few-shot: 17.9% accuracy
  • CoT few-shot: 56.9% accuracy (3.2x improvement)
  • CoT + self-consistency (40 samples): 74.4%

What’s Clever

The key observation: forcing the model to externalize intermediate steps lets it use its own output as working memory.

Transformers have no explicit working memory. Every token in the output can attend to everything before it — including previously generated reasoning steps. By making the model write “Tom has 7 marbles. Sara has 7+3=10 marbles,” each intermediate result becomes part of the context for the next computation.

This is what makes CoT work: the model is doing arithmetic not in one token but across many tokens, each attending to the results of previous computations.

The emergence fact: CoT only works for models above ~100B parameters. For smaller models, CoT prompting either doesn’t help or actively hurts. This suggests the model needs enough capacity to both generate coherent reasoning steps AND use those steps correctly. At smaller sizes, the model might produce plausible-looking steps that don’t actually contribute to a correct answer.

The faithfulness question: CoT steps don’t necessarily reflect internal computation. A model can produce a correct final answer with a wrong reasoning chain, or produce a correct chain that doesn’t match how the model internally arrives at the answer. The steps are a kind of narrative rationalization, not a window into computation.

Key Sources

  • Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (few-shot CoT)
  • Kojima et al. (2022), “Large Language Models are Zero-Shot Reasoners” (zero-shot CoT)
  • Wang et al. (2022), “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (self-consistency)

Open Questions

  • Whether the model is “actually reasoning” or pattern-matching reasoning-shaped text
  • Optimal length and format of reasoning chains
  • Faithfulness: do CoT explanations reflect actual model computations?