You’re administering a test to a student. You ask: “A train leaves Chicago at 9am going 60mph toward New York, 800 miles away. A second train leaves New York at 10am going 80mph toward Chicago. At what time do they meet?”
One student writes “3:17pm” — correct answer, no working shown.
Another writes: “Together they close 60+80=140 miles per hour. 800/140 ≈ 5.7 hours after 9am, so they meet at about 2:43pm.” That student got the wrong answer — they forgot the second train doesn’t leave until 10am — but you can see exactly where the error crept in. You can teach them.
Large language models in 2021 were the first student, amplified to absurdity. GPT-3 and its peers had absorbed almost everything written by humans — but when asked to solve a math problem, they just… produced an answer. No reasoning visible. And the answer was often wrong in ways that suggested the model had no idea why it was right or wrong.
The core idea
The analogy: You’re teaching a new hire to handle customer refund requests. You could give them examples of (customer problem → refund decision), and let them pattern-match. Or you could give them examples of (customer problem → reasoning → refund decision), and let them learn the thought process. The second approach is more work to write out, but the hire can generalize to problems they’ve never seen. That’s chain-of-thought prompting.
The insight: language models are trained to complete text. If you give them exemplars where reasoning precedes the answer, they’ll generate reasoning before their answer. The reasoning isn’t decorative — it actually helps.
Before this paper, few-shot prompting looked like this:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: 11
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: [model answers here]
Chain-of-thought prompting changes the answer format in the examples:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 tennis balls. 2 cans of 3 tennis balls
each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A: [model now generates reasoning before answering]
That’s the whole trick. No fine-tuning. No new model weights. No retraining. Just different example formatting in the prompt.
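The formatting change above can be sketched in a few lines of Python. The exemplar wording follows the paper’s Roger example; the helper name `build_prompt` is illustrative, not from the paper:

```python
# The only difference between standard and chain-of-thought few-shot
# prompting is the text in the "A:" slot of each exemplar.

STANDARD_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: 11"
)

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 tennis balls. 2 cans of 3 tennis balls "
    "each is 6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def build_prompt(exemplars: list[str], question: str) -> str:
    """Concatenate exemplars, then append the test question.

    Swapping STANDARD_EXEMPLAR for COT_EXEMPLAR is the entire
    intervention -- the model and the question are unchanged.
    """
    return "\n\n".join(exemplars) + f"\n\nQ: {question}\nA:"

prompt = build_prompt(
    [COT_EXEMPLAR],
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
```

In the paper’s experiments there are 8 such exemplars per task, but the construction is the same: exemplars, blank line, test question, trailing `A:` for the model to complete.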
The mechanism, step by step:
- You write 8 examples of (question, chain-of-thought reasoning, answer). The questions are from the target domain (math, commonsense, symbolic reasoning). The reasoning is written by a human — it’s explicit and correct.
- You append your test question.
- The model, having seen 8 examples where reasoning precedes answers, generates its own reasoning chain for the new question.
- The model then gives an answer — which you extract (often by looking for “The answer is” or a number following the reasoning).
- The reasoning chain serves as scratch space: intermediate computations the model doesn’t have to keep track of internally.
The key mechanism: transformers generate text token by token, left to right. Once the model writes “23 - 20 = 3, then 3 + 6 =”, the next token is almost certainly “9” — because that’s what follows that arithmetic. The chain of thought scaffolds the computation through the generation process.
WITHOUT CHAIN-OF-THOUGHT:
Question → [black box] → Answer
                           ↑
            often wrong on multi-step problems

WITH CHAIN-OF-THOUGHT:
Question → Step 1 → Step 2 → Step 3 → Answer
             ↑        ↑        ↑         ↑
           each step written to the   now easy — the answer
           context window as it goes  follows from prior steps
Walkthrough with actual numbers:
The paper uses this exact example (one of the 8 exemplars used in experiments on GSM8K):
Q: "There are 15 trees in the grove. Grove workers will plant
trees in the grove today. After they are done, there will
be 21 trees. How many trees did the grove workers plant today?"
STANDARD FEW-SHOT ANSWER: 6
CHAIN-OF-THOUGHT ANSWER:
"There are 15 trees originally. Then there were 21 trees after
some more were planted. So there must have been 21 - 15 = 6.
The answer is 6."
On the GSM8K benchmark (grade-school math word problems), the paper’s numbers:

| Model | Parameters | GSM8K accuracy |
|---|---|---|
| PaLM, standard prompting | 540B | 17% |
| PaLM, CoT prompting | 540B | 57% |
| Fine-tuned GPT-3 + verifier | 175B | 55% |
PaLM + CoT with 540B params beats fine-tuned GPT-3 with a verifier.
57% vs 55% — and the CoT approach required zero additional training.
For context: 17% → 57% is a 3.4× improvement on the same model with no weight changes. You changed the prompt, not the model.
What’s clever — the instinct behind it:
The obvious way to make models better at reasoning would be fine-tuning. Train them on labeled reasoning chains. The community was doing this — scraping math datasets, training specialized models, building verifiers. All of this required significant compute and carefully curated data.
The instinct behind chain-of-thought is different: the model already knows how to reason — it just isn’t being asked to show the work. The pretrained model has seen countless worked examples of math problems in its training data. It’s seen textbooks, tutoring websites, Stack Overflow answers, forum explanations. The capability isn’t missing; it’s not being triggered.
The question becomes: what triggers it? The answer turns out to be simply — examples. If you show the model 8 examples where people wrote out their reasoning, it infers that this is the expected mode of response and does the same.
This reframes the problem from “how do we teach models to reason?” to “how do we elicit reasoning that’s already there?” That’s a fundamentally cheaper intervention.
“We explore how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.”
Translation: writing out the steps is doing real computational work. The chain of thought isn’t a post-hoc explanation — it’s the actual computation.
The emergence caveat — and why it matters:
The most unsettling finding in the paper is the scale dependence. Chain-of-thought prompting doesn’t help small models. It either does nothing or makes things worse.
The paper tests three model families at different scales (GPT-3 family, PaLM, LaMDA). The pattern is identical across all three: below ~100 billion parameters, CoT prompting provides zero benefit or hurts. Above ~100B, it starts helping. The gains get larger as you scale further.
“Chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of ~100B parameters.”
Translation: this isn’t something you can apply to a small model and get benefits. The model needs to be big enough that it can actually use the reasoning chain — meaning it has sufficient world knowledge and reasoning capacity to generate correct intermediate steps. A small model generates plausible-looking but wrong reasoning chains, which lead it to wrong answers.
This is genuinely strange and important. It means there’s a phase transition in model capability at scale. You can’t predict, from small-model performance, whether a large model will exhibit this capability. The capability isn’t present in embryonic form in smaller models and growing — it’s absent in smaller models and then present in larger ones.
Why does the reasoning chain actually help? Three mechanisms:
- Decomposition — multi-step problems require multiple pieces of information to be used in sequence. Writing step 1 makes step 2 easier; the context window now contains the output of step 1.
- Interpretability — if the chain is wrong, you can see where it went wrong. This isn’t just useful for debugging; during generation, the model can implicitly “check” its own steps by reading what it wrote.
- Length generalization — some problems require more steps than others. Chain-of-thought allows the model to take as many intermediate steps as needed without being constrained by a fixed answer format.
“The promise of this work is in the simplicity of the approach: chain of thought prompting is not specific to language models, and may be useful for any task that can benefit from explicit reasoning steps.”
Does it work? What breaks?
| Task | Model | Standard Prompting | Chain-of-Thought | Gain |
|---|---|---|---|---|
| GSM8K (math) | PaLM 540B | 17% | 57% | +40 pts |
| AQUA-RAT (algebra) | PaLM 540B | 23% | 35% | +12 pts |
| CommonsenseQA | PaLM 540B | 73% | 79% | +6 pts |
The gains are largest on tasks that require multiple steps (arithmetic, symbolic). They’re smaller on single-step commonsense tasks where the answer is more directly accessible.
What doesn’t work:
Scale is non-negotiable. Below ~100B parameters, you will see zero benefit or regression. If you’re working with smaller models (7B, 13B), vanilla CoT prompting as described here won’t help you. (Later work — fine-tuning smaller models on chain-of-thought data — helps somewhat, but that’s a different technique.)
The 57% accuracy on GSM8K sounds good, but it means the model fails 43% of the time. On problems that require more than 4-5 steps, accuracy drops sharply. The model can lose track of its own chain, produce an arithmetic error early that propagates, or generate a reasoning chain that looks coherent but is factually wrong.
There’s also a sensitivity issue: the quality of the exemplars matters. Poorly written chains (ambiguous steps, wrong intermediate values) can degrade performance.
Finally, and this was noted by follow-up researchers: the reasoning chains generated by CoT aren’t always causally related to the answer. Sometimes the model generates a plausible-looking chain and then outputs the answer it would have output anyway, with the chain being a post-hoc rationalization rather than the actual computation. This was later called “unfaithful reasoning.”
So what?
If you’re building systems that need multi-step reasoning — code generation, math tutoring, diagnostic systems, analytical QA — chain-of-thought prompting should be your first experiment. The cost is near-zero: write 8 good examples with reasoning chains, add them to your prompt, and test. If you’re on a model >70B parameters (or using a strong API model like GPT-4 or Claude), you’ll almost certainly see improvement on complex reasoning tasks.
The important practical decision: do you need the reasoning chain in the output? Sometimes you want the model’s reasoning visible (for debugging, for user trust, for building an audit trail). Sometimes you just need the answer. For the latter, you can try “zero-shot chain-of-thought” (add “Let’s think step by step” to the question — a technique from a follow-up paper), which elicits reasoning without requiring you to write exemplars.
Chain-of-thought prompting didn’t teach models to reason — it discovered that they already could, and found the key to unlock it.
Connections
- transformer — the architecture that makes in-context learning possible
- lora — the other lever for task-specific behavior (weights vs. prompting)
- scaling-laws — CoT’s emergence threshold is a direct consequence of scaling dynamics
- rlhf — post-training alignment that builds on CoT-capable models
Citation
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2201.11903