Concepts: chain-of-thought | in-context-learning | reasoning-rl | tool-use-agents
Builds on: chain-of-thought-prompting | self-consistency-chain-of-thought-reasoning
Leads to: react-reasoning-and-acting | toolformer-language-models-teach-themselves-tool-use
Chain of thought prompting gives an LLM a single path from problem to answer. Self-consistency samples many such paths independently, then majority-votes. Both leave a critical flaw untouched: the model commits to each step greedily, left-to-right, with no way to look ahead, no way to say “this direction is a dead end,” no way to back up and try something else.
For most question-answering tasks, this is fine — the path to the answer is short and forgiving. But some problems aren’t like that. Game of 24 asks you to combine four numbers with arithmetic to reach 24. The first operation you choose can make the rest trivially easy or mathematically impossible — and there’s no way to know until you’ve tried. CoT with GPT-4 solves it 4% of the time. You read that right. 4%.
Tree of Thoughts (ToT) fixes this by giving the LLM one capability it’s always lacked: the ability to change its mind.
The core idea
The analogy: A chess grandmaster doesn’t play by making the single best-looking move at each moment and hoping for the best. They maintain a tree of candidate lines — “if I go here, they go there, then I can go…” — evaluate which lines look promising after a few steps of lookahead, abandon dead-end lines early, and play the move that heads toward the best reachable position. They search, not just respond.
Every LLM before ToT was playing a different game: one where it commits to the first word of its first move before it knows where the move is going. Chain-of-thought made it commit to steps rather than single tokens, which helped. Self-consistency made it take many independent shots and pick the most popular answer, which helped more. But neither method can backtrack. If your first step is wrong, you’re stuck with it.
ToT generalizes CoT from a chain to a tree. The model:
- Proposes multiple candidate “thoughts” (partial solutions) at each step
- Evaluates each candidate — using the LLM itself to score how promising each partial solution looks
- Searches the resulting tree with BFS or DFS, keeping only the best candidates at each level, backtracking when a branch is hopeless
The key insight: the LLM plays two roles simultaneously. It’s the generator that proposes thoughts, and it’s the evaluator that scores them. No external verifier, no learned reward model — just the same model, wearing two hats.
STANDARD CoT (a chain):
Problem
└── Step 1 → Step 2 → Step 3 → Answer
(one path, no backtracking)
SELF-CONSISTENCY (parallel chains):
Problem
├── Chain 1 → answer_1
├── Chain 2 → answer_2
└── Chain 3 → answer_3
majority_vote(answer_1, answer_2, answer_3)
(many paths, no pruning, vote at the end only)
TREE OF THOUGHTS (a tree with lookahead):
Problem
├── Thought A (score: sure)
│ ├── Thought A1 (score: sure) → Answer ✓
│ └── Thought A2 (score: impossible) ✗ pruned
├── Thought B (score: maybe)
│ └── Thought B1 (score: maybe) → ...
└── Thought C (score: impossible) ✗ pruned early
(search with evaluation, prune bad branches, backtrack)
The mechanism, step by step
Any ToT instantiation answers four questions:
1. What is a “thought”?
A thought is a coherent intermediate step toward the solution. Its granularity depends on the problem:
- In Game of 24: one arithmetic equation (“13 - 9 = 4, remaining: 4, 4, 10”)
- In creative writing: a paragraph-level plan
- In 5×5 crosswords: one word filled in
As the paper puts it: “a thought should be ‘small’ enough so that LMs can generate promising and diverse samples… yet ‘big’ enough so that LMs can evaluate its prospect toward problem solving.” A thought too small can’t be evaluated; a thought too large is too unpredictable to generate well.
2. How to generate candidate thoughts?
Two strategies, depending on the thought space:
- Sample independently — draw $k$ thoughts i.i.d. from a chain-of-thought prompt: $z^{(j)} \sim p_\theta^{\text{CoT}}(z_{i+1} \mid s)$ for $j = 1 \ldots k$. Works when the thought space is rich (creative writing, where diversity matters).
- Propose in one prompt — ask the LM to list $k$ different candidates at once: $[z^{(1)}, \ldots, z^{(k)}] \sim p_\theta^{\text{propose}}(z_{i+1}^{(1 \ldots k)} \mid s)$. Better when thoughts are constrained (equations, single words) and independent sampling would produce duplicates.
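The two modes can be sketched as prompt builders. The template wording below is mine, not the paper's — a minimal illustration of the structural difference, with the LM call itself left out:

```python
def sample_prompt(problem: str, steps_so_far: str) -> str:
    # i.i.d. sampling: ask for ONE next thought; call the LM k times
    # at temperature > 0 and collect the k independent samples.
    return f"{problem}\nSteps so far:\n{steps_so_far}\nNext step:"

def propose_prompt(problem: str, steps_so_far: str, k: int = 5) -> str:
    # Sequential proposal: ask for k distinct candidates in a single call,
    # which avoids duplicates when the thought space is small.
    return (f"{problem}\nSteps so far:\n{steps_so_far}\n"
            f"Propose {k} different possible next steps, one per line:")

print(propose_prompt("Game of 24 with 4 9 10 13", "(none yet)", k=5))
```

The point is that sampling spends one LM call per thought and relies on temperature for diversity, while proposing spends one call per expansion and relies on the instruction for diversity.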
3. How to evaluate states?
This is the clever part. The LM is prompted to assess each partial solution:
- Score independently: “Given the current state, is it sure/maybe/impossible to reach a valid answer from here?” Each state gets a value $V(p_\theta, S)(s) \sim p_\theta^{\text{value}}(v \mid s)$.
- Vote comparatively: Given several candidates, prompt the LM: “Which of these partial solutions is most promising?” The voted-for state wins: $V(p_\theta, S)(s) = \mathbb{1}[s = s^*]$, where $s^* \sim p_\theta^{\text{vote}}(s^* \mid S)$.
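Before ranking, the verbal ratings are mapped to numbers and the samples for each state are combined. The paper uses a similar heuristic mapping for Game of 24; treat the exact constants below as illustrative assumptions:

```python
# Illustrative mapping from verbal rating to numeric value; "sure" dominates
# so that confidently-good states outrank any pile of "maybe"s.
RATING_VALUE = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}

def aggregate_value(sampled_ratings):
    """Combine several sampled evaluations of one state into a single score."""
    return sum(RATING_VALUE.get(r, 0.0) for r in sampled_ratings)

print(aggregate_value(["sure", "sure", "maybe"]))  # → 41.0
```

Sampling the value prompt more than once (e.g., three times per state) smooths out individual bad judgments at the cost of extra LM calls.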
For Game of 24, the value prompt does a few lookahead steps: “Can you reach 24 from 4, 4, 10? Let’s see: 4+4=8, 8+10=18 — no. 4×4=16, 16+10=26 — no. (10-4)×4=24 — yes! → sure.” This commonsense elimination dramatically cuts the search space.
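For Game of 24 specifically, that lookahead can even be made exact: a short exhaustive search over the remaining numbers plays the same role as the LM's value call. A model-free sketch (`reaches_24` is my name for it, not the paper's):

```python
def reaches_24(nums, target=24, eps=1e-6):
    """Exhaustive lookahead: can the remaining numbers be combined with
    +, -, *, / to reach the target? True maps to "sure", False to
    "impossible" (there is no "maybe" once the check is exact)."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            a, b = nums[i], nums[j]
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            # Try every binary operation on (a, b), guarding division by zero.
            results = [a + b, a - b, a * b] + ([a / b] if abs(b) > eps else [])
            if any(reaches_24(rest + [r], target, eps) for r in results):
                return True
    return False

print(reaches_24([4, 4, 10]))    # state after "13 - 9 = 4" → True ("sure")
print(reaches_24([10, 13, 36]))  # state after "4 * 9 = 36" → False (prune)
```

The LM's advantage over this exact check is generality: the same value-prompt pattern works on tasks where no cheap exhaustive verifier exists.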
4. What search algorithm?
- BFS (breadth-first search): At each depth level, keep only the top-$b$ highest-valued states. The frontier update is $S_t \leftarrow \arg\max_{S \subseteq S'_t,\, |S| = b} \sum_{s \in S} V(p_\theta, S'_t)(s)$, where $S'_t = \{[s, z] : s \in S_{t-1},\, z \in G(p_\theta, s, k)\}$ is the expanded frontier.
Used for Game of 24 (depth 3, breadth b = 5) — BFS suits shallow trees where the evaluation signal is strong.
- DFS (depth-first search): Explore the most promising state first until reaching a full solution or a dead end (the evaluator rates the state “impossible”), then backtrack. Used for crosswords, where depth is variable and backtracking is essential.
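The BFS variant is short enough to sketch end to end. Here is a minimal, model-free version: `propose` and `value` stand in for the two LM prompts (below they are deterministic toy functions that search for a target string), and all names are illustrative rather than the paper's code:

```python
def tot_bfs(root, propose, value, breadth, depth):
    """Generic ToT-BFS: expand every frontier state, score all candidates,
    keep only the top-`breadth` by value, repeat for `depth` levels."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for s in frontier for c in propose(s)]
        # The arg-max frontier update: keep the b highest-valued states.
        candidates.sort(key=value, reverse=True)
        frontier = candidates[:breadth]
    return max(frontier, key=value)

# Toy instantiation: a "thought" appends one letter; the evaluator counts
# how many positions already match a target word.
target = "cat"
propose = lambda s: [s + ch for ch in "abcdefghijklmnopqrstuvwxyz"]
value = lambda s: sum(a == b for a, b in zip(s, target))

print(tot_bfs("", propose, value, breadth=2, depth=3))  # → cat
```

In the real system, `propose` and `value` would each be an LM call, so the loop above also makes the cost structure visible: calls grow with breadth × depth.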
Numeric walkthrough: Game of 24
Input: 4, 9, 10, 13. Goal: reach 24 using each number exactly once with +, −, ×, ÷.
Step 1 — Propose k=5 candidate first operations:
Candidate A: 13 - 9 = 4 → remaining: {4, 4, 10}
Candidate B: 10 + 9 = 19 → remaining: {4, 13, 19}
Candidate C: 4 * 9 = 36 → remaining: {10, 13, 36}
Candidate D: 13 + 9 = 22 → remaining: {4, 10, 22}
Candidate E: 10 - 4 = 6 → remaining: {6, 9, 13}
Evaluate via lookahead:
Candidate A — remaining {4, 4, 10}:
(10 - 4) × 4 = 6 × 4 = 24 ✓
→ score: "sure" ← kept
Candidate B — remaining {4, 13, 19}:
19 + 4 = 23 — close, not 24. 19 × 4 = 76 — too big. Not obviously reachable, not clearly dead.
→ score: "maybe" ← kept
Candidate C — remaining {10, 13, 36}:
36 - 13 = 23, 23 + 10 = 33. No.
36 / (13 - 10) = 12 — all numbers used, 12 ≠ 24.
→ score: "impossible" ← pruned
Candidate D — remaining {4, 10, 22}:
22 + 4 = 26, 22 - 4 = 18, 22 × 4 = 88 — none helpful.
→ score: "impossible" ← pruned
Candidate E — remaining {6, 9, 13}:
13 - 9 = 4, 4 × 6 = 24 ✓
→ score: "sure" ← kept
Step 2 — Extend the “sure” branch A — remaining {4, 4, 10}:
Next op: 10 - 4 = 6 → remaining: {4, 6}
4 × 6 = 24 ✓
Full solution recovered: (10 − (13 − 9)) × 4 = 24 ✓
With breadth b = 5, ToT solves 74% of hard Game of 24 problems. CoT (no tree, no evaluation) solves 4% of the same problems. The pruning alone — cutting impossible branches before they waste further computation — is doing most of the work.
Results
| Method | Game of 24 | Creative Writing (GPT-4 score) | Crosswords (game %) |
|---|---|---|---|
| IO prompt | 7.3% | 6.19 | 0% |
| CoT prompt | 4.0% | 6.93 | 1% |
| CoT-SC (k=100) | 9.0% | — | — |
| IO + Refine (k=10) | 27% | 7.67 | — |
| ToT (b=5) | 74% | 7.56 | 20% |
For creative writing, humans prefer ToT over CoT in 41/100 blind comparisons (vs. 21/100 preferring CoT). The other 38 are judged equally coherent.
The crossword result — 20% complete-game success vs. 0% for IO — represents a qualitative change. Completing a 5×5 grid means getting all 10 words right; a single wrong word fails the entire game. ToT with DFS and backtracking is the difference between solving it at all and never solving it.
Error analysis on Game of 24: Around 60% of CoT failures happen at step 1 — the very first operation is invalid or leads to an unreachable state. The model commits to three words (e.g., “4 + 9”) before checking whether any valid 24-path remains. This is the greedy commitment problem in stark numerical form.
What breaks: ToT is expensive. Game of 24 with b = 5 uses roughly 100 LM calls per problem (5 proposals × 3 value samples per step × 3 steps = ~45 calls, plus expansion). CoT uses 1. This is a 30–100× cost multiplier. The authors frame ToT as a research framework rather than a deployment-optimized system — real applications would need smarter call batching and cheaper value heuristics.
The LM-as-evaluator also has limits: value estimates are heuristic and can be wrong. A state scored “impossible” might actually lead to a solution. DFS with too-aggressive pruning can cut off valid paths. The method works best when the LM can do reliable short lookahead within a single prompt — which depends heavily on the task and model capability.
So what?
If you’re building systems that need reasoning on problems with combinatorial structure — mathematical derivations, code debugging, multi-step planning, constraint satisfaction — ToT is the framework to think with first. The practical question is whether the 30–100× token cost is worth it for your use case. For problems where CoT success rates are below 10%, almost certainly yes. For tasks where CoT already works well at 70%+, the overhead is hard to justify.
The bigger lesson is architectural. chain-of-thought-prompting gave the LLM a scratchpad. self-consistency-chain-of-thought-reasoning gave it multiple scratchpads and a vote. ToT gives it a deliberative process — the ability to evaluate partial work and redirect effort. These three papers trace a clear arc: single-pass generation → parallel generation → structured search. What’s next in this arc is LLMs that search over tool calls and environment states, not just reasoning steps. react-reasoning-and-acting is already there — and the core insight is the same one ToT introduces: intermediate outputs should be evaluated, not just generated.
“LM inference as tree search” turns out to be more than a clever trick for Game of 24. It’s a mental model that lets you design reasoning systems for any problem where greedy generation gets stuck — which is most hard problems worth solving.
Connections
- chain-of-thought — ToT generalizes CoT from a chain to a tree
- in-context-learning — ToT operates entirely at inference time, no fine-tuning
- reasoning-rl — deliberate tree search is a precursor to RL-trained reasoning
- tool-use-agents — ToT’s generate-evaluate-search loop anticipates agentic frameworks
- sampling — thought generation uses sampling strategies (i.i.d. or sequential proposal)
- chain-of-thought-prompting — the linear reasoning chain ToT generalizes
- self-consistency-chain-of-thought-reasoning — parallel chains with majority vote; ToT replaces the vote with tree search and pruning
- react-reasoning-and-acting — interleaves reasoning and acting; builds on the deliberate reasoning paradigm
Citation
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. https://arxiv.org/abs/2305.10601