Chain-of-thought prompting made language models better at reasoning. Tool use (like in WebGPT) made them better at acting in the world. But these two capabilities were studied separately, and systems that used them separately suffered from obvious failure modes: pure reasoning without action hallucinates facts, and pure action without reasoning blindly follows plans that aren’t working. ReAct (Reasoning + Acting), by Yao et al. at Princeton in 2022, proposed the obvious synthesis — interleave them — and showed it worked dramatically better than either alone.
The core idea
The analogy: A detective investigating a case doesn’t just reason in their head and then announce a conclusion (pure chain-of-thought). They don’t just take actions at random either (pure action). They think about what they know, decide to look something up, look it up, think about what they found, decide what to do next, take another action — a continuous alternation of thought and action. ReAct formalizes this loop for language models.
The format is simple. Instead of prompting the model to produce just a chain of thoughts, or just a sequence of actions, you prompt it to interleave three types of tokens. At each step t, the trajectory appends a (thought_t, action_t, observation_t) triple:
- Thought (thought_t): internal reasoning about what to do or what was just observed
- Action (action_t): a call to an external tool (search, lookup, etc.)
- Observation (observation_t): the result returned by the tool
This creates a trajectory like:
Question: "What is the population of the city where the Eiffel Tower is located?"
Thought: I need to find where the Eiffel Tower is located.
Action: Search[Eiffel Tower location]
Observation: The Eiffel Tower is located in Paris, France.
Thought: Now I need to find the population of Paris.
Action: Search[Paris population]
Observation: Paris has a population of approximately 2.16 million in the city proper.
Thought: I now have the information I need.
Action: Finish[2.16 million]
No single step is impressive on its own. The power is in the loop: each observation updates the belief state, each thought guides the next action, each action grounds the reasoning in real retrieved information.
The mechanism, step by step
Prompting:
ReAct is a prompting technique, not a fine-tuned model (in the original paper). You provide few-shot examples (typically 2-6) of complete Thought/Action/Observation trajectories, then present the new question. The model generates thoughts and actions; you execute the actions and provide observations; repeat until the model calls Finish.
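The generate-execute cycle above can be sketched as a short driver loop. This is a minimal illustration, not the paper's code: `llm` stands in for any completion function that stops generating before the next Observation line, and `tools` is an assumed dict mapping action names to callables.

```python
import re

def react_loop(llm, tools, question, exemplars, max_steps=8):
    """Minimal ReAct driver: alternate model generation and tool execution.

    llm(prompt) -> str: text up to (but not including) the next Observation.
    tools: dict mapping action names (e.g. "Search") to callables.
    """
    # The prompt starts with few-shot trajectories, then the new question.
    context = exemplars + f"\nQuestion: {question}\n"
    for _ in range(max_steps):
        step = llm(context)            # model emits "Thought: ...\nAction: ..."
        context += step                # thoughts/actions accumulate in context
        match = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if match is None:
            break                      # malformed step; give up
        name, arg = match.groups()
        if name == "Finish":
            return arg                 # final answer
        observation = tools[name](arg) # execute the tool
        context += f"\nObservation: {observation}\n"
    return None
```

Note that the environment, not the model, writes the Observation lines: the model is stopped before each one, the tool result is appended, and generation resumes.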
Supported action spaces:
- HotpotQA/FEVER (Wikipedia QA): `Search[entity]`, `Lookup[term]` (within the current passage), and `Finish[answer]`
- ALFWorld (household tasks): navigation and manipulation commands (`go to kitchen`, `pick up knife`, etc.)
- WebShop (online shopping): `search[query]`, `click[element]`, `buy[item]`
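Because the model emits actions as plain text, the harness has to parse each `Name[argument]` string and reject anything outside the environment's action space. A hedged sketch (the `ACTION_SPACES` whitelists below are my own illustration, loosely mirroring the paper's setups):

```python
import re

# Hypothetical per-environment action whitelists for illustration.
ACTION_SPACES = {
    "wikipedia": {"Search", "Lookup", "Finish"},
    "webshop": {"search", "click", "buy"},
}

def parse_action(text, env):
    """Parse 'Name[argument]' and check it belongs to env's action space."""
    m = re.fullmatch(r"(\w+)\[(.*)\]", text.strip())
    if m is None:
        raise ValueError(f"not a well-formed action: {text!r}")
    name, arg = m.groups()
    if name not in ACTION_SPACES[env]:
        raise ValueError(f"{name} is not a valid {env} action")
    return name, arg
```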
The key design choice: thoughts are not just decoration. They become part of the model's context window. When the model writes "The tower is in Paris, I need to find Paris's population," that thought is appended to the context for the next step. This creates an explicit scratchpad that accumulates facts, disambiguates past actions, and maintains the current plan, addressing a key failure mode of pure action generation, where the model can "forget" earlier findings.
PURE CHAIN-OF-THOUGHT (no actions):
Question → [reasoning in model's head] → Answer
Problem: model fabricates facts it doesn't know
PURE ACTING (no reasoning):
Question → Action → Observation → Action → ... → Answer
Problem: rigid planning, can't recover when action fails
REACT:
Question → Thought → Action → Observation → Thought → Action → ... → Finish
Reasoning shapes action selection; observations update reasoning; errors recoverable
Find the instinct
The hallucination problem with pure CoT:
Chain-of-thought prompting dramatically improved LLM reasoning on math and logic. But for knowledge-intensive tasks — questions about real-world facts — it exposed a flaw: when the model doesn’t know something, it reasons as if it knows it, often confabulating plausible-sounding but wrong intermediate steps.
“ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API.”
The fix is that observations (retrieved facts) serve as “ground truth anchors.” When the model searches and gets back a passage, that passage is in the context verbatim. Subsequent reasoning is constrained by real information rather than parametric guesses.
The error recovery problem with pure action:
Imitation learning and reinforcement learning agents also struggle on long-horizon tasks because there’s no mechanism for recognizing that the current plan is failing. If action 3 took the agent to the wrong room, there’s nothing explicitly representing “I’m in the wrong place” — the model just continues generating actions based on the observation, which may not trigger course correction.
ReAct’s thoughts provide explicit error recovery:
“Thought: The action didn’t find what I expected. The target might be in a different location. Let me try…”
The thought step allows the model to notice discrepancies between expectation and observation and explicitly revise the plan.
Why this works with just few-shot prompting:
Large pretrained LLMs have been trained on text that includes reasoning + action patterns — problem-solving examples, Wikipedia articles that answer questions by consulting sources, forum posts where people search before responding. The ReAct format taps into these latent patterns. You don’t need to fine-tune; you just need to show the model the format with a few examples.
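Concretely, "showing the model the format" just means assembling a prompt from an instruction, a few complete trajectories, and the new question. A sketch under stated assumptions: the exemplar below is invented for illustration, and the exact instruction wording is not the paper's.

```python
# Hypothetical exemplar trajectory; real prompts use 1-6 of these.
EXEMPLAR = """Question: What is the capital of the country where Mount Fuji is?
Thought: I need to find which country Mount Fuji is in.
Action: Search[Mount Fuji]
Observation: Mount Fuji is a volcano in Japan.
Thought: Mount Fuji is in Japan, whose capital is Tokyo.
Action: Finish[Tokyo]"""

def build_prompt(exemplars, question):
    """Assemble a ReAct prompt: instruction, few-shot trajectories, new question."""
    header = ("Answer the question by interleaving Thought, Action, and "
              "Observation steps. Actions are Search[entity], Lookup[term], "
              "or Finish[answer].\n\n")
    # End with "Thought:" so the model continues in the demonstrated format.
    return header + "\n\n".join(exemplars) + f"\n\nQuestion: {question}\nThought:"
```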
Results
On HotpotQA (multi-hop question answering):
- CoT only: 29.4% EM
- Act only: 25.7% EM
- ReAct: 35.1% EM (5.7 points over CoT)
- ReAct + CoT self-consistency: 40.4% EM
On FEVER (fact verification):
- CoT only: 56% accuracy
- Act only: 55%
- ReAct: 60%
On ALFWorld (interactive household tasks):
- BUTLER (imitation learning): 26% success
- ReAct (2-shot): 71% success (45 points absolute)
On WebShop (online shopping agent):
- IL + RL baseline: 59.9% average score
- ReAct: 66.6% average score (6.7 points over the baseline)
ReAct outperforms on all four tasks, with the biggest gains on the interactive decision-making tasks where error recovery matters most.
What doesn’t work:
- Context window pressure: long trajectories eventually overflow the context, cutting off earlier observations
- Action space limitations: only works with actions that can be executed programmatically and return text
- Thought quality: the model can generate subtly wrong reasoning that leads it confidently toward incorrect answers
- Latency: each tool call adds latency; multi-hop questions require many calls
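The context-pressure problem has a common workaround (not from the paper, my own sketch): keep every thought and action, but elide the bodies of all but the most recent observations, since old retrieved passages are usually already summarized in the thoughts that followed them.

```python
def compress_trajectory(steps, keep_last=2, stub="[observation elided]"):
    """Crude mitigation for context overflow: keep all Thought/Action steps,
    but replace every Observation body except the most recent keep_last.

    steps: list of (kind, text) pairs, kind in {"Thought", "Action", "Observation"}.
    """
    obs_idx = [i for i, (kind, _) in enumerate(steps) if kind == "Observation"]
    cut = set(obs_idx[:-keep_last]) if keep_last else set(obs_idx)
    return [(kind, stub) if i in cut else (kind, text)
            for i, (kind, text) in enumerate(steps)]
```

The trade-off is obvious: if an early observation contained a fact the model never restated in a thought, eliding it loses information for good.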
Practical implications
ReAct is the foundation of almost every "agentic" LLM system built since 2022. LangChain's "Agent" abstraction, AutoGPT, the OpenAI Assistants API function-calling loop, and the ReAct-style reasoning in Claude's tool use all implement the core ReAct loop. The specific implementation varies, but the structure is always the same: think about what to do, do it, observe the result, think again.
When to use ReAct:
- Tasks requiring real-time or private information (not in the model’s training data)
- Multi-hop questions where the answer requires combining multiple facts
- Interactive environments (booking, shopping, computer use)
- Any system where errors in one step should be diagnosable and correctable
Connections
- tool-use-agents — the technique introduced in this paper; ReAct is the foundational agentic pattern
- chain-of-thought — ReAct extends CoT by interleaving reasoning with real-world actions
- in-context-learning — ReAct is applied via few-shot prompting, a form of in-context learning
- rag-retrieval-augmented-generation — RAG provides the retrieval backend that ReAct’s Search actions often use
- toolformer-language-models-teach-themselves-tool-use — Toolformer teaches tool use differently (self-supervised), but addresses the same capability gap
- chain-of-thought-prompting — ReAct extends CoT by grounding reasoning in real observations
- deepseek-r1-reasoning-via-reinforcement-learning — R1 internalizes extended reasoning chains; ReAct externalizes them via tool calls
Citation
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629