This path traces how LLMs moved beyond single-shot text generation into structured reasoning, self-consistency, deliberate search, and autonomous tool use. Each step addresses a limitation of the previous: prompting alone breaks on complex tasks, consistency requires sampling, search requires trees, action requires grounding, tools require self-teaching, knowledge requires retrieval, and correctness requires verification.
Step 1 — Chain-of-Thought Prompting
Start here. Chain-of-thought prompting is the observation that asking a language model to “think step by step” before answering dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks. The insight: intermediate reasoning steps make the computation explicit in the token stream, where the model can condition on them. This is not fine-tuning — it’s a prompting technique that works on frozen models above a certain scale. Every subsequent step in this path builds on or reacts to this foundation.
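Because it is purely a prompting technique, the entire intervention fits in a few lines. A minimal sketch (the helper name `build_prompt` is mine; the trigger phrase "Let's think step by step" is the published zero-shot variant):

```python
def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Build a prompt for a frozen model; the trigger phrase is the only change."""
    prompt = f"Q: {question}\nA:"
    if chain_of_thought:
        # Appending this phrase elicits intermediate reasoning steps in the
        # token stream, which the model then conditions on before answering.
        prompt += " Let's think step by step."
    return prompt

print(build_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

Few-shot chain-of-thought instead prepends worked examples with explicit reasoning; the mechanism is the same either way: reasoning appears in the context window, so the answer can depend on it.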
Step 2 — Self-Consistency
Chain-of-thought generates one reasoning path. But for most problems, multiple valid reasoning paths lead to the same correct answer — and sampling them gives you an ensemble for free. Self-consistency samples K chain-of-thought traces from the model and takes a majority vote over final answers. Incorrect reasoning paths tend to produce inconsistent wrong answers; correct ones converge. This simple inference-time trick improves over single chain-of-thought by 5–15% on math benchmarks, with no training required.
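The voting step is trivial once the traces are sampled. A minimal sketch, using hand-written stand-ins for traces sampled at temperature > 0 (the trace texts and the `answer:` convention are illustrative, not from the paper):

```python
from collections import Counter

def majority_answer(traces):
    """Extract each trace's final answer and take a majority vote."""
    answers = [t.rsplit("answer:", 1)[1].strip() for t in traces]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical chain-of-thought traces sampled from the same prompt.
traces = [
    "17 apples plus 41 apples... answer: 58",
    "41 + 17 = 57, so... answer: 57",        # arithmetic slip
    "tens: 50, ones: 8... answer: 58",
    "17 + 41 = 58... answer: 58",
    "miscounting the tens... answer: 68",    # a different wrong answer
]
print(majority_answer(traces))  # "58"
```

The example shows why the method works: the two wrong traces disagree with each other, while the correct traces all converge on 58.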
Step 3 — Tree of Thoughts
Self-consistency samples multiple independent paths. Tree of Thoughts turns reasoning into a deliberate search: a tree where each node is a partial reasoning state, expansion proposes next steps, and a heuristic evaluates whether the path is promising. The model can backtrack. This matches how humans actually solve hard problems — exploring, evaluating, pruning — and outperforms chain-of-thought on tasks that require planning, where greedy generation routinely gets stuck in locally plausible but globally wrong paths.
Step 4 — ReAct: Reasoning and Acting
Tree of Thoughts reasons over internal states. ReAct extends reasoning into the external world. By interleaving “thought” tokens and “action” tokens in the generation trace, the model can reason about what action to take, execute it (search, lookup, calculate), observe the result, and continue reasoning. The action-observation loop grounds the chain of thought in real retrieved information — mitigating the hallucination problem that pure chain-of-thought inherits from the model’s parametric memory.
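The loop itself is simple to sketch. In the real method the policy is an LLM continuing the trace; here it is a stand-in function, and the single `calculate` tool, the `toy_policy`, and the trace format are all illustrative assumptions:

```python
def react_loop(policy, tools, question, max_steps=5):
    """Interleave Thought / Action / Observation until the policy finishes."""
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, action, arg = policy(trace)
        trace += f"Thought: {thought}\nAction: {action}[{arg}]\n"
        if action == "finish":
            return arg, trace
        obs = tools[action](arg)          # ground the trace in a real result
        trace += f"Observation: {obs}\n"
    return None, trace

tools = {"calculate": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def toy_policy(trace):
    """Stand-in for the LLM: act once, then read the observation and answer."""
    if "Observation:" not in trace:
        return "I need the product first.", "calculate", "17 * 23"
    result = trace.rsplit("Observation: ", 1)[1].strip()
    return "The observation gives the answer.", "finish", result

answer, trace = react_loop(toy_policy, tools, "What is 17 * 23?")
print(answer)  # "391"
```

The key property is visible in the trace: the final answer is copied from an observation produced by a tool, not generated from parametric memory.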
Step 5 — Toolformer
ReAct uses prompting to teach tool use. Toolformer trains it in. The model is taught to insert API calls (calculator, search, calendar, translation) into its own generation, and is trained on self-generated examples of when those calls help and when they don’t — a call is kept for training only if conditioning on its result makes the following tokens easier to predict. The key result: the model learns to call tools sparingly and accurately, improving zero-shot performance across downstream tasks without degrading its core language modeling ability. This is how tool use moves from a prompting trick into a trained capability.
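At inference time, the calls the trained model emits inline must be intercepted and executed, with the result spliced back into the text. A minimal sketch of that fill-in step (the bracket syntax, `->` separator, and `TOOLS` registry are illustrative; the paper uses a special-token format):

```python
import re

# Hypothetical tool registry; Toolformer's actual tools included a
# calculator, QA system, search engine, calendar, and translator.
TOOLS = {"Calculator": lambda expr: str(round(eval(expr, {"__builtins__": {}}), 2))}

CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_calls(text):
    """Replace [Tool(args)] annotations with [Tool(args) -> result]."""
    def run(match):
        tool, args = match.group(1), match.group(2)
        return f"[{tool}({args}) -> {TOOLS[tool](args)}]"
    return CALL.sub(run, text)

print(execute_calls("Out of 1400 participants, 400 [Calculator(400/1400)] passed."))
# Out of 1400 participants, 400 [Calculator(400/1400) -> 0.29] passed.
```

Because the model was trained only on calls whose results reduced its own prediction loss, it learns *when* such an insertion is worth making, not just how to format one.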
Step 6 — RAG: Retrieval-Augmented Generation
Toolformer and ReAct use retrieval as one tool among many. RAG makes retrieval a first-class part of the architecture. A dense retriever fetches relevant documents from a corpus, which are prepended to the generation context. The retriever and generator are trained end-to-end. This addresses the core limitation of parametric LLMs: knowledge is frozen at training time. With RAG, the model can access up-to-date information, domain-specific corpora, and private documents without retraining.
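The retrieve-then-prepend pipeline can be sketched end to end. A minimal version, substituting a toy bag-of-words similarity for the trained dense retriever (RAG uses DPR embeddings; the corpus, prompt template, and function names here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real RAG uses a trained dense encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm or 1.0)

def rag_prompt(query, corpus, k=2):
    """Fetch the top-k documents and prepend them to the generation context."""
    ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(d)),
                    reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Amazon river discharges more water than any other river.",
    "Chain-of-thought prompting elicits reasoning in large language models.",
    "The Nile is often cited as the longest river in the world.",
]
print(rag_prompt("Which river is the longest river?", corpus, k=1))
```

Updating the model's knowledge is now a corpus edit, not a training run: swap in a new document list and the same frozen generator answers from it.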
Step 7 — DeepSeek-R1
The endpoint of the path: what happens when you replace prompting-based reasoning with trained reasoning? DeepSeek-R1 applies RL with verifiable reward signals — correctness on math, code, and logic — to train a model that generates long internal reasoning traces before answering. In the R1-Zero variant, no chain-of-thought demonstrations are needed; the RL signal alone teaches the model to reason by rewarding correct final answers (the released R1 adds a small cold-start supervised stage for readability). The emergent behaviors (backtracking, self-correction, reflection) resemble Tree of Thoughts but arise from training rather than inference-time search.
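What makes the reward "verifiable" is that it is computed by a checker, not by another model. A minimal sketch of an outcome reward for math, assuming a `\boxed{...}` answer convention; R1's actual training (GRPO over groups of sampled completions, plus format rewards) is considerably more involved:

```python
import re

def verifiable_reward(completion: str, gold: str) -> float:
    """Outcome reward for RL: 1.0 iff the final boxed answer matches the
    reference, 0.0 otherwise. No learned reward model is involved, so the
    signal cannot be gamed the way a preference model can."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold else 0.0

# A hypothetical long reasoning trace with self-verification before answering.
trace = ("<think>17 + 41... tens give 50, ones give 8, so 58. "
         "Re-check: 41 + 17 = 58. Consistent.</think> "
         "The answer is \\boxed{58}.")
print(verifiable_reward(trace, "58"))  # 1.0
```

Nothing inside `<think>` is rewarded directly: only the final answer is scored, and the long traces with backtracking and re-checking emerge because they raise the probability of that answer being correct.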