Language models are paradoxically capable and incapable at the same time. A GPT-class model can explain quantum mechanics, write poetry, and debug code — but it can’t tell you today’s date, gets basic arithmetic wrong, and hallucinates facts it should be certain about. These failures aren’t mysteries: a text-completion model can’t look at a clock, a neural network is a bad calculator, and parametric memory has a cutoff. The obvious fix: give the model tools. The hard part: how do you train it to use them? Toolformer (Schick et al., Meta AI, 2023) shows you can do it with almost no human supervision — the model teaches itself when and how to call external APIs.
The core idea
The analogy: Imagine you’re a librarian who needs to learn to use a new reference system. One approach: someone sits with you and, for every possible question type, tells you when to consult the reference system and what query to use. Another approach: you have access to the reference system and a pile of tasks. You try inserting reference lookups at various points while doing your tasks, and keep the ones that made your responses more accurate. You learn from the outcome.
Toolformer takes the second approach. Given a corpus of text and a set of tools (calculator, search engine, calendar, etc.), it:
- Generates candidate API calls at plausible positions in the text
- Evaluates whether each API call actually helped (by comparing perplexity of the text with vs. without the API result inserted)
- Keeps only the helpful API calls
- Fine-tunes the model on the resulting self-annotated dataset
No human needs to label “use the calculator here” or “search for this query here.” The model bootstraps tool use from outcome quality.
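The four-step pipeline above can be sketched as a single loop. This is an illustrative sketch, not the paper's code: `lm_loss`, `sample_candidate_calls`, and the `tools` dict stand in for a real language model, the few-shot candidate sampler, and the actual API wrappers.

```python
# Sketch of the Toolformer self-annotation loop. All names here are
# illustrative stubs: `lm_loss(text, pos)` would be the LM's cross-entropy
# on the tokens after `pos`, and `sample_candidate_calls` the few-shot
# candidate generator described below.

def annotate_corpus(texts, tools, lm_loss, sample_candidate_calls, threshold=1.0):
    """Return self-annotated examples whose API calls reduce LM loss."""
    kept = []
    for text in texts:
        # Step 1: candidate positions and queries come from the model itself
        for pos, tool_name, query in sample_candidate_calls(text):
            result = tools[tool_name](query)  # Step 2: execute the real call
            plain = text
            augmented = (text[:pos]
                         + f"[{tool_name}({query})->{result}] "
                         + text[pos:])
            # Step 3: keep the call only if it lowers loss on the continuation
            if lm_loss(plain, pos) - lm_loss(augmented, pos) >= threshold:
                kept.append(augmented)
    return kept  # Step 4: fine-tune the model on `kept`
```

The fine-tuning in step 4 is ordinary language-model fine-tuning on the `kept` texts; the only novelty is how those texts were produced.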
The mechanism, step by step
API call format:
Toolformer uses a simple in-line API call format embedded in text:
The population of France is [QA("What is France's population?")->67 million] 67 million people.
Or for a calculator:
At 12% annual interest, $1000 grows to [Calculator(1000 * 1.12)->1120] $1120 after one year.
The format is [ToolName(input)->output]. The model learns to generate this syntax when it would be helpful.
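Because the syntax is so regular, it can be recognized with a single regular expression. A minimal parser, assuming exactly the bracketed format shown above:

```python
import re

# Matches [ToolName(input)->output]; the input and output groups are
# non-greedy so multiple calls in one text are split correctly.
CALL_RE = re.compile(r"\[(\w+)\((.*?)\)\s*->\s*(.*?)\]")

def parse_api_calls(text):
    """Return (tool, input, output) triples for every inline call in `text`."""
    return CALL_RE.findall(text)

calls = parse_api_calls(
    "At 12% interest, $1000 grows to [Calculator(1000 * 1.12)->1120] $1120."
)
# calls == [('Calculator', '1000 * 1.12', '1120')]
```

At inference time, the serving loop watches for this pattern in the decoded stream, executes the call, splices the result back in, and resumes generation.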
Self-supervised dataset generation:
Step 1: Sample candidate API call positions
For each training text, sample positions where an API call might be appropriate. This is done by prompting the model with few-shot examples of good API calls and asking it to generate candidate insertions.
Step 2: Execute the API calls
Run each candidate API call against the real tool (search engine, calculator, etc.) and get the result.
Step 3: Filter by utility
For each candidate call at position i in the text, compare:
- L⁻ = cross-entropy loss on the tokens after position i, using the model without the API result
- L⁺ = cross-entropy loss on the tokens after position i, with the API call and its result inserted in context
Keep the API call only if:
L⁻ − L⁺ ≥ τ
where τ is a filtering threshold. Translation: the API call is “helpful” only if it reduces the loss on the continuation — i.e., the model predicts the rest of the text more accurately when it has the tool’s result than when it doesn’t.
Step 4: Fine-tune
Fine-tune the model on the filtered dataset containing only helpful API calls. The model learns: in contexts like this, call this tool with this kind of query.
ORIGINAL TEXT: "The speed of light is approximately 299,792,458 meters per second."
CANDIDATE API CALL INSERTION:
"The speed of light is approximately [Calculator(299792458/1000)->299792.458]
299,792 km/s..."
UTILITY CHECK:
Without API result: model predicts "299,792" with perplexity X
With API result "299792.458" in context: model predicts "299,792" with perplexity Y
If Y < X - threshold: keep this example
KEPT EXAMPLE TEACHES: When generating precise numerical facts, consider inserting
a Calculator call
Tools included:
The paper implements Toolformer with five tools:
- Calculator: arithmetic expressions (handles the LLM arithmetic weakness)
- Wikipedia Search: retrieves the intro paragraph of a Wikipedia article
- Q&A system: a fine-tuned QA model for factual questions
- Neural MT: machine translation system
- Calendar: returns the current date
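Two of these tools are simple enough to show in full. Toy versions of the Calculator and Calendar, with interfaces assumed for illustration (the search, QA, and MT tools would be wrapped behind the same string-in, string-out shape):

```python
import ast
import datetime
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> str:
    """Safely evaluate a basic arithmetic expression (no eval on raw model output)."""
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expr, mode="eval").body))

def calendar() -> str:
    """Return today's date — the one fact a frozen model can never know."""
    return datetime.date.today().strftime("%A, %B %d, %Y")
```

Note the design constraint: every tool takes a string and returns a string, so results can be spliced directly into the `[Tool(input)->output]` format regardless of what happens behind the API.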
Find the instinct
Why does the loss filter work?
The key insight is that a language model’s loss on the text that comes after a fact is a proxy for how well the model knows that fact. If a text contains “The population of Paris is 2.16 million, which makes it…”, a model that has seen that fact in training assigns high probability to “2.16 million”. If the model doesn’t know the fact and has to predict it anyway, the loss spikes.
When an API call returns the correct fact and inserts it into context, the model’s uncertainty about subsequent text drops — the loss on the continuation decreases. A search result for “Paris population” that returns “2.16 million” tells the model exactly what to predict next. This creates a clear signal: whenever the loss drop exceeds the filtering threshold, the tool was useful here.
The filter is self-referential in an elegant way: the model evaluates its own uncertainty to determine which tool calls reduce that uncertainty. No external judge needed.
The bootstrapping problem:
One challenge: the model starts with very few examples of API call usage, so how does it generate good candidate calls in the first place? The paper solves this with a few-shot prompting trick: each API has 2-4 hand-written examples showing the format and typical usage. The model uses these as templates to generate candidates, then filters them. The hand-written examples don’t need to be numerous — they just need to show the format.
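A bootstrapping prompt for the QA tool might look like the template below. The exemplars are illustrative, not the paper's exact prompts; note that at candidate-generation time the call has no `->result` part — results are only filled in during Step 2.

```python
# Hypothetical few-shot prompt for Step 1 candidate generation.
# Two hand-written exemplars are enough to convey the bracket syntax.
QA_PROMPT = """\
Your task is to add question-answering API calls to a piece of text.
Examples:
Input: Joe Biden was born in Scranton.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton.
Input: Coca-Cola is known for its red logo.
Output: Coca-Cola is known for [QA("What is Coca-Cola known for?")] its red logo.
Input: {text}
Output:"""

prompt = QA_PROMPT.format(text="The Nile flows through eleven countries.")
```

The model's completion of this prompt yields the candidate-annotated text; the filter then decides which of its insertions survive.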
Why this is different from ReAct and function calling:
ReAct prompts an existing model to interleave thoughts and actions at inference time. Toolformer trains the tool-calling behavior into the model weights. After fine-tuning, the model generates API call syntax naturally when appropriate — it doesn’t need special prompting or an external orchestration loop. The behavior is internalized.
Function calling (as in OpenAI’s function calling API) is an engineering-level version of this idea: the model generates structured JSON calls to external functions, executed by the API layer. Toolformer shows the underlying principle: you can teach a model to recognize when external tools would help and what to pass to them.
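The contrast is easiest to see side by side. The same calculator request in Toolformer's inline syntax versus a function-calling-style structured payload (field names below are generic illustrations, not any specific vendor's schema):

```python
import json

# Toolformer: the call lives inside the generated text itself.
inline = "[Calculator(1000 * 1.12)->1120]"

# Function calling: the model emits a structured object as a separate
# message, and an orchestration layer executes it and returns the result.
structured = json.dumps({
    "name": "calculator",
    "arguments": {"expression": "1000 * 1.12"},
})
```

The inline form needs no orchestration loop — decoding pauses at `->`, the tool runs, and generation continues — while the structured form trades that simplicity for schema validation and multi-turn control.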
Results
Evaluated on a diverse set of downstream benchmarks after fine-tuning GPT-J 6.7B with Toolformer:
| Task | GPT-J (no tools) | Toolformer | GPT-3 (no tools) |
|---|---|---|---|
| LAMA (factual knowledge) | 34.6% | 37.7% | 41.3% |
| Math QA (word problems) | 15.2% | 22.0% | 20.8% |
| TempLAMA (time-sensitive facts) | 20.1% | 32.9% | 36.1% |
| Multi-lingual QA | 48.2% | 54.8% | 50.8% |
| CCNet (language modeling) | baseline | no regression | — |
Toolformer (6.7B) outperforms GPT-3 (175B) on math word problems (22.0% vs 20.8%), despite being roughly 26× smaller. The calculator tool alone explains most of this: arithmetic accuracy improves from 15.2% to 22.0% because the model now delegates calculation to a deterministic external process.
Critically: language modeling perplexity doesn’t degrade. Fine-tuning on tool-use data doesn’t hurt the model’s core language generation ability. The tool calls appear in contexts where they help; elsewhere, they don’t appear.
What doesn’t work:
- The five tools are limited and domain-specific. Generalizing to arbitrary APIs requires more examples per tool.
- Complex multi-step tool use (chaining multiple calls) is not handled well
- The model sometimes generates syntactically valid but semantically wrong API calls (e.g., poor search queries)
- The self-supervised approach has biases: it tends to generate API calls for things the model is already uncertain about, which may not match what humans want
Practical implications
Toolformer’s contribution is demonstrating that tool use can be trained via self-supervision at scale. You don’t need expensive human annotation of “use the calculator here.” You need:
- A set of tools with APIs
- A few examples per tool (few-shot)
- A corpus of text
- The filtering algorithm
This makes tool-augmented language model training accessible to researchers without large annotation budgets. The descendants — Gorilla (API calling for ML libraries), ToolBench (training on thousands of tools), and the OpenAI function calling fine-tuning — are all variations on this self-supervised annotation approach.
Connections
- tool-use-agents — the capability this paper trains via self-supervised learning
- in-context-learning — few-shot examples per tool bootstrap the candidate call generation
- fine-tuning — Toolformer fine-tunes the model on a self-annotated dataset of useful tool calls
- react-reasoning-and-acting — ReAct uses tools at inference time via prompting; Toolformer trains tool use into model weights
- rag-retrieval-augmented-generation — RAG provides structured retrieval; Toolformer learns to call retrieval as a tool
- attention-is-all-you-need — the underlying architecture being extended with tool-calling capability
Citation
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. https://arxiv.org/abs/2302.04761