Concepts: tool-use-agents | code-generation | chain-of-thought | in-context-learning
Builds on: react-reasoning-and-acting (ReAct’s Thought/Action/Observation loop, but with Python as the action format)
Builds on: toolformer-language-models-teach-themselves-tool-use (Toolformer teaches a model to call APIs; CodeAct flips the question and lets the model write code that calls them itself)
Leads to: the open-source CodeActAgent stack and the broader “code-as-action” pattern in agentic frameworks (OpenHands, smolagents, Aider)

The conventional way to give a language model tools is to define them as functions with JSON schemas: search(query: str) -> str, weather(city: str) -> dict. The model picks one, the agent runtime parses the JSON, executes the function, and returns a string observation. CodeAct (Wang et al., ICML 2024) asks: why route everything through a JSON-schema bottleneck? Pre-trained models are already strong Python programmers. Let them emit Python directly, for example result = search(query) followed by if "error" in result: result = weather(city), and execute the code in a sandboxed interpreter. The action space is now Turing complete, the model can compose tools with ordinary control flow, and a single action can do the work of several JSON calls.

The core idea

The analogy: Pre-tool-using LLMs were like office workers limited to a checkbox form: “Pick one action from this list.” JSON tool-calling expanded the form but kept the constraint — one tool, one call, one response. CodeAct hands them a terminal and says: write a Python script. They can chain function calls, loop over results, build composite outputs, parse errors, and self-correct — all in one “action.”

The action format flips from:

{"name": "search", "arguments": {"query": "Eiffel Tower height"}}

…to:

result = search("Eiffel Tower height")
height_m = parse_height(result)
print(f"Tower height: {height_m} meters")

The Python interpreter executes this and stdout becomes the observation. The model sees the printed output and any errors, while variable bindings stay alive in the interpreter session, which persists across turns.
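
The execution loop described above can be sketched in a few lines. This is a toy illustration, not the paper’s implementation: the PersistentInterpreter class and the 330 placeholder value are made up here, and a bare exec() runs in-process with no sandboxing at all.

```python
import contextlib
import io
import traceback


class PersistentInterpreter:
    """Toy stand-in for a CodeAct-style executor (hypothetical, NOT sandboxed)."""

    def __init__(self):
        # One namespace shared by every action, so variables survive across turns.
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute one code action and return the text the model would observe."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            # A failure is still returned as an observation, not raised to the caller.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


interp = PersistentInterpreter()
# Turn 1 defines a variable (330 is a placeholder value) and prints it.
obs1 = interp.run("height_m = 330\nprint(f'Tower height: {height_m} meters')")
# Turn 2 reuses height_m without redefining it.
obs2 = interp.run("print(height_m * 2)")
```

Keeping a single namespace dict is what makes the session persistent; a real deployment would put the same loop behind an isolated kernel or container.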

Walkthrough

A typical CodeAct trajectory (paraphrased from the paper’s data analysis example):

USER: "Plot the per-quarter revenue from this CSV and show me the YoY growth."

THOUGHT: I need to load the CSV, compute quarterly aggregates, and plot.

ACTION (code):
  import pandas as pd
  import matplotlib.pyplot as plt
  df = pd.read_csv("revenue.csv")
  df['quarter'] = pd.to_datetime(df['date']).dt.to_period('Q')
  q = df.groupby('quarter')['revenue'].sum()
  print(q)

OBSERVATION:
  2023Q1    1.20e6
  2023Q2    1.35e6
  ...
  2024Q4    2.10e6

THOUGHT: Now compute YoY: (q[t] - q[t-4]) / q[t-4].

ACTION (code):
  yoy = (q - q.shift(4)) / q.shift(4)
  print(yoy.dropna())
  q.plot(kind='bar', title='Quarterly Revenue')
  plt.savefig('revenue.png')
  print('Figure saved to revenue.png')

OBSERVATION:
  2024Q1   0.18
  2024Q2   0.22
  ...
  Figure saved to revenue.png

THOUGHT: Done.
ACTION: response("YoY growth ranges from 18-25%; Q4 strongest. Plot saved.")

Compare this to JSON tool-calling: the same workflow would need roughly six separate tool calls (read_csv, aggregate, compute_yoy, plot, save, respond), each round-tripping through the schema validator and the model’s context. CodeAct does it in two code actions plus a final response.
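
The round-trip difference can be made concrete with a small sketch. Everything here is hypothetical: load_revenue and yoy_growth are stub tools with made-up numbers, and dispatch is a minimal stand-in for an agent runtime’s JSON handler.

```python
import json


# Hypothetical tools, stubbed with fixed data for illustration.
def load_revenue():
    return {"2023Q1": 100.0, "2024Q1": 118.0}


def yoy_growth(current, previous):
    return (current - previous) / previous


TOOLS = {"load_revenue": load_revenue, "yoy_growth": yoy_growth}


def dispatch(call_json: str):
    """Run one JSON tool call: one tool, one result, one round trip."""
    call = json.loads(call_json)
    return TOOLS[call["name"]](**call["arguments"])


# JSON style: the model must see the first result before it can write the second call.
json_round_trips = 0
revenue = dispatch('{"name": "load_revenue", "arguments": {}}')
json_round_trips += 1
growth_json = dispatch(json.dumps({
    "name": "yoy_growth",
    "arguments": {"current": revenue["2024Q1"], "previous": revenue["2023Q1"]},
}))
json_round_trips += 1

# Code style: a single action composes both tools with an ordinary variable.
code_action = """
revenue = load_revenue()
growth = yoy_growth(revenue["2024Q1"], revenue["2023Q1"])
"""
namespace = dict(TOOLS)
exec(code_action, namespace)
code_round_trips = 1
growth_code = namespace["growth"]
```

Both paths compute the same value; the code action simply avoids sending the intermediate result back through the model.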

What’s clever — find the instinct

The key recognition: pre-trained LLMs have seen an enormous amount of Python and comparatively little JSON tool-call data. Code that uses popular libraries is heavily represented in their training corpora; bespoke tool schemas are not. When you ask a model to emit JSON tool-calls, you’re asking it to operate in a domain it has barely seen. When you ask it to emit Python, you’re asking it to do what it was already best at.

“Existing studies on LLM agents typically prompt LLMs to produce actions in pre-defined formats, e.g., JSON or text, which is limited by their constrained action space.”

The second clever move: use the interpreter’s state as memory. Variables defined in earlier code blocks persist. The model can write df = ... in turn 1 and reference df in turn 5 without re-reading the file. Without this, the agent loses everything between actions and has to reload context every turn.

“We propose to use executable Python code to consolidate LLM agents’ actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations.”

The third clever move: errors become observations. When parse_height(result) fails because the search returned a sentence rather than a number, the interpreter raises a ValueError. The traceback becomes the observation. The model sees what went wrong and writes a corrected version. JSON tool-calling offers only a weaker version of this: a malformed call typically fails at schema validation, and whatever error string the runtime returns rarely carries the line-level detail of a traceback.
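
A minimal sketch of that error-as-observation loop follows. The run_action helper, the search result string, and the “revised” second action are all invented for illustration; the second action stands in for what a model might write after reading the traceback.

```python
import re
import traceback


def run_action(code: str, namespace: dict) -> str:
    """Execute a code action; return 'ok' or the traceback text as the observation."""
    try:
        exec(code, namespace)
        return "ok"
    except Exception:
        return traceback.format_exc()


# Hypothetical search result: a sentence, not a bare number.
namespace = {"result": "The tower is 330 meters tall.", "re": re}

# First attempt: float() on a whole sentence raises ValueError.
first_obs = run_action("height_m = float(result)", namespace)

# Revised attempt: extract the first number before converting it.
revised = r"height_m = float(re.search(r'\d+(?:\.\d+)?', result).group())"
second_obs = run_action(revised, namespace)
```

The first observation contains the ValueError message together with the offending input, which is the kind of detail a model can use to repair its own code.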

“We show that CodeAct can dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions, demonstrating self-debug capability that is unique to CodeAct.”

Does it work? What breaks?

API-Bank (the agent benchmark, 17 LLMs evaluated):

Action format            Avg success rate
JSON                     53.8%
Text-format tool calls   56.2%
CodeAct (Python)         64.3%

The roughly 10-point gain over JSON (64.3% vs. 53.8%) holds across closed (GPT-4, Claude) and open models (Llama 2, Mistral, CodeLlama). The gap widens for compositional tasks: when a query needs to combine 3+ tools, CodeAct wins by 20+ points absolute.

The paper also fine-tunes Llama-2 and Mistral on CodeActInstruct, a 7K-example multi-turn dataset they curate, producing CodeActAgent. Compared to ReAct-tuned baselines, CodeActAgent gets +5-10 points on multi-step agentic tasks while preserving the base model’s general capability.

What breaks:

  • Sandbox security. You’re running model-generated Python. The sandbox needs to be airtight: no filesystem escape, no network unless allowed, no resource exhaustion. The paper uses a sandboxed Jupyter kernel; in production you need stricter isolation (Docker, Firecracker, or gVisor).
  • Long sessions blow context. The interpreter state is persistent, but the conversation history (code + outputs) is in the LLM’s context window. Long analyses hit context limits.
  • Code style for LLMs. The model emits whatever style it learned during pretraining. Some models default to Jupyter-style (one expression per cell), others write whole modules. Inconsistency hurts reliability.
  • Tool exposure. Python is so flexible that exposing too many libraries leads to confused tool selection. The paper recommends curating which libraries the agent has access to.
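
One common mitigation for the context-limit problem above is to truncate long observations before they are added to the conversation history. The sketch below is an assumption of this write-up rather than something the paper prescribes; truncate_observation and its max_chars default are hypothetical.

```python
def truncate_observation(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of a long observation and elide the middle.

    Assumes max_chars >= 2. The tail is kept because tracebacks put the
    actual error message on their last lines.
    """
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n... [{omitted} characters omitted] ...\n{text[-half:]}"


# A long printout, such as a model printing an entire DataFrame.
long_output = "\n".join(f"row {i}" for i in range(10_000))
short = truncate_observation(long_output, max_chars=200)
```

The interpreter state itself is untouched by this; only the text the model re-reads on later turns gets shorter.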

So what?

CodeAct’s code-as-action format is the pattern behind the modern wave of code-executing agents: OpenHands, smolagents, Aider’s /code mode, Claude’s code_execution tool, and ChatGPT’s “Advanced Data Analysis.” When Saikat’s agentic VLM pipeline grows from “extract POIs from this image” to “extract POIs, dedupe against existing DB, snap to nearest road, validate against ground truth,” CodeAct is the right action format: Python lets the agent do all four steps in one composable program, and an error at any step comes back as a traceback the agent can act on.

For the practitioner, the operational rule is: if your agent needs to compose tools (do A then use A’s output as B’s input), prefer code-as-action over JSON tool-calling. JSON tool-calls win when the action space is small and fixed and the deployment cannot safely execute arbitrary code (a customer-facing chatbot that should never run code). Code-as-action wins in most other settings.

“CodeAct … shows up to 20% higher success rate compared to widely used alternatives.”

The deeper principle: the right action format for an LLM is the format the LLM was trained on. The action space is not a property of the agent framework — it’s a property of the model’s pretraining data. Match those, and the agent works. Force a model to operate in a format it has barely seen, and you’ll fight it forever.

Citation

arXiv:2402.01030

Wang, X., Chen, Y., Yuan, L., Zhang, Y., Li, Y., Peng, H., & Ji, H. (2024). Executable Code Actions Elicit Better LLM Agents. ICML 2024. https://arxiv.org/abs/2402.01030