Summary
HDPO (Hierarchical Decoupled Policy Optimization) trains Metis, a multimodal agent, to use tools selectively rather than reflexively. The key finding: existing RL approaches for tool efficiency fail because they mix accuracy and tool-use rewards before advantage normalization, causing the efficiency signal to be mathematically swamped by accuracy variance. HDPO fixes this by computing advantages independently — accuracy channel normalized over all rollouts, efficiency channel normalized only over correct rollouts — then combining at the loss level. The resulting model, Metis-8B-RL, reduces tool invocations from ~98% to ~2% while achieving state-of-the-art accuracy on visual reasoning benchmarks.
Key Claims
- Coupled-reward RL is provably broken for multi-objective optimization: with a mixed reward R = R_acc + α·R_tool, a small weight α leaves the efficiency signal at O(α) after group normalization, where it is dominated by the accuracy variance in the denominator
- Conditional advantage estimation: efficiency advantages computed only within the qualifying set Q of correct rollouts — incorrect rollouts receive zero efficiency advantage, preventing gaming via premature stopping
- Implicit curriculum: early in training, Q is near-empty, so accuracy dominates; late in training, Q is large, so efficiency is actively trained — a two-phase curriculum without any manual scheduling
- Tool use rate: 98% → 2% with accuracy improvement, not degradation
- Data curation: filter hallucinated environmental dynamics (trajectories whose code fails to execute yet still reports correct outputs) and tool-unnecessary samples (pass@8=1 for the base model without tools)
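The swamping effect in the first claim can be checked numerically: mixing rewards before group normalization leaves the efficiency component's contribution to the advantage at O(α), while normalizing each channel independently keeps it at unit scale. A minimal sketch (the reward values and α are made up for illustration):

```python
import numpy as np

G = 8
correct = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)  # accuracy reward per rollout
tool_calls = np.array([3, 0, 2, 1, 4, 2, 0, 1])            # T per rollout
r_tool = np.where(correct == 1, 1.0 / (tool_calls + 1), 0.0)

alpha = 0.05
# Coupled: mix rewards BEFORE normalizing -> efficiency signal is O(alpha)
mixed = correct + alpha * r_tool
adv_coupled = (mixed - mixed.mean()) / (mixed.std() + 1e-8)

# Decoupled (HDPO-style): normalize each channel independently
adv_acc = (correct - correct.mean()) / (correct.std() + 1e-8)
Q = correct == 1  # qualifying set of correct rollouts
adv_tool = np.zeros(G)
if Q.sum() >= 2:
    q = r_tool[Q]
    adv_tool[Q] = (q - q.mean()) / (q.std() + 1e-8)

# Among correct rollouts, coupled advantages barely separate
# few-tool from many-tool trajectories; decoupled advantages do.
spread_coupled = adv_coupled[Q].max() - adv_coupled[Q].min()
spread_decoupled = adv_tool[Q].max() - adv_tool[Q].min()
print(spread_coupled, spread_decoupled)  # coupled spread is tiny; decoupled is O(1)
```

On this toy group the decoupled spread is roughly 30× the coupled one, matching the O(α) analysis.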
Methods
Architecture: Qwen3-VL-8B as base model, equipped with code execution, text search, and image search tools. Multi-turn agentic rollouts.
Accuracy reward: R_acc = 0.9·R_ans + 0.1·R_fmt (binary correctness from LLM judge + format compliance). Standard GRPO advantage over all G rollouts.
Efficiency reward: R_tool = 1/(T+1) if correct, else 0, where T is the number of tool calls. Advantage computed only over Q = {correct rollouts}, and only when |Q| ≥ 2.
Loss: L_HDPO = w_acc·L_GRPO(A_acc) + w_tool·L_GRPO(A_tool)
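The loss-level combination above can be sketched as two GRPO surrogates over the same rollouts, each driven by its own advantage channel. This assumes sequence-level (token-summed) log-probabilities and standard PPO-style clipping; the weight values are illustrative, not taken from the paper:

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped policy-gradient surrogate, one scalar advantage per rollout."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.minimum(unclipped, clipped).mean()

def hdpo_loss(logp_new, logp_old, adv_acc, adv_tool, w_acc=1.0, w_tool=0.5):
    # Two GRPO losses on the SAME rollouts, each with its own advantage
    # channel, combined at the loss level rather than at the reward level.
    return (w_acc * grpo_surrogate(logp_new, logp_old, adv_acc)
            + w_tool * grpo_surrogate(logp_new, logp_old, adv_tool))
```

Because adv_tool is zero outside Q, the efficiency term never pushes gradient through incorrect rollouts, which is what blocks the premature-stopping exploit.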
Data curation: (1) Execute all code in sandbox, discard trajectories with failures. (2) Remove samples where base model achieves pass@8=1 without tools — these don’t require tool use.
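The two filters can be sketched as a single keep/discard predicate. The `sample["answer"]` field and the exact-match check stand in for the paper's LLM judge and are hypothetical; note that pass@8 = 1 means at least one of 8 tool-free samples is correct under the standard pass@k definition:

```python
def keep_sample(sample, base_model_answers, sandbox_ok):
    """Apply both curation filters.

    base_model_answers: 8 answers sampled from the base model WITHOUT tools.
    sandbox_ok: True iff every code block in the trajectory executed in a sandbox.
    """
    # (1) Drop hallucinated environment dynamics: code that fails to execute
    #     means any reported tool outputs were fabricated by the model.
    if not sandbox_ok:
        return False
    # (2) Drop tool-unnecessary samples: pass@8 = 1, i.e. the base model
    #     already answers correctly in at least one tool-free attempt.
    if any(ans == sample["answer"] for ans in base_model_answers):
        return False
    return True
```

A stricter reading of the filter (discard only when all 8 attempts succeed) would keep more data; the sketch follows the pass@k convention.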
Failure modes
- Hard distributions where |Q| is consistently small: efficiency barely trains
- pass@8=1 filter can over-prune tasks that need tools for reliability but happen to be guessable
- Efficiency reward counts tool calls, not quality — one bad search and one good search are penalized equally
Connections
- training-language-models-to-follow-instructions-with-human-feedback — RLHF background; the reward coupling problem is related to the alignment-tax problem (mixing objectives)
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO sidesteps reward design; HDPO fixes reward design instead
- clip-learning-transferable-visual-models — multimodal foundation models that HDPO-style training builds on
- tool-use-agents — selective, meta-cognitive tool invocation is the core capability trained
- chain-of-thought — multi-turn agentic rollouts require structured reasoning before tool calls
- in-context-learning — base model capabilities that HDPO refines without altering the underlying LLM
- alignment — reducing unnecessary tool use while maintaining accuracy is an alignment objective
Citation
@article{yan2026metis,
title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
journal={arXiv preprint arXiv:2604.08545},
year={2026}
}