Summary

HDPO (Hierarchical Decoupled Policy Optimization) trains Metis, a multimodal agent, to use tools selectively rather than reflexively. The key finding: existing RL approaches for tool efficiency fail because they mix accuracy and tool-use rewards before advantage normalization, so the efficiency signal is mathematically swamped by accuracy variance. HDPO fixes this by computing advantages independently — the accuracy channel is normalized over all rollouts, the efficiency channel only over correct rollouts — then combining them at the loss level. The resulting model, Metis-8B-RL, cuts the tool-invocation rate from ~98% to ~2% while achieving state-of-the-art accuracy on visual reasoning benchmarks.

Key Claims

  • Coupled reward RL is provably broken for multi-objective optimization: when α is small, the efficiency signal reduces to O(α) and is dominated by accuracy variance in the normalization denominator
  • Conditional advantage estimation: efficiency advantages computed only within the qualifying set Q of correct rollouts — incorrect rollouts receive zero efficiency advantage, preventing gaming via premature stopping
  • Implicit curriculum: early in training, Q is near-empty, so accuracy dominates; late in training, Q is large, so efficiency is actively trained; the result is a two-phase curriculum with no manual scheduling
  • Tool use rate: 98% → 2% with accuracy improvement, not degradation
  • Data curation: filter hallucinated environmental dynamics (non-executable code with correct outputs) and tool-unnecessary samples (pass@8=1 on base model without tools)
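The coupled-vs-decoupled normalization claim above can be illustrated numerically. This is a toy sketch with made-up rewards, not the paper's code: with a single coupled channel R = R_acc + α·R_tool, the efficiency component of any normalized advantage is bounded by α / std(R), and std(R) is dominated by the binary accuracy rewards.

```python
import statistics

alpha = 0.05
R_acc = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]    # binary accuracy rewards
R_tool = [1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.2, 0.0]  # 1/(T+1) if correct, else 0

# Coupled: one reward channel, one normalization.
coupled = [a + alpha * t for a, t in zip(R_acc, R_tool)]
sd_coupled = statistics.pstdev(coupled)

# Largest possible efficiency contribution to the coupled advantage:
# O(alpha), swamped by the accuracy variance in the denominator.
eff_scale = alpha / sd_coupled

# Decoupled (HDPO-style): normalize the efficiency channel on its own,
# here over the correct rollouts only.
q = [t for a, t in zip(R_acc, R_tool) if a == 1.0]
A_tool = [(t - statistics.mean(q)) / statistics.pstdev(q) for t in q]
```

With these numbers, `eff_scale` is roughly 0.1 while the decoupled efficiency advantages span a full unit-variance range, which is the gap the decoupling is meant to close.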

Methods

Architecture: Qwen3-VL-8B as base model, equipped with code execution, text search, and image search tools. Multi-turn agentic rollouts.

Accuracy reward: R_acc = 0.9·R_ans + 0.1·R_fmt (binary correctness from LLM judge + format compliance). Standard GRPO advantage over all G rollouts.
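A minimal sketch of the accuracy channel, using the 0.9/0.1 weights from the definition above. The zero-variance guard is an assumption; real GRPO implementations typically add a small epsilon to the denominator instead.

```python
import statistics

def accuracy_reward(ans_correct: bool, fmt_ok: bool) -> float:
    # R_acc = 0.9 * R_ans + 0.1 * R_fmt (binary judge verdict + format check)
    return 0.9 * float(ans_correct) + 0.1 * float(fmt_ok)

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Standard GRPO advantage: normalize over all G rollouts in the group.
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0:  # degenerate group: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]
```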

Efficiency reward: R_tool = 1/(T+1) if correct, else 0, where T is the number of tool calls. Advantage computed only over Q = {correct rollouts}, and only when |Q| ≥ 2.
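The conditional advantage estimation can be sketched as below (function name and the zero-variance guard are my assumptions). Incorrect rollouts get exactly zero efficiency advantage, and the whole channel is skipped when |Q| < 2.

```python
import statistics

def tool_efficiency_advantages(correct: list[bool], tool_calls: list[int]) -> list[float]:
    """Conditional advantage over Q = {correct rollouts}.

    correct[i]:    whether rollout i answered correctly
    tool_calls[i]: T, the number of tool invocations in rollout i
    """
    # R_tool = 1/(T+1) if correct, else 0
    r_tool = [1.0 / (t + 1) if c else 0.0 for c, t in zip(correct, tool_calls)]

    Q = [i for i, c in enumerate(correct) if c]
    advantages = [0.0] * len(correct)  # incorrect rollouts stay at zero

    if len(Q) < 2:  # |Q| >= 2 required to normalize within the qualifying set
        return advantages

    q_rewards = [r_tool[i] for i in Q]
    mu = statistics.mean(q_rewards)
    sd = statistics.pstdev(q_rewards)
    if sd == 0:
        return advantages
    for i in Q:
        advantages[i] = (r_tool[i] - mu) / sd
    return advantages
```

Normalizing only within Q is what blocks gaming via premature stopping: a rollout that skips tools but answers wrong never enters the efficiency comparison.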

Loss: L_HDPO = w_acc·L_GRPO(A_acc) + w_tool·L_GRPO(A_tool)
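Combining at the loss level can be sketched as below, assuming a standard PPO-style clipped surrogate for L_GRPO. The clip range eps and the weights w_acc, w_tool are hypothetical placeholders; the note above does not give their values.

```python
def hdpo_loss(logp_ratio: list[float], a_acc: list[float], a_tool: list[float],
              w_acc: float = 1.0, w_tool: float = 1.0, eps: float = 0.2) -> float:
    """L_HDPO = w_acc * L_GRPO(A_acc) + w_tool * L_GRPO(A_tool).

    Each channel keeps its own independently normalized advantages and gets
    its own clipped surrogate; mixing happens only in the final weighted sum.
    """
    def clipped_term(ratio: float, adv: float) -> float:
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        return -min(ratio * adv, clipped * adv)  # negated surrogate (to minimize)

    L_acc = sum(clipped_term(r, a) for r, a in zip(logp_ratio, a_acc)) / len(a_acc)
    L_tool = sum(clipped_term(r, a) for r, a in zip(logp_ratio, a_tool)) / len(a_tool)
    return w_acc * L_acc + w_tool * L_tool
```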

Data curation: (1) Execute all trajectory code in a sandbox and discard trajectories whose code fails (hallucinated environmental dynamics). (2) Remove samples where the base model achieves pass@8=1 without tools — these don't require tool use.
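Both filters reduce to a keep/drop predicate per sample; the sketch below uses hypothetical names and assumes pass@8 is supplied as a raw success count out of k attempts.

```python
def keep_sample(code_executes_ok: bool, passes_without_tools: int, k: int = 8) -> bool:
    """Sketch of the two curation filters.

    code_executes_ok:     all code in the trajectory ran in the sandbox
    passes_without_tools: base-model successes out of k tool-free attempts
    """
    if not code_executes_ok:
        # Filter (1): correct output from non-executable code means the model
        # hallucinated the environment's behavior.
        return False
    if passes_without_tools == k:
        # Filter (2): pass@k = 1 without tools, so tool use is unnecessary.
        return False
    return True
```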

Failure modes

  • Hard distributions where |Q| is consistently small: efficiency barely trains
  • pass@8=1 filter can over-prune tasks that need tools for reliability but happen to be guessable
  • Efficiency reward counts tool calls, not quality — one bad search and one good search are penalized equally

Connections

Citation

arXiv:2604.08545

@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}