Summary
HDPO (Hierarchical Decoupled Policy Optimization) trains Metis, a multimodal agent, to use tools selectively rather than reflexively. The key finding: existing RL approaches for tool efficiency fail because they mix accuracy and tool-use rewards before advantage normalization, causing the efficiency signal to be mathematically swamped by accuracy variance. HDPO fixes this by computing advantages independently — accuracy channel normalized over all rollouts, efficiency channel normalized only over correct rollouts — then combining at the loss level. The resulting model, Metis-8B-RL, reduces tool invocations from ~98% to ~2% while achieving state-of-the-art accuracy on visual reasoning benchmarks.
Key Claims
- Coupled-reward RL is provably broken for multi-objective optimization: with a mixed reward R = R_acc + α·R_tool, a small weight α leaves the efficiency signal at O(α) after group normalization, where it is dominated by the accuracy variance in the denominator
- Conditional advantage estimation: efficiency advantages computed only within the qualifying set Q of correct rollouts — incorrect rollouts receive zero efficiency advantage, preventing gaming via premature stopping
- Implicit curriculum: early in training, Q is near-empty, so accuracy dominates; late in training, Q is large, so efficiency is actively trained — a two-phase curriculum without any manual scheduling
- Tool use rate: 98% → 2% with accuracy improvement, not degradation
- Data curation: filter hallucinated environmental dynamics (trajectories whose code fails to execute yet still reports correct outputs) and tool-unnecessary samples (pass@8=1 for the base model without tools)
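The swamping effect in the first claim can be checked numerically: mixing rewards before group normalization leaves the efficiency component's contribution to the advantage at O(α), while normalizing each channel independently keeps it at unit scale. A minimal sketch (the reward values and α are made up for illustration):

```python
import numpy as np

G = 8
correct = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)  # accuracy reward per rollout
tool_calls = np.array([3, 0, 2, 1, 4, 2, 0, 1])            # T per rollout
r_tool = np.where(correct == 1, 1.0 / (tool_calls + 1), 0.0)

alpha = 0.05
# Coupled: mix rewards BEFORE normalizing -> efficiency signal is O(alpha)
mixed = correct + alpha * r_tool
adv_coupled = (mixed - mixed.mean()) / (mixed.std() + 1e-8)

# Decoupled (HDPO-style): normalize each channel independently
adv_acc = (correct - correct.mean()) / (correct.std() + 1e-8)
Q = correct == 1  # qualifying set of correct rollouts
adv_tool = np.zeros(G)
if Q.sum() >= 2:
    q = r_tool[Q]
    adv_tool[Q] = (q - q.mean()) / (q.std() + 1e-8)

# Among correct rollouts, coupled advantages barely separate
# few-tool from many-tool trajectories; decoupled advantages do.
spread_coupled = adv_coupled[Q].max() - adv_coupled[Q].min()
spread_decoupled = adv_tool[Q].max() - adv_tool[Q].min()
print(spread_coupled, spread_decoupled)  # coupled spread is tiny; decoupled is O(1)
```

On this toy group the decoupled spread is roughly 30× the coupled one, matching the O(α) analysis.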
Methods
Architecture: Qwen3-VL-8B as base model, equipped with code execution, text search, and image search tools. Multi-turn agentic rollouts.
Accuracy reward: R_acc = 0.9·R_ans + 0.1·R_fmt (binary correctness from LLM judge + format compliance). Standard GRPO advantage over all G rollouts.
Efficiency reward: R_tool = 1/(T+1) if correct, else 0, where T is the number of tool calls. Advantage computed only over Q = {correct rollouts}, and only when |Q| ≥ 2.
Loss: L_HDPO = w_acc·L_GRPO(A_acc) + w_tool·L_GRPO(A_tool)
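The loss-level combination above can be sketched as two GRPO surrogates over the same rollouts, each driven by its own advantage channel. This assumes sequence-level (token-summed) log-probabilities and standard PPO-style clipping; the weight values are illustrative, not taken from the paper:

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped policy-gradient surrogate, one scalar advantage per rollout."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.minimum(unclipped, clipped).mean()

def hdpo_loss(logp_new, logp_old, adv_acc, adv_tool, w_acc=1.0, w_tool=0.5):
    # Two GRPO losses on the SAME rollouts, each with its own advantage
    # channel, combined at the loss level rather than at the reward level.
    return (w_acc * grpo_surrogate(logp_new, logp_old, adv_acc)
            + w_tool * grpo_surrogate(logp_new, logp_old, adv_tool))
```

Because adv_tool is zero outside Q, the efficiency term never pushes gradient through incorrect rollouts, which is what blocks the premature-stopping exploit.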
Data curation: (1) Execute all code in sandbox, discard trajectories with failures. (2) Remove samples where base model achieves pass@8=1 without tools — these don’t require tool use.
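The two filters can be sketched as a single keep/discard predicate. The `sample["answer"]` field and the exact-match check stand in for the paper's LLM judge and are hypothetical; note that pass@8 = 1 means at least one of 8 tool-free samples is correct under the standard pass@k definition:

```python
def keep_sample(sample, base_model_answers, sandbox_ok):
    """Apply both curation filters.

    base_model_answers: 8 answers sampled from the base model WITHOUT tools.
    sandbox_ok: True iff every code block in the trajectory executed in a sandbox.
    """
    # (1) Drop hallucinated environment dynamics: code that fails to execute
    #     means any reported tool outputs were fabricated by the model.
    if not sandbox_ok:
        return False
    # (2) Drop tool-unnecessary samples: pass@8 = 1, i.e. the base model
    #     already answers correctly in at least one tool-free attempt.
    if any(ans == sample["answer"] for ans in base_model_answers):
        return False
    return True
```

A stricter reading of the filter (discard only when all 8 attempts succeed) would keep more data; the sketch follows the pass@k convention.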
Failure modes
- Hard distributions where |Q| is consistently small: efficiency barely trains
- pass@8=1 filter can over-prune tasks that need tools for reliability but happen to be guessable
- Efficiency reward counts tool calls, not quality — one bad search and one good search are penalized equally
Connections
- training-language-models-to-follow-instructions-with-human-feedback — RLHF background; the reward coupling problem is related to the alignment-tax problem (mixing objectives)
- direct-preference-optimization-your-language-model-is-secretly-a-reward-model — DPO sidesteps reward design; HDPO fixes reward design instead
- clip-learning-transferable-visual-models — multimodal foundation models that HDPO-style training builds on
- tool-use-agents — selective, meta-cognitive tool invocation is the core capability trained
- chain-of-thought — multi-turn agentic rollouts require structured reasoning before tool calls
- in-context-learning — base model capabilities that HDPO refines without altering the underlying LLM
- alignment — reducing unnecessary tool use while maintaining accuracy is an alignment objective
Citation
@article{yan2026metis,
title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
journal={arXiv preprint arXiv:2604.08545},
year={2026}
}