Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

1Accio Team, Alibaba Group
2Huazhong University of Science and Technology
*Equal Contribution · Project Leader · Corresponding Author

Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose Hierarchical Decoupled Policy Optimization (HDPO), a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces the tool-invocation rate by more than an order of magnitude (e.g., from 98% to 2%) while simultaneously elevating reasoning accuracy. By shattering the illusion that heavy tool reliance equates to better performance, Metis pioneers a shift from merely executing tools to cultivating the meta-cognitive wisdom of abstention.
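The conditional advantage estimation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the binary accuracy reward, and the use of negative tool counts as the efficiency reward are all our assumptions; only the decoupling itself (a group-normalized accuracy channel, plus an efficiency channel computed exclusively over correct trajectories and scaled by w_tool) follows the text.

```python
import numpy as np

def hdpo_advantages(rewards, tool_counts, w_tool=0.15):
    """Hypothetical sketch of HDPO's decoupled advantage estimation.

    rewards:     per-trajectory accuracy rewards (assumed 1.0 correct, 0.0 wrong)
    tool_counts: number of tool invocations in each trajectory
    w_tool:      weight of the efficiency channel
    """
    rewards = np.asarray(rewards, dtype=float)
    tool_counts = np.asarray(tool_counts, dtype=float)

    # Accuracy channel: standard group-normalized advantage (GRPO-style).
    acc_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Efficiency channel: normalized ONLY among correct trajectories, so the
    # tool penalty never competes with (or is drowned out by) the accuracy
    # reward's variance, and incorrect rollouts receive no efficiency signal.
    eff_adv = np.zeros_like(rewards)
    correct = rewards > 0.5
    if correct.sum() > 1:
        r_eff = -tool_counts[correct]  # fewer tool calls -> higher reward
        eff_adv[correct] = (r_eff - r_eff.mean()) / (r_eff.std() + 1e-8)

    return acc_adv + w_tool * eff_adv
```

With two correct rollouts (0 vs. 3 tool calls) and two incorrect ones, both correct rollouts keep a large positive accuracy advantage, while the tool-free one is additionally preferred — the curriculum effect of correctness first, economy second.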

Metis Framework Overview: A strategic multimodal reasoning agent that selectively invokes code execution, text search, and image search tools during multi-turn reasoning
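The multi-turn loop in the overview can be pictured as below. This is an illustrative sketch only: the dict-based message protocol, the `run_agent` signature, and the stopping convention are our assumptions; only the tool set (code execution, text search, image search) and the answer-or-invoke decision at each turn come from the figure description.

```python
def run_agent(model, query, tools, max_turns=8):
    """Minimal multi-turn tool-use loop (illustrative, not Metis's actual API).

    model: maps the dialogue history to either {"action": "answer", ...}
           or {"action": <tool name>, "args": ...}.
    tools: maps tool names (e.g. "code_exec", "text_search", "image_search")
           to callables returning an observation.
    """
    history = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        step = model(history)
        if step["action"] == "answer":  # meta-cognitive abstention: no tool call
            return step["content"]
        obs = tools[step["action"]](step["args"])  # invoke the selected tool
        history.append({"role": "tool", "name": step["action"], "content": obs})
    return None  # turn budget exhausted without a final answer
```

The "act wisely" behavior the paper trains for corresponds to the model choosing the `"answer"` branch on the first turn whenever the query is already resolvable from the raw visual context.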

Key Insights & Results

Performance Results

Perception & Document
| Model | V*Bench | HR4K | HR8K | TreeBench | MME-RW | SEED2+ | CharXiv(DQ) | CharXiv(RQ) |
|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | |
| LLaVA-OneVision | 75.4 | 63.0 | 59.8 | 37.3 | 57.4 | 65.4 | - | - |
| InternVL3-8B | 81.2 | 70.0 | 69.3 | 38.8 | - | 69.7 | 73.6 | 37.6 |
| Qwen2.5-VL-7B | 75.3 | 65.5 | 62.1 | 37.0 | 56.8 | 70.4 | 72.7 | 40.2 |
| Qwen2.5-VL-32B | 80.6 | 69.3 | 63.6 | 42.5 | 59.1 | 72.4 | 83.2 | 48.0 |
| Qwen3-VL-8B | 86.4 | 78.9 | 74.6 | 40.7 | 61.9 | 71.0 | 83.0 | 46.3 |
| **Agentic Multimodal Models** | | | | | | | | |
| Pixel-Reasoner | 84.3 | 72.6 | 66.1 | 39.0 | 64.4 | - | - | - |
| DeepEyes | 83.3 | 73.2 | 69.5 | 37.5 | 64.1 | - | - | - |
| Thyme | 82.2 | 77.0 | 72.0 | - | 64.8 | - | - | - |
| DeepEyesV2 | 81.8 | 77.9 | 73.8 | 42.5 | 64.9 | 70.5 | 78.6 | 48.9 |
| Mini-o3 | 88.2 | 77.5 | 73.3 | - | 65.5 | - | - | - |
| SenseNova-MARS-8B | 92.2 | 83.1 | 78.4 | - | 67.9 | - | - | - |
| Skywork-R1V4-30B | 88.0 | 82.8 | 79.8 | - | 71.4 | - | - | - |
| **Metis (Ours)** | 91.1 | 83.5 | 82.0 | 45.2 | 70.3 | 72.5 | 83.4 | 54.1 |

Perception & Document Understanding Benchmarks

Math & Reasoning
| Model | MathVista | MathVerse | WeMath | DynaMath | LogicVista | Avg. |
|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | |
| LLaVA-OneVision | 58.6 | 19.3 | 20.9 | - | 33.3 | - |
| Qwen2.5-VL-7B | 68.3 | 45.6 | 34.6 | 53.3 | 45.9 | 49.5 |
| InternVL3-8B | 71.6 | 39.8 | 37.1 | - | 44.1 | - |
| Qwen3-VL-8B | 76.3 | 61.3 | 38.8 | 65.5 | 54.9 | 59.4 |
| **Text-only Reasoning Models** | | | | | | |
| MM-Eureka-7B | 72.6 | 50.3 | 21.8 | - | 46.3 | - |
| ThinkLite-VL-7B | 75.1 | 52.1 | 41.8 | - | 42.7 | - |
| VL-Rethinker-7B | 74.9 | 54.2 | 36.3 | - | 42.7 | - |
| VLAA-Thinker-7B | 71.7 | - | 35.7 | - | 45.9 | - |
| **Agentic Multimodal Models** | | | | | | |
| DeepEyes | 70.1 | 47.3 | 38.9 | 55.0 | 47.7 | 51.8 |
| Thyme | 70.0 | - | 39.3 | - | 49.0 | - |
| DeepEyesV2 | 71.9 | 52.7 | 38.1 | 57.2 | 48.7 | 53.7 |
| **Metis (Ours)** | 78.0 | 65.9 | 65.2 | 69.2 | 56.2 | 66.9 |

Mathematical & Logical Reasoning Benchmarks

Ablation
| Method | V*Bench | HR4K | HR8K | CharXiv(RQ) | MathVista |
|---|---|---|---|---|---|
| Standard GRPO (w_tool = 0) | 88.7 | 81.0 | 79.2 | 51.0 | 76.9 |
| HDPO (w_tool = 0.10) | 88.0 | 83.5 | 81.0 | 52.7 | 77.4 |
| HDPO (w_tool = 0.15) ✔ | 91.1 | 83.5 | 82.0 | 54.1 | 78.0 |
| HDPO (w_tool = 0.20) | 87.4 | 82.5 | 80.5 | 51.5 | 77.2 |

Ablation: Effect of Tool-Efficiency Weight w_tool

Contact & Opportunities

If you have any questions about this project, please feel free to contact:

BibTeX

@article{yan2026metis,
  title={Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models},
  author={Yan, Shilin and Tong, Jintao and Xue, Hongwei and Tang, Xiaojun and Wang, Yangyang and Shi, Kunyu and Zhang, Guannan and Li, Ruixuan and Zou, Yixiong},
  journal={arXiv preprint arXiv:2604.08545},
  year={2026}
}