Alibaba’s Metis agent reduces unnecessary AI tool calls from 98% to 2% and becomes more accurate in doing so.



One of the main challenges in creating effective AI agents is teaching them to choose between using external tools or relying on their internal knowledge. But large language models are often trained to run tools blindly, leading to latency bottlenecks, unnecessary API overhead, and degradation caused by environmental noise.

To overcome this problem, Alibaba researchers introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance both execution efficiency and task accuracy.

Metis, a multimodal model they trained using this framework, reduced unnecessary tool calls from 98% to just 2% while setting new state-of-the-art reasoning accuracy on key industry benchmarks. The framework helps create AI agents that are not trigger-happy and know when to refrain from using tools, enabling responsive and cost-effective agent systems.

Metacognitive deficit

Current agent models suffer from what researchers call a “profound metacognitive deficit.” Models have difficulty deciding when to use their internal parametric knowledge versus when to call on an external utility. As a result, they blindly invoke tools and APIs such as web search or code execution, even when the user’s query already contains all the necessary information to solve the task.

This trigger-happy tool calling poses serious operational hurdles for real-world applications. Because the models are trained to focus almost entirely on task completion, they are indifferent to latency, and they often reach exorbitant tool call rates. Every unnecessary external API call introduces a serial processing bottleneck, turning a technically competent AI into a sluggish system that frustrates users and burns through tooling budgets.

At the same time, burning compute on excessive tool use does not translate into better reasoning. Unnecessary tool interactions add noise to the model's context. This noise can distract the model, derail an otherwise sound reasoning chain, and actively degrade the final result.

To address the latency and cost of blind tool invocation, previous reinforcement learning techniques tried to penalize excessive tool use by combining task accuracy and execution efficiency into a single reward signal. However, this entangled design creates an intractable optimization dilemma. If the efficiency penalty is too aggressive, the model becomes overly conservative and suppresses essential tool use, sacrificing accuracy on difficult tasks. Conversely, if the penalty is too lenient, the optimization signal loses its force and fails to prevent tool overuse on simpler tasks.

Furthermore, this shared reward creates semantic ambiguity: an inaccurate trajectory with zero tool calls can receive the same reward as an accurate trajectory with excessive tool use. Because the training signals for accuracy and efficiency are entangled, the model cannot learn to control tool use without compromising its basic reasoning capabilities.
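The ambiguity is easy to reproduce with a toy example. The sketch below assumes a hypothetical single-scalar reward of the form "accuracy minus a per-call penalty"; the function name, the penalty value, and the linear form are illustrative assumptions, not the formulation used in any specific prior method.

```python
def coupled_reward(correct: bool, tool_calls: int, penalty: float) -> float:
    """Hypothetical entangled reward: 1 for a correct answer, minus a
    fixed penalty per tool call (an assumed form, for illustration only)."""
    return (1.0 if correct else 0.0) - penalty * tool_calls


# A wrong answer that used no tools ...
wrong_no_tools = coupled_reward(correct=False, tool_calls=0, penalty=0.25)
# ... versus a right answer that needed four tool calls.
right_many_tools = coupled_reward(correct=True, tool_calls=4, penalty=0.25)

# Both trajectories collapse onto the same scalar, so the optimizer
# cannot tell "wrong but cheap" apart from "right but expensive".
print(wrong_no_tools, right_many_tools)
```

Under these assumed values the two trajectories are indistinguishable to the optimizer, which is the ambiguity described above. Shrinking the penalty separates them again, but then the penalty barely discourages redundant calls.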

Hierarchical decoupled policy optimization

To solve the optimization dilemma of entangled rewards, the researchers introduced HDPO. HDPO separates accuracy and efficiency into two independent optimization channels. The accuracy channel aims to maximize task correctness across all of the model's rollouts. The efficiency channel optimizes for execution economy.

HDPO calculates the training signals for these two channels independently and combines them only at the final stage of the loss calculation. The efficiency signal is conditioned on the accuracy channel, which means that a wrong answer is never rewarded simply for being fast or using fewer tools. This separation avoids situations where the gradients for accuracy and efficiency cancel each other out, and it gives the model clean learning signals for both objectives.
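A minimal sketch of this idea follows. It is not the researchers' implementation: it assumes group-relative normalization over a batch of rollouts for one prompt, a binary correctness reward, and a hypothetical `eff_weight` for the final combination. All function and parameter names are illustrative.

```python
from statistics import mean, pstdev


def normalize(values):
    """Group-relative score: (x - mean) / std, or all zeros if no variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls, eff_weight=0.5):
    """Return per-rollout (accuracy_adv, efficiency_adv, combined) lists."""
    # Accuracy channel: computed over every rollout in the group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: fewer tool calls is better, but rollouts are
    # compared only against other *correct* rollouts.
    eff_adv = [0.0] * len(correct)
    correct_idx = [i for i, c in enumerate(correct) if c]
    if len(correct_idx) >= 2:
        scores = normalize([-float(tool_calls[i]) for i in correct_idx])
        for i, score in zip(correct_idx, scores):
            eff_adv[i] = score

    # The two signals meet only here, at the final combination step.
    combined = [a + eff_weight * e for a, e in zip(acc_adv, eff_adv)]
    return acc_adv, eff_adv, combined


acc, eff, total = decoupled_advantages(
    correct=[True, True, False, False],
    tool_calls=[0, 4, 0, 6],
)
print(acc, eff, total)
```

In this example the wrong rollout that used zero tools gets no efficiency credit, so it scores the same as the wrong rollout that used six. Among the two correct rollouts, the one that avoided tools scores higher.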

The most powerful emergent property of this decoupled design is that it creates an implicit cognitive curriculum. At the beginning of training, when the model is still struggling with the task, the accuracy objective dominates the optimization, forcing the model to prioritize correct reasoning and knowledge acquisition. As the model's reasoning ability matures and it consistently arrives at correct answers, the efficiency signal scales up smoothly. This mechanism causes the model to first master solving the task and only then refine its self-reliance by avoiding unnecessary, expensive API calls.

To complement HDPO, the researchers developed a rigorous, multi-stage data curation regime that addresses critical flaws they identified in existing tool-augmented datasets. Their data curation pipeline covers both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages.

For the SFT phase, they sourced data from publicly available tool-augmented multimodal trajectories and filtered it to remove low-quality samples containing execution failures or inconsistent feedback. They also aggressively filtered out any training samples that the base model could solve directly without tools. Finally, using Google Gemini 3.1 Pro as an automated judge, they filtered the SFT corpus to retain only examples demonstrating strategic tool use.

Curation for the RL phase focused on providing a stable optimization signal. They filtered out queries with corrupted visuals or semantic ambiguity. The HDPO algorithm is based on comparing correct and incorrect answers. If a task is so easy that the model always gets it right, or so hard that the model always fails, there is no meaningful variance to learn from. The team strictly retained only prompts that exhibit a non-trivial mix of successes and failures, so that each one provides a useful gradient signal.
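The variance-based filter can be sketched as follows. The article does not give the exact criterion or thresholds, so this assumes each prompt is sampled several times with the base model and kept only if its pass rate is strictly between 0 and 1. The helper name, the thresholds, and the sample data are hypothetical.

```python
def has_learning_signal(outcomes, low=0.0, high=1.0):
    """Keep a prompt only if its sampled rollouts mix successes and failures.

    `outcomes` is a list of booleans, one per sampled rollout. `low` and
    `high` are assumed pass-rate bounds, not values from the source.
    """
    if not outcomes:
        return False
    pass_rate = sum(outcomes) / len(outcomes)
    return low < pass_rate < high


# Hypothetical rollout results for three prompts.
sampled = {
    "too_easy": [True, True, True, True],
    "too_hard": [False, False, False, False],
    "useful": [True, False, True, False],
}
kept = [name for name, outcomes in sampled.items() if has_learning_signal(outcomes)]
print(kept)
```

Prompts where every rollout succeeds or every rollout fails yield identical rewards within the group, so a comparison-based objective has nothing to contrast and they are dropped.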

Metis agent: HDPO in action

To test HDPO in action, the researchers used the framework to develop Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on the Qwen3-VL-8B-Instruct vision-language model. The researchers trained it in two phases. First, they applied SFT on their curated data to provide a cold-start initialization. Next, they applied RL using the HDPO framework, exposing the model to multi-turn interactions in which it could invoke tools such as Python code execution, text search, and image search.

The researchers compared Metis with standard open-source vision models such as LLaVA-OneVision, text-only reasoners, and state-of-the-art agent models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. The evaluation covered two main areas: visual perception and document understanding datasets such as HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks such as WeMath and MathVista.

In all tasks, Metis achieved state-of-the-art or highly competitive performance, outperforming existing agent models, including the larger 30-billion-parameter Skywork-R1V4, on both visual perception and reasoning tasks.

Equally important is how Metis behaves in practice. For example, when presented with an image of a museum sign and asked what its central text says, standard agent models waste time blindly writing Python scripts to crop the image just to read it. Metis, however, recognizes that the text is clearly legible in the raw image. It skips the tools altogether and answers in a single inference pass.

In another experiment, the model was given a complex chart and asked to identify the second-highest line at a given data point within a small subplot. Metis recognized that this fine-grained visual analysis exceeded the image's native resolution and that it could not accurately distinguish the overlapping lines. Instead of guessing from the full image, it used Python to crop and zoom in on just that subplot region, which allowed it to correctly identify the line. It treats code as a precision tool deployed only when visual evidence is truly uncertain, not as a default fallback.

The researchers released Metis along with the code for HDPO under the Apache 2.0 license.

“Our results show that strategic tool use and strong reasoning performance are not mutually exclusive; rather, eliminating noisy, redundant tool calls directly contributes to high accuracy,” the researchers said. “More broadly, our work suggests a paradigm shift in tool-augmented learning: from simply teaching models how to execute tools to developing meta-cognitive wisdom about when to avoid them.”


