iToverDose / Startups · 1 May 2026

Alibaba’s AI agent Metis slashes tool calls by 96% while boosting accuracy

Alibaba’s Metis AI agent cuts redundant tool invocations from 98% to just 2% while improving reasoning accuracy, thanks to a new reinforcement learning framework that decouples efficiency from correctness.

VentureBeat · 3 min read

AI agents often waste critical resources by overusing external tools, even when the answer lies within their own knowledge base. This problem has long frustrated developers, who struggle to balance speed, cost, and accuracy in real-world applications. Now, Alibaba’s latest innovation may redefine how AI agents operate.

Breaking free from tool-call overuse

Most AI agents today are trained to invoke external tools—whether web searches, code execution, or APIs—by default, regardless of whether the task truly requires them. This "trigger-happy" behavior stems from a fundamental flaw in their training: models are optimized for task completion above all else, ignoring the inefficiencies of unnecessary tool calls. The result? Excessive latency, ballooning operational costs, and degraded reasoning performance as redundant calls inject noise into the model’s context.

In extreme cases, up to 98% of an agent’s tool invocations can be redundant, turning what should be a streamlined process into a sluggish, cost-heavy ordeal. The issue isn’t just wasted resources; it undermines the very reliability that AI agents are supposed to deliver. When models lean too heavily on external utilities, they risk introducing errors that could have been avoided by trusting their internal knowledge.

Introducing HDPO: A dual-objective solution

To tackle this challenge, Alibaba researchers developed Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework designed to separate two critical objectives: accuracy and efficiency. Previous approaches tried to merge these goals into a single reward signal, but this created an impossible trade-off. If the penalty for tool use was too harsh, the model would avoid tools entirely—even when they were necessary—sacrificing correctness. If the penalty was too lenient, the model reverted to overusing tools, defeating the purpose.

HDPO breaks this deadlock by treating accuracy and efficiency as independent optimization channels. The accuracy channel focuses solely on maximizing task correctness, while the efficiency channel optimizes for minimizing tool calls. Only at the final stage are these signals combined, ensuring the model is never rewarded for being fast or cheap at the expense of accuracy. This decoupling provides clear, unconflicted learning signals, allowing the AI to refine both its reasoning and its resource management.
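The article doesn’t give HDPO’s exact reward formulation, but the decoupling-then-combining idea can be sketched roughly as follows. The function names, the linear efficiency score, and the `eff_weight` value are all illustrative assumptions, not Alibaba’s published method:

```python
def hdpo_reward(correct: bool, tool_calls: int,
                max_calls: int = 10, eff_weight: float = 0.5) -> float:
    """Illustrative decoupled reward: accuracy and efficiency are scored
    independently and combined only at the end (a sketch, not HDPO itself)."""
    # Accuracy channel: cares only about task correctness.
    accuracy_reward = 1.0 if correct else 0.0

    # Efficiency channel: fewer tool calls -> higher score, computed
    # independently of whether the answer was right.
    efficiency_reward = 1.0 - min(tool_calls, max_calls) / max_calls

    # Combine at the final stage, gating the efficiency bonus on correctness
    # so the model is never paid for being fast or cheap but wrong.
    return accuracy_reward + (eff_weight * efficiency_reward if correct else 0.0)
```

Under this kind of gating, a wrong answer scores zero no matter how few tools it used, while a correct answer earns an extra bonus for restraint, which is the unconflicted signal the article describes.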

The framework also introduces an implicit cognitive curriculum. Early in training, the model prioritizes accuracy, learning to solve tasks correctly before optimizing for speed. As its reasoning improves, the efficiency signal gradually scales up, teaching the model to recognize when it can rely on its own knowledge instead of defaulting to external tools.
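One simple way to realize such a curriculum is to ramp the weight on the efficiency signal over training steps. The linear schedule and parameter names below are assumptions for illustration; the paper may use a different schedule:

```python
def efficiency_weight(step: int, total_steps: int, max_weight: float = 0.5) -> float:
    """Hypothetical curriculum: the efficiency signal starts at zero
    (accuracy-first) and ramps linearly to max_weight as training progresses."""
    return max_weight * min(step / total_steps, 1.0)
```

Early in training the weight is near zero, so gradients are dominated by correctness; by the end, the model is also penalized for leaning on tools it no longer needs.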

Refining the training pipeline for smarter agents

HDPO isn’t the only innovation behind Alibaba’s breakthrough. The team also overhauled the training data pipeline to address flaws in existing tool-augmented datasets. Their approach involved a multi-stage curation process to ensure high-quality learning signals:

  • Supervised fine-tuning (SFT) phase: The researchers filtered publicly available datasets to remove low-quality examples—such as those with execution failures or inconsistent feedback. They also excluded samples that the base model could solve without tools, ensuring the training data emphasized strategic tool use rather than unnecessary calls.
  • Reinforcement learning (RL) phase: The focus shifted to stability and variance. Prompts with corrupted visuals or ambiguous semantics were filtered out to prevent noisy optimization signals. The team retained only tasks where the model exhibited a mix of successes and failures, providing meaningful gradients for learning. Trivial or impossibly hard tasks were discarded, as they offered no actionable insights.

To evaluate the quality of the SFT corpus, the researchers used Google’s Gemini 3.1 Pro as an automated judge, filtering examples to retain only those demonstrating efficient tool use. This rigorous pipeline ensures the model learns from high-quality, diverse, and minimally noisy data.

Metis agent: Proof that smarter AI is possible

The results speak for themselves. When trained with HDPO, Alibaba’s Metis agent reduced redundant tool invocations from 98% to just 2% while simultaneously achieving state-of-the-art reasoning accuracy on key industry benchmarks. This isn’t just about cutting costs—it’s about building AI systems that are faster, more reliable, and more aligned with real-world needs.

The implications are far-reaching. For enterprises, this could mean lower cloud bills, reduced latency, and more consistent performance. For developers, it signals a shift toward more deliberate, efficient AI agents that know when to act—and when to abstain. As AI systems grow more complex, frameworks like HDPO may become essential for balancing performance with sustainability.

The future of AI agents isn’t just about doing more—it’s about doing better. And with Metis, Alibaba has shown that the next leap in agentic systems might be closer than we think.

AI summary

Alibaba’s new Metis AI agent reduces unnecessary tool calls from 98% to 2%, minimizing both costs and latency. So how was this breakthrough achieved, and how does it differ from other AI agents?
