New Alibaba AI framework skips loading every tool, cutting agent token usage by 99%



As enterprise AI systems expand to handle complex workflows, practitioners face the challenge of routing subtasks to the right tools and skills. Agents can have hundreds of tools and skills, and it can be confusing to know which one to use for each step of the workflow.

To solve this problem, Alibaba researchers have developed SkillWeaver, a framework that creates an execution graph for a given task and selects the correct skill for each node. They also introduce Skill-Aware Decomposition (SAD), a new technique that uses a feedback loop to let the agent iteratively retrieve and test appropriate tool candidates. This compositional approach and feedback loop set SkillWeaver apart from other tool routing frameworks, which select tools in a single shot.

SkillWeaver applies to real-world AI applications where agents autonomously orchestrate multi-tool ecosystems such as Model Context Protocol (MCP) to perform multi-step business operations such as downloading datasets, transforming data, and generating visual reports.

In practice, the researchers' experiments with SkillWeaver show that this retrieval and routing approach significantly increases accuracy while reducing token consumption by more than 99% compared to naively exposing agents to the entire tool library.

A key takeaway for practitioners developing AI agents is that the quality of task decomposition is the biggest bottleneck to accurate tool retrieval.

The skill routing problem

Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification described with structured natural-language documentation.

As enterprise agents integrate with massive tool ecosystems, routing user queries to the right skills becomes a challenge. Submitting the entire library to an LLM to find the right tool is very inefficient: it quickly exceeds context limits and wastes hundreds of thousands of tokens.

Most existing tool routing frameworks attempt to address this through API retrieval, document matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.

However, this single-skill paradigm is insufficient for enterprise environments, as real-world queries are inherently compositional. A standard enterprise request such as "Download datasets, convert and generate visual reports" cannot be accomplished by a single skill. It requires breaking down the request and chaining an API client, a data processor, and a visualization tool into a unified, multi-step execution plan.

How SkillWeaver and SAD work

To overcome this, the researchers frame the problem of solving complex tasks that require many skills as "compositional skill routing." Given a complex user query and a large library of tools, the agent must work out how to break the query into a sequence of atomic subtasks, how to associate each subtask with the best available skill, and how to arrange those skills into an executable plan.

SkillWeaver handles this process in three distinct steps: Decompose, Retrieve, and Compose. In the first step, an LLM acts as a task decomposer, breaking down the user's complex query into a sequence of subtasks, each requiring one skill. Once the subtasks are clearly defined, the system uses an embedding model to compare each subtask against the skill library and retrieve a shortlist of the best candidate tools for each step.

In the final step, the scheduler evaluates the retrieved candidates based on how well they work together. It checks cross-skill compatibility to ensure that the outputs of one tool fit naturally into the inputs of the next. It then creates a final execution plan, such as a Directed Acyclic Graph (DAG), that maps dependencies so that independent tasks can potentially be executed in parallel.

For example, consider a user asking an AI agent to "Download datasets, convert and generate visual reports." In the decomposition phase, the decomposer LLM breaks the request down into three distinct subtasks: downloading the dataset, transforming the data, and generating reports.

In the retrieval phase, the system searches the library and finds candidates such as “api-client” or “http-fetch” for the first subtask, “csv-parser” or “etl-pipeline” for the second, and so on. Finally, the composition phase evaluates these options, selects the specific combination of “api-client”, “csv-parser” and “chart-gen” that is most appropriate, and combines them into a final, ready-to-run workflow.
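The three stages in this example can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the decomposer is hard-coded instead of calling an LLM, retrieval uses word overlap instead of an embedding model, and the input/output types attached to each skill are hypothetical labels invented here to demonstrate the compatibility check.

```python
from graphlib import TopologicalSorter
from itertools import product

# Hypothetical skill library: name -> (input type, output type, description).
# The type labels are assumptions made for this sketch.
SKILLS = {
    "api-client":   ("url", "raw_file", "download a dataset from a remote api"),
    "http-fetch":   ("url", "html", "fetch a web page over http"),
    "csv-parser":   ("raw_file", "table", "parse and transform csv data"),
    "etl-pipeline": ("db_conn", "table", "extract transform load data from a database"),
    "chart-gen":    ("table", "report", "generate a visual report with charts"),
}

def decompose(query):
    """Stand-in for the decomposer LLM: returns hard-coded subtasks."""
    return ["download the dataset", "transform the data", "generate a visual report"]

def retrieve(subtask, k=2):
    """Shortlist k skills by word overlap (stand-in for embedding search)."""
    words = set(subtask.lower().split())
    ranked = sorted(SKILLS, key=lambda s: -len(words & set(SKILLS[s][2].split())))
    return ranked[:k]

def compose(candidates_per_step):
    """Pick the first combination whose outputs feed the next step's inputs."""
    for combo in product(*candidates_per_step):
        if all(SKILLS[a][1] == SKILLS[b][0] for a, b in zip(combo, combo[1:])):
            # Build a dependency graph: each step depends on the previous one.
            graph = {combo[0]: set()}
            for prev, nxt in zip(combo, combo[1:]):
                graph[nxt] = {prev}
            return list(TopologicalSorter(graph).static_order())
    return None  # no compatible combination found

subtasks = decompose("Download datasets, convert and generate visual reports")
shortlists = [retrieve(s) for s in subtasks]
plan = compose(shortlists)
print(plan)  # ['api-client', 'csv-parser', 'chart-gen']
```

Here "etl-pipeline" is shortlisted for the second step but rejected at composition time because its assumed input type does not match what "api-client" produces, which is the kind of cross-skill check the Compose step is described as performing.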

A major problem with this pipeline is that LLMs often produce generic step descriptions that do not correspond to the specific, technical vocabulary of the actual skills available in the library. To fix this, SkillWeaver introduces a new feedback loop, Skill-Aware Decomposition (SAD). SAD works by having the LLM draft an initial plan, running a first retrieval pass to find loosely matching skills, and then feeding those retrieved skills back to the LLM as hints. This allows the LLM to rewrite its decomposition so that the granularity and vocabulary match the actual tools available.

SkillWeaver in action
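The feedback loop can be sketched as follows. This is a minimal illustration of the draft, retrieve, and re-plan idea, not the paper's prompts or code: `skill_aware_decompose`, `toy_llm`, and `toy_retrieve` are hypothetical names, the "LLM" is a stub that returns canned answers, and the stopping rule (stop when the plan no longer changes) is an assumption.

```python
# Hypothetical three-skill library used only for this sketch.
LIBRARY = [
    "api-client: download a dataset",
    "csv-parser: transform csv data",
    "chart-gen: generate a visual report",
]

def toy_retrieve(subtask, k):
    """Stand-in retriever: rank skill descriptions by word overlap."""
    words = set(subtask.lower().split())
    def overlap(doc):
        return len(words & set(doc.lower().replace(":", "").split()))
    return sorted(LIBRARY, key=lambda d: -overlap(d))[:k]

def toy_llm(query, hints):
    """Stub decomposer. Without hints it over-decomposes in generic wording;
    with hints it emits one subtask per hinted skill, in the skill's vocabulary."""
    if not hints:
        return ["fetch the dataset", "clean data", "convert data", "make a visual report"]
    return [d.split(": ", 1)[1] for d in LIBRARY if d in hints]

def skill_aware_decompose(query, llm, retrieve, rounds=2, k=1):
    subtasks = llm(query, hints=[])          # library-blind first draft
    for _ in range(rounds):
        hints = []
        for s in subtasks:                   # loose first-pass retrieval
            for skill in retrieve(s, k):
                if skill not in hints:
                    hints.append(skill)
        revised = llm(query, hints=hints)    # re-plan with skills as hints
        if revised == subtasks:              # assumed stopping rule
            break
        subtasks = revised
    return subtasks

plan = skill_aware_decompose("Download datasets, convert and generate visual reports",
                             toy_llm, toy_retrieve)
print(plan)
```

In this toy run the four-step draft collapses to three subtasks whose wording matches the library, which mirrors the granularity and vocabulary alignment the article describes. In a real system `toy_llm` would be replaced by a prompted model call and `toy_retrieve` by an embedding search.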

To evaluate how SkillWeaver performs in real-world enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of varying difficulty. To reflect real-world environments, they used a library of 2,209 real-world skills from the public MCP ecosystem, covering 24 functional categories such as cloud infrastructure, finance, and databases.

For the main engine, the researchers primarily used a lightweight 7-billion-parameter model (Qwen2.5-7B-Instruct) for task decomposition, combined with a standard semantic search retriever (FAISS-indexed MiniLM) to find tools. SkillWeaver was compared against three main baselines: a brute-force "LLM-Direct" approach that puts all tool names into the prompt of one large model, vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.

The experiments show that task decomposition is a major bottleneck. Standard LLM behavior falls short when working with large tool libraries, but the SAD feedback loop moves the needle dramatically. In the vanilla setup, the 7B model achieved decomposition accuracy (i.e., predicting the correct number of steps) only 51.0% of the time. With the SAD feedback loop enabled, accuracy increased to 67.7% (and reached 92% with the larger Qwen-Max model). On "hard" tasks requiring four to five different skills, SAD increased accuracy by 50%.

An interesting finding was that larger models may perform worse when left unguided. In the vanilla setup, the accuracy of the larger 14-billion-parameter model fell short of that of the 7B model because it tended to over-decompose tasks into microscopic, redundant steps. Once SAD was introduced, the retrieved tool hints grounded the model and improved its accuracy. This suggests that grounding an agent in the vocabulary of its actual tools is often more effective than paying for a larger, more expensive LLM.

Another major benefit is token savings. The LLM-Direct baseline, run with the very large Qwen-Max model, showed that placing every tool in a large model's prompt fails. Despite its excellent task decomposition capabilities, the massive model identified the correct tool category only 21.1% of the time when it was overloaded with tool options. SkillWeaver's targeted retrieval and routing approach outperformed this in accuracy while cutting context window consumption by 99.9%, from an estimated 884,000 tokens per request to approximately 1,160 tokens. For practitioners, this translates directly into dramatically lower API costs and faster response times.

Finally, the traditional ReAct baseline failed completely, achieving 0% decomposition accuracy. Rather than mapping out a naturally connected, multi-tool sequence, its loop collapses multi-step plans into isolated actions.
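As a quick check on the reported figures, the reduction from roughly 884,000 to roughly 1,160 tokens per request works out as below. Only the two token counts come from the article; the variable names are illustrative.

```python
direct_tokens = 884_000   # estimated tokens per request with the full library in context
routed_tokens = 1_160     # approximate tokens per request with targeted retrieval

reduction = 1 - routed_tokens / direct_tokens
ratio = direct_tokens / routed_tokens

print(f"reduction: {reduction:.1%}")    # reduction: 99.9%
print(f"ratio: about {ratio:.0f}x")     # ratio: about 762x
```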

Considerations for developers

Although the researchers have yet to release the source code for SkillWeaver, their work is based on off-the-shelf tools that can be easily replicated.

Skill-Aware Decomposition (SAD), the key innovation at the heart of the framework, is essentially a clever prompt engineering and retrieval loop. The authors have shared prompt templates in their paper, and developers can easily implement this using standard orchestration libraries such as LangChain, LlamaIndex, or even raw Python scripts.

As for the retrieval component, the authors built it on the basic all-MiniLM-L6-v2 open-source embedding model. They found that swapping it for a slightly more powerful off-the-shelf encoder (BGE-base-en-v1.5) improved accuracy immediately without any fine-tuning. While the off-the-shelf bi-encoder is good at including the appropriate tool in the top 10 candidates, doing so about 70% of the time, it struggles to consistently rank the right tool in the number one spot, succeeding only about 37% of the time. To bridge this gap, teams will likely need to implement a second-stage cross-encoder or LLM-based reranker to rerank those top 10 candidates.

One prerequisite is to vectorize the tool library and build the FAISS index beforehand. In practice, this is a negligible hurdle. It took only 15 seconds to embed and index all 2,209 skills in the test. Once the index is built, retrieving tools from it adds less than 15 milliseconds of latency per request. For enterprise environments, keeping the tool index in sync is a trivial background job.

One limitation of SkillWeaver is the lack of error recovery. While SkillWeaver successfully mapped a suitable DAG to execute, the authors' pilot study revealed the fragility of multi-step toolchains. For example, if the API call fails in the second step, the entire chain breaks. The main contribution of the paper is limited to the routing and planning phase. For real production deployments, practitioners should build their own error recovery, rollback, and retry mechanisms on top of the composition phase to handle real-world API failures or malformed outputs.
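The gap between top-10 and top-1 accuracy, and the way a second-stage reranker can close it, can be illustrated with a small sketch. Everything here is a toy under stated assumptions: the rankings are made up, `recall_at_k` and `rerank` are hypothetical helper names, and `toy_score` (word overlap with the skill name) is only a stand-in for a real cross-encoder or LLM-based scorer.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold skill appears in the top-k candidates."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)

def rerank(query, candidates, score_fn, k=10):
    """Second stage: re-score only the top-k first-stage candidates."""
    head = sorted(candidates[:k], key=lambda c: -score_fn(query, c))
    return head + candidates[k:]

def toy_score(query, skill_name):
    """Stand-in for a cross-encoder: word overlap between query and skill name."""
    return len(set(query.split()) & set(skill_name.split("-")))

# Made-up first-stage (bi-encoder style) rankings and gold labels.
first_stage = {
    "parse csv": ["etl-pipeline", "csv-parser", "chart-gen"],
    "plot chart": ["chart-gen", "csv-parser", "api-client"],
}
gold = {"parse csv": "csv-parser", "plot chart": "chart-gen"}

second_stage = {q: rerank(q, c, toy_score) for q, c in first_stage.items()}

print(recall_at_k(first_stage, gold, 1))   # 0.5
print(recall_at_k(first_stage, gold, 2))   # 1.0
print(recall_at_k(second_stage, gold, 1))  # 1.0
```

The design point is that the expensive scorer only sees the short candidate list, so the first stage needs good recall while the second stage supplies the precision at rank one.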
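A retry layer on top of a linear plan might look like the sketch below. SkillWeaver itself does not provide this; `run_plan` and the step functions are hypothetical, and the sketch covers retries with exponential backoff only, leaving rollback and output validation to the reader.

```python
import time

def run_plan(steps, max_retries=2, backoff_s=0.0):
    """Run (name, fn) steps in order, passing each output to the next step.
    Retries a failing step; reports where the chain stopped if retries run out."""
    data, completed = None, []
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                data = fn(data)
                completed.append(name)
                break
            except Exception as exc:
                if attempt == max_retries:
                    return {"ok": False, "failed_step": name,
                            "completed": completed, "error": str(exc)}
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return {"ok": True, "result": data, "completed": completed}

# Toy steps: the second one fails once with a transient error, then succeeds.
calls = {"n": 0}

def flaky_transform(rows):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient API error")
    return [r.upper() for r in rows]

steps = [
    ("api-client", lambda _: ["a", "b"]),
    ("csv-parser", flaky_transform),
    ("chart-gen", lambda rows: f"report({len(rows)} rows)"),
]

outcome = run_plan(steps)
print(outcome)
```

Returning the list of completed steps alongside the failure makes it possible to resume or roll back from a known point instead of rerunning the whole chain.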


