How RecursiveMAS speeds up multi-agent inference by 2.4x and reduces token usage by 75%



One of the main problems with existing multi-agent AI systems is that they communicate by generating and sharing text sequences, which introduces latency, increases token costs, and makes it difficult to train the entire system as a single unit.

To overcome this challenge, researchers from the University of Illinois at Urbana-Champaign and Stanford University developed RecursiveMAS, a framework that allows agents to collaborate and exchange information through a shared latent space instead of text. This change yields both efficiency and performance gains.

Experiments show that RecursiveMAS achieves accuracy improvements in complex domains such as code generation, medical reasoning, and search, while increasing inference speed and reducing token usage.

RecursiveMAS is also much cheaper to train than standard full fine-tuning or LoRA methods, making it a scalable and cost-effective approach for building custom multi-agent systems.

The challenge of improving multi-agent systems

Multi-agent systems can tackle complex tasks that single-agent systems struggle with. A major challenge in scaling them for real-world applications is ensuring that the system evolves, improves, and adapts to different scenarios over time.

Prompt-based adaptation improves agent interactions by iteratively refining the shared prompts presented to the agents. By updating the prompts, the system acts as a director, guiding the agents to generate responses that are better aligned with the underlying goal. The main limitation is that the capabilities of the models underlying each agent remain static.

A more powerful approach is to train the agents themselves by updating the weights of their underlying models. Training an entire system of agents is difficult, however, because it is computationally prohibitive to update all parameters across multiple models.

Even if an engineering team commits to training its own models, the standard practice of having agents communicate through text presents major obstacles. Because agents rely on sequential text generation, each model must wait for the previous one to finish producing its text before it can begin its own processing, which introduces latency.

Forcing models to tokenize their intermediate reasoning just so the next model can read it is also deeply inefficient: it dramatically increases token usage, drives up computational costs, and makes iterative training painfully slow to scale across the entire system.

How RecursiveMAS works

Instead of trying to improve each agent as an isolated, independent component, RecursiveMAS is designed to co-develop and scale the entire multi-agent system as a single integrated whole.

The framework is inspired by recursive language models (RLMs). In a standard language model, data flows linearly through a stack of distinct layers. A recursive language model instead reuses a set of shared layers, feeding their output back into themselves. By looping the computation, the model can deepen its reasoning without adding parameters.
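A minimal sketch of that looping idea, assuming a PyTorch-style model (the block type, layer count, and loop depth here are illustrative, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """One shared block that is reused across recursion steps."""
    def __init__(self, dim: int):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, hidden: torch.Tensor, num_loops: int = 3) -> torch.Tensor:
        # Reuse the same weights for several passes: deeper computation,
        # no additional parameters.
        for _ in range(num_loops):
            hidden = self.layer(hidden)
        return hidden
```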

RecursiveMAS extends this principle from a single model to a multi-agent architecture that acts as one recursive system. In this setup, each agent plays the role of a layer in a recursive language model. Instead of generating text, agents pass their continuous hidden representations directly to the next agent in the chain, creating a stream of latent information flowing through the system.

This latent handoff continues through all agents. When the last agent finishes processing, its hidden outputs are fed back to the first agent, starting a new round of recursion.

This structure allows the entire multi-agent system to interact, reflect, and refine its collective reasoning entirely in latent space over multiple rounds, with only the last agent generating text output in the final round. It's as if the agents communicate telepathically as a single entity, and the last agent delivers the final response as text.
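A rough sketch of what that loop might look like, assuming hypothetical `forward_hidden` and `decode_to_text` methods on each agent (these names are stand-ins, not the released implementation):

```python
def recursive_mas_forward(agents, links, prompt_hidden, num_rounds=3):
    """Pass hidden states agent-to-agent for several rounds; only the
    last agent in the final round decodes text.

    agents: frozen language models exposing a hidden-state forward pass
    links:  modules mapping one agent's hidden space to the next agent's
    prompt_hidden: the encoded task prompt, as hidden states for agent 0
    """
    hidden = prompt_hidden
    for round_idx in range(num_rounds):
        for i, agent in enumerate(agents):
            # Each agent processes latent input and emits final-layer hidden states.
            hidden = agent.forward_hidden(hidden)
            is_last_agent = (i == len(agents) - 1)
            is_last_round = (round_idx == num_rounds - 1)
            if is_last_agent and is_last_round:
                # Only here is text actually generated.
                return agent.decode_to_text(hidden)
            # Project hidden states into the next agent's embedding space
            # (wrapping back to the first agent at the end of a round).
            hidden = links[i](hidden)
```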

The architecture of latent collaboration

To enable this continuous latent-space collaboration, the authors introduce a special architectural component called RecursiveLink. It is a lightweight, two-layer module designed to carry a model's hidden states forward rather than forcing them to be decoded into text.

The final-layer hidden states of a language model contain a rich, semantic representation of its reasoning process. RecursiveLink is designed to capture this high-dimensional data and transfer it from one point in the pipeline to the next.

To avoid the overhead of updating each parameter in multiple large language models, the framework keeps the models’ parameters frozen. Instead, it optimizes the system by only training the parameters of the RecursiveLink modules.

To handle both internal reasoning and external communication, the system uses two variants of the module. The internal RecursiveLink operates inside an agent during its reasoning phase. It takes the model's newly produced hidden states and feeds them directly back into its own input space. This allows the agent to generate a continuous stream of latent thoughts without producing discrete text tokens.

The external RecursiveLink acts as a bridge between agents. Because agents in a real-world system may use different model architectures and sizes, their internal embedding spaces can have different dimensions. The external RecursiveLink therefore contains an additional layer that maps hidden states from one agent's hidden dimension into the next agent's embedding space.
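The article describes RecursiveLink only at a high level; a plausible PyTorch sketch of the two variants might look like the following, where the layer widths and activation function are assumptions:

```python
import torch
import torch.nn as nn

class InternalRecursiveLink(nn.Module):
    """Maps an agent's own final-layer hidden states back into its
    embedding space, so it can keep reasoning without emitting tokens."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)

class ExternalRecursiveLink(nn.Module):
    """Bridges two agents whose models use different hidden sizes."""
    def __init__(self, src_dim: int, dst_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(src_dim, src_dim),
            nn.GELU(),
            nn.Linear(src_dim, dst_dim),  # extra mapping across dimensions
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)
```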

Training proceeds in two stages. First, the internal links are trained independently to warm up each agent's ability to reason in continuous hidden representations. The system then moves to outer-loop training, where the distinct, frozen models are chained together in a loop and the system is supervised on the final text output of the last agent.

The only parameters updated during training are those of the RecursiveLink modules; the original model weights remain unchanged, which makes the process cheaper than full fine-tuning or even low-rank adaptation (LoRA). Another advantage of this design comes into play when multiple agents sit on top of the same base model.

If two agents in a multi-agent system are built on the same base model but play different roles, you don't need to load two copies of the model into GPU memory or train them separately. The agents share the same backbone as the brain and use their RecursiveLink modules as the connective tissue.
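In practice that means the optimizer only ever sees the tiny link modules; a minimal sketch of how this could be wired up (function and variable names are illustrative):

```python
import torch

def trainable_parameters(backbone, links):
    """Freeze the shared backbone and expose only RecursiveLink weights."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    params = []
    for link in links:
        params.extend(link.parameters())
    return params

# Two agents in different roles can reuse one frozen backbone in GPU memory;
# only their small link modules differ and are trained, e.g.:
# optimizer = torch.optim.AdamW(
#     trainable_parameters(shared_backbone, [link_agent_a, link_agent_b]),
#     lr=1e-4,
# )
```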

RecursiveMAS in action

The researchers evaluated RecursiveMAS on nine benchmarks spanning math, science and medicine, code generation, and search-based question answering. They built multi-agent systems using open-weight models including Qwen, Llama-3, Gemma 3 and Mistral. These models were assigned roles to form different agent-collaboration patterns, such as sequential reasoning and mixture-of-experts.

RecursiveMAS was compared against baselines under the same training budgets, including standalone models enhanced with LoRA or full supervised fine-tuning, alternative multi-agent frameworks such as Mixture-of-Agents and TextGrad, and recursive baselines such as LoopLM. It was also compared to Recursive-TextMAS, which uses the same recursive loop structure as RecursiveMAS but forces agents to communicate explicitly via text.

RecursiveMAS achieved an average accuracy improvement of 8.3% over the strongest baselines. It particularly excelled at reasoning-heavy tasks, outperforming text-based optimization methods such as TextGrad by 18.1% on AIME2025 and 13% on AIME2026.

By avoiding text generation at every step, RecursiveMAS achieved a 1.2x to 2.4x end-to-end inference speedup. It is also far more token-efficient than the alternatives: compared to text-based Recursive-TextMAS, it reduces token usage by 34.6% in the first round of recursion and by 75.6% by the third round.

RecursiveMAS also proved cheap to train. Because it updates only the lightweight RecursiveLink modules, which hold roughly 13 million parameters, about 0.31% of the size of the frozen models, it requires the lowest peak GPU memory and cuts training cost by more than half compared to full fine-tuning.

Enterprise implications

The efficiency gains – lower token consumption, reduced GPU memory requirements, and faster throughput – are designed to make complex multi-step agent workflows feasible in production environments, without the computational costs that currently limit enterprise agent deployments. The researchers have released the code and trained model weights under the Apache 2.0 license.


