How Sakana Trained Model 7B to Control GPT-5, Claude Sonnet 4 and Gemini 2.5 Pro



Every LangChain pipeline your team hardcodes starts to break the moment the request distribution changes – and it always changes. This bottleneck is what Sakana AI tries to overcome.

Sakana AI researchers have introduced it "RL Conductor," a small language model trained by reinforcement learning to automatically organize a diverse pool of worker LLMs. A conductor dynamically analyzes inputs, distributes work among workers, and coordinates among agents.

This automated coordination achieves state-of-the-art results on challenging reasoning and coding criteria, outperforming individual boundary models such as GPT-5 and Claude Sonnet 4, as well as expensive human-designed multi-agent pipelines. It achieves this performance at a fraction of the cost and with fewer API calls than competitors. RL Conductor is the core of Sakana AI’s commercial multi-agent orchestration service, Fugu.

Limitations of manual agent frameworks

Large language models have powerful implicit capabilities. However, fully exploiting these opportunities is a big challenge. Extracting this level of performance relies on hand-crafted agent workflows that serve as critical components in commercial AI products.

However, these frameworks fall short because they are inherently rigid and limited. In comments to VentureBeat, paper co-author Yujin Tang explained the exact breaking point of existing systems: "When using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents, it can work well for specific use cases… In production, a unique bottleneck arises when targeting domains with large user bases with very heterogeneous requirements."

Tang noted that he had succeeded "Real-world generalization in such heterogeneous applications naturally necessitates going beyond human-coded designs."

Another obstacle to building robust agent systems is that no single model is optimal for all tasks. Different models are fine-tuned to specialize in different domains. One model may excel at scientific reasoning, while another excels at code generation, mathematical logic, or high-level planning.

Because models have these different characteristics and complementary capabilities, it is practically impossible to manually predict and hard-code the ideal combination of models for each query. An optimal agent framework should be able to analyze the problem and delegate subtasks to the most appropriate specialist in the pool.

Conducting an orchestra of agents

The RL Conductor is designed to overcome the limitations of rigid, man-made frames. As its name suggests, it manages an orchestration of agents by dividing difficult problems, delegating targeted subtasks, and designing communication topologies for a number of worker LLMs.

Rather than relying on fixed code or static routing, Conductor organizes these models by creating customized workflows. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to execute it, and "access list" dictates which past subtasks and responses from other agents are included in that agent’s context.

By defining everything in natural language, Conductor builds flexible workflows tailored to each input. Depending on the requirements of the problem, it can build simple sequential chains, parallel tree structures, or even recursive loops.

Importantly, the model learns these strategies through reinforcement learning (RL) and reward maximization rather than by human design. During training, the model is given a task, a pool of workers, and a reward signal based on whether its response and output format are correct.

Through a simple trial-and-error RL algorithm, the model organically discovers which combinations of instructions and communication structures yield the highest reward. As a result, it automatically adopts advanced orchestration strategies such as targeted operational engineering, iterative refinement, and meta-rest optimization.

The model learns to dynamically adjust its strategies and exploit the different strengths of worker agents, without any human developer hard-coding the process.

A conductor in action

To test the RL Conductor in action, the researchers fine-tuned the 7 billion Qwen2.5-7B parameter using the framework. During the training, the Conductor was tasked with designing agent workflows with up to five steps. It was given access to a working pool of seven different models: three closed-source giants (Gemini 2.5 Pro, Claude-Sonnet-4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-).

The team evaluated Conductor against a variety of highly challenging benchmarks, comparing it to individual frontier models acting alone, self-reflective agents that are iteratively driven to improve their responses, and state-of-the-art multi-agent routing frameworks such as MASRouter, Mixture-of-Agents (MoAmooth). The Junior 7B Conductor set new benchmarks across the board. According to the researchers, it scored an average of 77.27% on all tasks, 93.3% on the AIME25 math test, 87.5% on GPQA-Diamond and 83.93% on LiveCodeBench.

Amazingly, it achieved these marks while maintaining high efficiency. While baseline models like MoA burn 11,203 tokens per question, Conductor used an average of just 1,820 tokens and took an average of three steps per workflow.

A closer look at the experimental details shows exactly why the framework is so effective. The conductor automatically learned to gauge the difficulty of the task. For simple factual callback questions, it often solves the problem in one step or uses a basic two-agent setup. However, for complex coding problems, he established extensive workflows involving four agents with specific planning, implementation, and verification phases.

The conductor also learned that boundary models have different strengths. To achieve a record score in coding benchmarks, Conductor often set Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level schedulers, bringing in GPT-5 only to write the final optimized code. In a particularly clever display of adaptability, the Conductor sometimes abdicates its role entirely, handing over the entire scheduling process to Gemini 2.5 Pro and letting it dictate subtasks for the rest of the pool.

Apart from math and coding benchmarks, Sakana AI already lays the basic architecture to work in the front office utility. "We use our Conductor-based Fugu models internally for a variety of practical enterprise applications: software development, deep research, strategy development, and even visual tasks like creating slides." Tang said.

Bringing an orchestra to the establishment: Sakana Fugu

Although the 7B model described in the research paper is an exploratory design and has not been made public, Sakana has developed the AI ​​Conductor framework into its flagship commercial AI product, Sakana Fugu. Fugu, now in beta, serves as a multi-agent orchestration system accessible through a standard OpenAI-compliant API.

Tang Fugu noted his targets "A large industrial market where AI adoption has yet to gain traction due to generalization limitations of current hard-coded pipelines such as finance and defense."

For enterprise developers, this enables seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks between different vendors. Behind the API interface, Fugu automates complex collaboration topologies and role assignments in a pool of models. To support different business needs, Sakana released two variants: Fugu Mini, built for low-latency operations, and Fugu Ultra, designed for maximum performance in demanding workloads.

Addressing governance concerns around autonomous agents spinning invisible workflows, Tang noted that interpretable risks are functionally similar to the hidden reasoning traces of current high-level closed APIs, and that the system is managed with safeguards built to minimize hallucinations.

For enterprise architects considering when to implement RL-orchestration versus traditional routing, the decision often comes down to engineering resources. "We believe that the absolute sweet spot comes when users and their teams feel they are spending a disproportionate amount of time guiding their key agents," Tang said. However, he noted, cautioning that the framework is not necessary for everything "For simple queries, the economic proposition of a native model running directly on the user’s machine is hard to beat."

As the variety of custom open- and closed-source AI models continues to grow, statically coded pipelines will inevitably become obsolete. Looking to the future, this dynamic orchestration will likely extend beyond text and code environments. "There is really great potential to fill this gap with cross-modal Conductor frameworks, the basis for more autonomous, self-coordinating physical AI systems." Tang said.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *