
One of the assumptions behind today’s AI frameworks is that agents require a “boss” at the center: an orchestrator that runs the show, routes requests, and makes sure the entire system doesn’t fall into chaos.
This assumption may be wrong, and the cost of carrying it can be measured in dollars and coordination delay. A new Stanford framework called the Decentralized Language Model, or DeLM, is based on the premise that agents can communicate each update directly without routing it through a central controller.
DeLM’s shared knowledge base acts as a “common communication substrate” so that agents can build on each other’s validated progress without having to route each interaction through a master agent to “aggregate, filter, and rebroadcast,” explain Yuzhen Mao and Azalia Mirhoseini, co-developers of the framework, in a research paper.
This is not only possible, but also a desirable system in certain cases. “Agents can build on previous findings, avoid repeat failures, maintain constraints, and recover detailed evidence only when necessary.”
Problems of traditional multi-agent systems
In a typical centralized multi-agent system, a master agent divides tasks into subtasks, assigns them to several subagents in parallel, waits for responses, aggregates and summarizes intermediate progress, and then initiates the next wave of orders based on the collected context.
While this is a natural way to extend LLM reasoning, the Stanford researchers argue that it scales poorly. Every useful finding, partial finding, and failure must be reported to the parent agent, which then determines which information to aggregate and rebroadcast to its subordinate agents.
“As the number of subtasks increases, this controller becomes a communication and integration bottleneck,” Mao and Mirhoseini write. In addition, the underlying orchestration can “dilute, omit, or distort” useful information, resulting in a loss of progress.
This bottleneck also occurs in long-context reasoning scenarios. After receiving reports from subagents, the master agent typically groups related concepts, data points, and other material in an unsupervised fashion. It must then pre-assign these “groups of evidence” to subagents before knowing which surfaced material is actually relevant or whether it has been grouped properly.
When the subagent receives this insufficient context, it will essentially become confused and return to the parent agent, triggering another search or delegation. “This makes back-and-forth coordination slower, more iterative, and increasingly constrained by the single-loaded principal agent,” the researchers write.
What does DeLM address and how does it work?
DeLM, in contrast, is built around parallel agents, shared context, and task queuing.
Shared context is essentially a curated repository of “goals” or information summaries that other agents may find useful. These include validated and evidence-based findings alongside partial findings and documented failures; they also point to the detailed evidence that agents can obtain based on their specific tasks.
A task queue is a set of pending subtasks that agents can claim independently.
“Agents write compact, confirmed updates to a shared context that agents can later read directly,” the researchers write. Useful findings, failures, and limitations are collected as “common problem states” instead of being routed through a central controller.
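The two structures described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the names `Entry`, `SharedContext`, and `TaskQueue`, the three entry kinds, and the `evidence_ref` field are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """One compact update in the shared context (hypothetical schema)."""
    kind: str          # "finding", "partial", or "failure"
    summary: str       # short text other agents read by default
    evidence_ref: str  # pointer to detailed evidence, fetched only on demand


@dataclass
class SharedContext:
    entries: List[Entry] = field(default_factory=list)

    def write(self, entry: Entry) -> None:
        self.entries.append(entry)

    def read(self, kind: Optional[str] = None) -> List[Entry]:
        # Agents read the context directly; no controller filters it first.
        return [e for e in self.entries if kind is None or e.kind == kind]


class TaskQueue:
    def __init__(self, tasks):
        self._pending = deque(tasks)

    def claim(self):
        """Return the next pending subtask, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


ctx = SharedContext()
tasks = TaskQueue(["inspect parser", "inspect cache"])

task = tasks.claim()
ctx.write(Entry("failure", f"{task}: null check did not fix the bug", "trace-001"))
failures = ctx.read("failure")
```

Any agent that later calls `ctx.read("failure")` sees the dead end without a controller having to rebroadcast it.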
The pipeline looks like this:
- Initialization: Entries are divided into different units of work and added to the queue.
- Parallel execution: Agents work independently and in tandem, executing tasks and reading shared context as they go.
- Compression and verification: Conclusions are compressed into reusable “points” that are checked against supporting evidence. Only fully verified key information is shared with the group.
- Additional work (if necessary): When the queue is empty, the last agent to return a response checks the entire shared context to determine if any additional work is required.
- Last step: The final agent determines that no further steps are required and returns a final response.
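The steps above can be sketched as a small threaded loop. This is a simplified illustration under stated assumptions, not the DeLM code: `run_pipeline`, `solve`, `verify`, and `follow_up` are hypothetical names, the "agents" are plain functions rather than LLM calls, and the additional-work check runs in the calling thread rather than in the last agent to finish.

```python
import queue
import threading


def run_pipeline(inputs, solve, verify, follow_up, num_agents=3, max_rounds=5):
    tasks = queue.Queue()
    for item in inputs:                      # 1. Initialization
        tasks.put(item)

    shared = []                              # shared context of verified points
    lock = threading.Lock()

    def agent():
        while True:
            try:
                task = tasks.get_nowait()    # claim a ready task independently
            except queue.Empty:
                return
            with lock:
                seen = list(shared)          # read shared context directly
            point = solve(task, seen)        # 2. Parallel execution
            if verify(task, point):          # 3. Compression and verification
                with lock:
                    shared.append(point)     # only verified points are shared

    for _ in range(max_rounds):
        workers = [threading.Thread(target=agent) for _ in range(num_agents)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        extra = follow_up(sorted(shared))    # 4. Additional work, if necessary
        if not extra:
            break                            # 5. Last step: nothing left to do
        for item in extra:
            tasks.put(item)

    return sorted(shared)


result = run_pipeline(
    inputs=["a.py", "b.py", "c.py"],
    solve=lambda task, seen: f"checked {task}",
    verify=lambda task, point: task != "b.py",   # pretend b.py fails verification
    follow_up=lambda shared: [] if "checked d.py" in shared else ["d.py"],
)
```

The result is sorted so that it does not depend on thread scheduling. The point for `b.py` never reaches the shared context because it fails the (toy) verification step.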
Agents “exchange progress via shared state, asynchronously request ready tasks, and scale more adaptively as the number of subtasks increases,” the researchers explain.
How DeLM works in practice
With DeLM, agents can avoid redundant exploration, reuse and build on each other’s discoveries and failures, and focus on unresolved issues.
The framework can be particularly useful for test-time scaling in software engineering, where models are given extra time to “think” to improve their reasoning and problem-solving abilities. Different agents can explore their own hypotheses or pursue reasoning paths in parallel while sharing intermediate progress. One example is concurrent debugging.
DeLM is also suitable for long-context reasoning and multi-document question answering; agents can simultaneously inspect their own sets of evidence (sets of targets, code, or other materials) while maintaining a “global compact view” of collected evidence.
Researchers claim this makes agent tasks more accurate and significantly cheaper. Benchmark results support this: on SWE-bench Verified, which evaluates how well AI models and agents solve real-world software engineering problems, DeLM outperformed the strongest baseline by 10.5% and reduced the cost per task by nearly 50%.
But it can go beyond coding: In LongBench‑v2 Multi‑Doc QA—which assesses LLMs’ ability to solve long-context, real-world problems—DeLM had the highest accuracy across four model families, including GPT‑5.4, Claude Sonnet, Gemini Flash, and DeepSeek‑Pro.V4.
DeLM outperforms other approaches on SWE-bench for a number of reasons, as Mao detailed on X.
First, agents share failures. In conventional parallel runs, when one agent goes down the wrong path, that failure remains private, and subsequent agents can waste time (and money) chasing the same dead end. But with DeLM, failed assumptions are written to the shared context.
“Agents can then read them as constraints, avoid re-exploration, and redirect their search to more promising fixes,” Mao said.
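A toy comparison shows why shared failures save work. The sketch below is an assumption-laden illustration, not a reproduction of the benchmark: `run_agents` is a hypothetical helper, the candidate fixes are invented, and agents run one after another and all try candidates in the same order.

```python
def run_agents(num_agents, candidates, is_fix, share_failures):
    """Count total fix attempts across agents, with or without shared failures."""
    shared_failures = set()
    attempts = 0
    for _ in range(num_agents):
        private_failures = set()
        # With sharing, every agent reads and writes the same failure set.
        known = shared_failures if share_failures else private_failures
        for candidate in candidates:
            if candidate in known:
                continue              # treat a recorded failure as a constraint
            attempts += 1
            if is_fix(candidate):
                break
            known.add(candidate)      # record the dead end
    return attempts


candidates = ["add null check", "bump timeout", "fix off-by-one"]
is_fix = lambda c: c == "fix off-by-one"

private_attempts = run_agents(3, candidates, is_fix, share_failures=False)
shared_attempts = run_agents(3, candidates, is_fix, share_failures=True)
```

In this toy setup, three agents with private failures make nine attempts in total, while agents that read shared failures make five, because the second and third agents skip the two known dead ends.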
In addition, constraints are added to the agents’ shared context immediately after validation, which means they become binding shared state. “Agents then inherit them, build around them, and avoid repeating invalid simplifications globally,” Mao said.
Importantly, DeLM keeps shared progress compact enough to be reused. It is expandable, meaning agents see brief content by default but can choose to expand it into more detailed summaries and raw evidence.
As the researchers note, providing all raw documents and traces provides agents with maximum information, but it can overwhelm their context windows and ultimately increase costs.
“If agents shared full traces, each worker would have to read long command histories, file dumps, failed edits, and intermediate considerations, making coordination itself a bottleneck in another long context,” Mao said.
On the other hand, while it is cheaper to share concise summaries, important details and evidence may be lost, resulting in a less reliable rationale.
Disclosure therefore allows access from coarse to fine detail. This can improve accuracy while lowering cost.
Finally, with a framework like DeLM, agents can be more efficient because they avoid reading the same documents multiple times or repeating the same failed analysis; more effective because useful findings propagate between parallel threads; and more reliable because they only share verified claims.
For enterprise builders, DeLM challenges a fundamental assumption: that every multi-agent workflow needs a central controller. The SWE-bench and LongBench-v2 results show that the decentralized model is not just theoretically cleaner; it’s faster, more accurate, and about half the cost.
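The brief-by-default, expand-on-demand access described above can be sketched as follows. The three levels, the `Point` class, and the use of character counts as a stand-in for token cost are assumptions made for illustration; the article does not specify how DeLM stores or prices each level.

```python
from dataclasses import dataclass

LEVELS = ("brief", "summary", "evidence")  # coarse to fine (hypothetical levels)


@dataclass
class Point:
    brief: str      # what every agent sees by default
    summary: str    # more detail, opened on request
    evidence: str   # raw traces or documents, opened only when needed


def view(point, level="brief"):
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    return getattr(point, level)


def reading_cost(points, expanded=None):
    """Characters read when only the points listed in `expanded` are opened.

    `expanded` maps a point's index to the level requested for it.
    """
    expanded = expanded or {}
    return sum(len(view(p, expanded.get(i, "brief"))) for i, p in enumerate(points))


points = [
    Point("parser bug in tokenizer", "s" * 40, "e" * 400),
    Point("cache is not the cause", "s" * 40, "e" * 400),
]

default_cost = reading_cost(points)                                # briefs only
targeted_cost = reading_cost(points, {0: "evidence"})              # open one point
full_cost = reading_cost(points, {0: "evidence", 1: "evidence"})   # open everything
```

An agent working on the tokenizer pays for full evidence on the one point it needs and reads only the brief for the rest, which sits between the cheap-but-lossy and complete-but-expensive extremes.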





