Agentic AI is a big thing right now, with names like OpenClaw and NemoClaw filling the column inches as commentators tell you to embrace it or stay away. Agents can organize your computer or clear your inbox, and Microsoft wants to put them in everything Windows.
The thing is, like the capabilities of advanced LLMs such as Claude, the use cases for them are growing faster than we can write about them. Six months haven't passed since orchestrators for Claude and other LLMs started to take over GitHub, but that's a long time on the frontier-model clock, and LLMs have improved dramatically since.
How good is it? Well, a new paper tested designed orchestrators against self-organizing LLMs, and the creators of those GitHub projects won't like the results. Or maybe they will, because it partially proves the value of pre-designed hierarchies, but only if LLMs can organize themselves within that structure. TL;DR? LLM agents are much more capable than we think, and need only subtle coaxing to deliver their best results when a challenge arises.
There is a problem with the multi-agent architecture
When an output becomes an input, inaccuracies multiply
Building any system at scale is difficult, because you need to ensure that the data is reliable no matter where it comes from or where it goes. We all know that AI agents can hallucinate, lie, make things up, or be inaccurate in some of their results while pretending to provide valuable feedback.
With multi-agent orchestrations this problem is compounded, as each agent's inaccuracies feed into the next agent's input. How much? Well, Google's DeepMind tested this in 2025, with 180 configurations across five agent architectures and three major LLMs. The result? Unstructured multi-agent networks increase errors by a factor of 17.2 compared to single-agent baselines.
Seventeen times worse. At that point, you could place the potential outcomes on a spinning dartboard and pick one with a single dart thrown by a blindfolded player, and you'd get a more accurate answer.
The study also showed that performance gains did not scale beyond four agents; past that point, the coordination burden ate away any benefits. This is quite different from what industry players I know are doing, using anywhere from 6 to 20 agents simultaneously to complete complex tasks. But even if each individual agent can reach 99% reliability, compound math is compound math, and that 1% matters far more across 20 agents than across one.
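The compounding effect is easy to sketch. Assuming each agent's output feeds the next agent's input and the chain only succeeds when every agent succeeds, end-to-end reliability is just per-agent reliability raised to the number of agents:

```python
# Sketch: how per-agent reliability compounds across an agent chain.
# Assumes failures are independent and any single failure breaks the run.

def chain_reliability(per_agent: float, n_agents: int) -> float:
    """Probability that an n-agent chain produces no errors."""
    return per_agent ** n_agents

for n in (1, 4, 20):
    print(f"{n:>2} agents @ 99% each -> {chain_reliability(0.99, n):.1%} end-to-end")
# ->  1 agents @ 99% each -> 99.0% end-to-end
# ->  4 agents @ 99% each -> 96.1% end-to-end
# -> 20 agents @ 99% each -> 81.8% end-to-end
```

At 20 agents, a seemingly excellent 99% per-agent reliability leaves you with roughly a one-in-five chance of a flawed run, before accounting for errors that amplify rather than merely accumulate.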
Research lags behind practice
As we have seen recently with OpenClaw, building something is faster than making it safe. Now the AI models are building themselves, and research on their interactions can't begin until the models are made available to researchers. It's the same delay you face when building tools for AI: you hope the models won't have outgrown your tools by the time you release them. And with multi-agent tools, that time has come.
A recent paper challenges the multi-agent hierarchy
Self-organizing LLMs have outperformed designed structures
Well, the paper in question is "Abandon hierarchy and roles: how self-organizing LLM agents outperform designed structures," and it's interesting not just for the results, but for how comprehensively it tests its hypothesis. The authors ran 25,000 tasks across eight LLM models, with four to 256 agents and eight coordination protocols. The results showed the best improvement with a hybrid approach, where the rough structure was mapped out but individual agents could organize themselves into their roles.
The practical implication: give agents a mission, a protocol, and a skill model rather than a predefined role.
Now, this is not to say there is no value in the coordinator model; obviously there is, otherwise the hybrid models wouldn't win. It is similar to prompt engineering, but applied at scale: giving autonomous agents a mission to achieve, a protocol to follow, and an appropriate model to use is no different from prompting a chatbot, except that no human intervention is required afterwards.
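The contrast can be sketched as two agent configurations. This is purely illustrative; the field names and values below are my own invention, not the paper's schema:

```python
# Hypothetical sketch of the contrast the paper draws. All field names
# and values are illustrative assumptions, not the paper's actual schema.

# A designed-hierarchy setup pins each agent to a fixed role and a fixed
# position in the reporting structure:
role_based_agent = {
    "role": "code_reviewer",               # predefined responsibility
    "reports_to": "orchestrator",          # fixed place in the hierarchy
    "allowed_actions": ["review", "comment"],
}

# A self-organizing setup gives agents a shared mission, a coordination
# protocol, and a capable model, then lets them divide the labor:
self_organizing_agent = {
    "mission": "ship a reviewed, tested patch",  # shared goal
    "protocol": "broadcast-claim",               # how agents coordinate
    "model": "claude-sonnet-4.6",                # skill ceiling, not a role
}
```

The design difference is what gets fixed up front: the first config constrains *what each agent is*, while the second constrains only *what the group must achieve* and *how agents talk to each other*.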
But the paper also uncovered other nuggets. Less capable models like GLM-5 worked better with rigid, prescribed roles and an orchestrated hierarchy. Powerful models such as Claude Sonnet 4.6 and DeepSeek v3.2 performed best with minimal instructions, and open-source models came within 95% of the performance of closed-source models, which can cut costs without sacrificing quality.
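That capability finding suggests a simple selection rule: match the coordination style to the model, not the other way around. A minimal sketch, where the capability score and the tier thresholds are my own illustrative assumptions:

```python
# Hypothetical sketch of the paper's capability finding: weaker models
# benefit from rigid roles, stronger ones from minimal instruction.
# The 0.0-1.0 capability score and tier cutoffs are illustrative only.

def coordination_style(model_capability: float) -> str:
    """Pick an orchestration style for a model scored 0.0-1.0."""
    if model_capability < 0.5:
        return "rigid-roles"       # prescribed roles + fixed hierarchy
    if model_capability < 0.8:
        return "hybrid"            # rough structure, self-assigned roles
    return "self-organizing"       # mission + protocol only

print(coordination_style(0.3))  # -> rigid-roles
print(coordination_style(0.9))  # -> self-organizing
```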
Self-organizing agents are far more accurate
Although the hybrid model currently leads, LLM research is developing at an ever-faster pace. Single agents can already spawn sub-agents and organize their own small workforce, and with more capable models there is the potential for agents to work out more of the "how" of each task themselves. While this is impressive, it's also fascinating to watch how computers organize their workflows compared to the organizational charts that humans have tweaked over the years.




