
Enterprise teams keep watching the same thing happen. An AI agent demos beautifully, goes into production and stalls: it works for a short time, then needs a human to refresh its context and inspect its output, and the promised efficiency drains into supervision. The agent did the job; you watched. This is one reason why many agent pilots never make it to production systems.
The pitch on the other side of that wall is the one every team wants to believe in: an agent that handles a long case single-handedly, overnight if necessary, leaving the human only to approve the last 10%. Whether that is possible turns on a problem the orchestration conversation mostly skips. When AI firm Chroma tested 18 leading models, each lost accuracy as input length increased. That is a feature of how attention works, not a gap that a stronger model closes. An agent that is fed more and more of your work does not stabilize. It gets shakier.
This is the layer below the orchestration contest. Routing, continuous execution, and observability assume that each agent is competent enough to coordinate in the first place. The deeper question is how long an agent can work before a person has to step in, and that depends on where your company’s knowledge resides relative to the model. Both standard fixes leave people in the loop.
Why teaching a model your business keeps you in the loop
Frontier models keep getting more capable, yet the gap is not closing, because it is not a capability issue. It is about getting your knowledge into the model, and businesses have two ways to put it there.
The first is fine-tuning, which bakes knowledge into weights. It is subject to catastrophic forgetting, a problem identified in the 1980s and still not resolved in 2026: teaching a model something new erodes what it already knows. Teams work around it by isolating each task in its own fine-tuned model or adapter, which produces a large set of models and raises both cost and management overhead. And the fine-tuned model becomes an outdated snapshot the day policy changes, when the expensive, slow retraining cycle begins.
The second is in-context learning, which skips retraining by inserting the relevant policies into the prompt at runtime. This is where context rot bites. Retrieval narrows down what goes into the prompt, but a retrieval miss looks the same as a confident answer, and both cost and latency increase with each token added.
The two failures rhyme. With fine-tuning, the model can work confidently from last quarter’s policy. With in-context learning, it can work confidently past a detail it lost in the middle of a lengthy prompt. Either way the output looks equally reliable, so you can’t tell which parts are wrong without checking them all. That is why the human can never leave. Teams often use both at the same time, fine-tuning the stable knowledge and retrieving the rest. This mitigates each failure but eliminates neither: for any given output, you still can’t be sure the model is working from context that is both current and correct, so you check it.
The third way: generate an expert model on demand
The third approach is in transition from research to early product. Instead of retraining a model or stuffing its prompt, a generator builds a small, task-specific model on demand from your policies. The generator is a hypernetwork: a network whose output is the weights of another network.
The idea was named in 2016; its application to generating expert language models from text or documents is new and active. Sakana AI’s Text-to-LoRA, presented at ICML 2025, creates a model adapter from a plain-language description in a single pass, and a 2026 system called SHINE calls hypernetwork-generated adapters a promising new frontier precisely because they remove both the retraining cost of fine-tuning and the context constraints of prompting.
The point is to generate adapters rather than train and maintain them: a large library of per-task LoRAs is distilled into a network that can produce them on demand, including for tasks it has not seen.
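The mechanism can be made concrete with a toy sketch. The code below is not Text-to-LoRA, SHINE or any vendor’s generator; it is a minimal illustration, assuming a single linear layer as the hypernetwork and hand-written task embeddings. The names `TinyHypernetwork`, `generate` and `adapted_forward` are hypothetical. It shows only the core idea: one network emits the low-rank adapter weights that modify a frozen base layer.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reshape(flat, rows, cols):
    """Cut a flat list into a rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


class TinyHypernetwork:
    """Maps a task embedding to the weights of a low-rank adapter (A, B).

    Assumption: one untrained linear layer stands in for a real generator.
    """

    def __init__(self, embed_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        n_params = rank * d_in + d_out * rank  # size of A plus size of B
        self.proj = [[rng.gauss(0, 0.1) for _ in range(embed_dim)]
                     for _ in range(n_params)]

    def generate(self, task_embedding):
        """Output of this network IS the adapter's weights."""
        flat = matvec(self.proj, task_embedding)
        split = self.rank * self.d_in
        a = reshape(flat[:split], self.rank, self.d_in)    # rank x d_in
        b = reshape(flat[split:], self.d_out, self.rank)   # d_out x rank
        return a, b


def adapted_forward(base_w, adapter, x):
    """Apply the frozen base layer plus the generated update: (W + B A) x."""
    a, b = adapter
    delta = matmul(b, a)  # d_out x d_in low-rank update
    w = [[bw + dw for bw, dw in zip(brow, drow)]
         for brow, drow in zip(base_w, delta)]
    return matvec(w, x)


# Hypothetical usage: one frozen layer, two task embeddings, two adapters.
base_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # d_out=2, d_in=3, never retrained
hyper = TinyHypernetwork(embed_dim=4, d_in=3, d_out=2, rank=1)
audit_adapter = hyper.generate([1.0, 0.0, 0.0, 0.0])
risk_adapter = hyper.generate([0.0, 1.0, 0.0, 0.0])
x = [1.0, 2.0, 3.0]
y_audit = adapted_forward(base_w, audit_adapter, x)
y_risk = adapted_forward(base_w, risk_adapter, x)
```

The base weights never change, so nothing is forgotten; a new policy only requires a new embedding and another call to `generate`. In a real system the generator would be trained and the embedding would come from the policy text, neither of which this sketch attempts.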
The neat part is how it closes the loop on the earlier problem: the per-task adapter sets that teams build to avoid catastrophic forgetting are the same objects the hypernetwork creates automatically. The model zoo ceases to be a management headache and becomes a generated product.
Underneath it all, the case for going small was made most directly in a 2025 paper by Nvidia researchers: for the narrow, repetitive tasks that populate agent workflows, small models are sufficiently capable and 10 to 30 times cheaper than frontier generalists. Nace.AI, a Palo Alto company with a $21.5 million seed round in May, is the clearest commercial example. Its core technology, a generator it calls MetaModel, produces parameter adaptations for a model at inference time from the company policies that govern regulated work: audit, compliance, risk assessment. The company says its agents handle most of the workflow while human experts validate the output, a split it markets as 90/10.
A comparison of three approaches
| | Fine-tuning | In-context / RAG | Hypernetwork-generated model |
| --- | --- | --- | --- |
| Where business knowledge lives | In the weights of the model | In the prompt, re-supplied every run | In weights generated on demand |
| Cost to update on policy change | High: retrain | Low: edit the source | Low: regenerate |
| Staleness | High: snapshot | Low | Low: regenerated from current policy |
| Cost per call and latency | Low | High, grows with context | Low at runtime |
| Dominant failure mode | Forgetting; model zoo | Context rot; silent retrieval misses | Generator quality; calibration |
| Who owns the improving asset? | Whoever trains the model | Whoever holds the data store | Depends on where the generator and feedback live |
Why does a hypernetwork-generated model raise the bar for autonomy?
A narrow, current and small model has a smaller surface on which to go wrong. Fewer errors, confined to a known domain, means fewer escalations for the agent to hand to a person, which is the real basis for any high-autonomy claim. This is also where a number like 90/10 comes from: it is not a preset quota but a result of how little the system hands back. Reported autonomy ratios are best read as measurements of the architecture rather than as parameters.
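A short sketch shows what it means for the ratio to be an outcome rather than a setting. Everything here is an assumption for illustration: the `StepResult` record, the availability of a per-step confidence score, the `in_domain` flag and the 0.8 threshold are hypothetical, not any vendor’s escalation logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    step_id: str
    confidence: float  # assumed: a self-reported score in [0, 1]
    in_domain: bool    # assumed: whether the step is inside the model's known domain


def needs_human(step: StepResult, min_confidence: float = 0.8) -> bool:
    """Hand the step to a person if it is out of domain or low confidence."""
    return (not step.in_domain) or step.confidence < min_confidence


def autonomy_share(steps: List[StepResult], min_confidence: float = 0.8) -> float:
    """Fraction of steps the agent finished without handing anything back."""
    if not steps:
        return 0.0
    handled = sum(1 for s in steps if not needs_human(s, min_confidence))
    return handled / len(steps)


# Hypothetical run: nine routine steps and one the model is unsure about.
run = [StepResult(f"step-{i}", 0.95, True) for i in range(9)]
run.append(StepResult("step-9", 0.40, True))
share = autonomy_share(run)  # measured after the fact, never configured
```

Nothing in the code sets a target ratio. The share moves only when the model errs less or the escalation rule changes, which is why the number is only as trustworthy as the confidence signal behind it.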
Two design choices decide whether this autonomy is reliable or just fast. The first is justification: linking each output to its source so that the reviewer can verify instead of redoing. This is exactly what research models such as HalluGuard are built for: they label each claim as supported or unsupported and cite the passage they rely on. Nace ships its agents with reasoning models and reasoning traces for the same reason. A 10% review only means something if a person can confirm provenance in seconds.
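The shape of such an output can be sketched as a data structure. This is not HalluGuard’s interface or Nace’s trace format; `Claim`, `Verdict`, `check_claims` and `review_queue` are hypothetical names, the policy text is invented, and the substring match is a deliberately naive stand-in for a real verifier model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Claim:
    text: str
    cited_passage_id: Optional[str]  # None when the agent cites nothing


@dataclass
class Verdict:
    claim: Claim
    supported: bool
    evidence: Optional[str]  # cited passage text, shown next to the claim


def check_claims(claims: List[Claim], passages: Dict[str, str]) -> List[Verdict]:
    """Label each claim as supported or unsupported against its cited passage."""
    verdicts = []
    for claim in claims:
        passage = passages.get(claim.cited_passage_id) if claim.cited_passage_id else None
        # Stand-in check (assumption): case-insensitive substring match.
        supported = passage is not None and claim.text.lower() in passage.lower()
        verdicts.append(Verdict(claim, supported, passage))
    return verdicts


def review_queue(verdicts: List[Verdict]) -> List[Verdict]:
    """Only unsupported claims need a person's full attention."""
    return [v for v in verdicts if not v.supported]


# Hypothetical policy passage and agent claims.
passages = {"policy-4.2": "Expenses above 500 USD require two approvals."}
claims = [
    Claim("expenses above 500 USD require two approvals", "policy-4.2"),
    Claim("travel must be booked 14 days ahead", None),
]
verdicts = check_claims(claims, passages)
to_review = review_queue(verdicts)
```

The reviewer sees the claim, the label and the passage side by side, so confirming a supported claim takes a glance and the queue holds only what lacks a source.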
The second is the feedback loop, and it raises the question every buyer should ask: when your experts approve the output, whose model improves, and where does it live? This decides whether the compounding asset belongs to the vendor or to you. Setups differ. For example, Nace uses an external network of certified experts for some tasks and the client’s own staff for direct enterprise deployments, where the model is stored in the client’s cloud. Each choice takes the learning and the ownership to a different place.
Where the third way runs into limits
The approach is still early, and several questions will decide how far it goes. Calibration is the linchpin: the value rests on a model that knows when it is uncertain. And that is itself uncertain, as recent work on generating these adapters has found that they do not automatically improve calibration over conventional fine-tuning, with gains appearing only under specific conditions.
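Calibration can be measured before any autonomy claim is trusted. The sketch below computes expected calibration error, a common metric that compares stated confidence with observed accuracy in equal-width bins. It is offered as one reasonable check, not as the method used in the work mentioned above, and the function name and sample numbers are illustrative.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy across bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative data: the same stated confidence, two different track records.
stated = [0.9] * 10
well_calibrated = expected_calibration_error(stated, [True] * 9 + [False])
overconfident = expected_calibration_error(stated, [True] * 5 + [False] * 5)
```

A model that says 90% and is right nine times in ten scores near zero; one that says 90% and is right half the time scores about 0.4, and its confidence cannot be used to decide what to skip in review.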
The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is an open research frontier: so far the hypernetworks shown in published work have been small. Here is where Nace’s work gets interesting: in our interview, the company said it has pushed its generator well beyond the scale of published work, has tracked how performance grew, has begun sharing results publicly, and is now preparing a scaling law for peer review. If it holds up, it would help answer one of the open questions in the field, and it is a paper worth watching for.
Whichever approach wins, the work still ends up with a human, and that handoff is a design problem of its own. When Deloitte Australia delivered a government report worth about 440,000 Australian dollars, it shipped with fabricated citations and a fabricated court quotation after senior review, because reviewers were examining plausible-looking conclusions, not provenance. Controlled research suggests the pattern is general: experts challenged the same flawed recommendation even less often when it was labeled as coming from artificial intelligence.
Article 14 of the EU AI Act now names this automation bias. The lesson is not about any one vendor: a high degree of autonomy concentrates human attention on a thin, late slice of the work, so the value of that review depends entirely on the reviewer’s ability to check provenance quickly, which brings it back to justification.
What to build and what to ask before buying
The honest take: it is usually not orchestration or model size that holds your agents back, it is whether the model knows your job well enough to be left alone, and the right fix depends on the job. To automate a long, repetitive, high-volume process end to end, such as running most of your internal audits overnight with your own experts checking the final slice, a hypernetwork-generated model is the approach most likely to be cheap enough and to run long enough. For a short task that finishes in a few steps and never needs to run unattended, the gap between this and a well-prompted frontier model narrows almost to nothing and is not worth the integration cost.
When a vendor pitches autonomous or specialist agents, four questions cut through it.
- Where does the business knowledge live: in weights, in the prompt, or generated on demand?
- What does each output come with so the reviewer can verify it instead of redoing it?
- What decides which work is escalated to a person?
- And whose model improves from that feedback, and where does it run?
The answers, not the headline ratio, tell you what you’re getting.
The hypernetwork approach is the most credible attempt to make a small model know a particular business without forgetting and without having it explained again every time. It is also the least proven, and the parts that matter most, calibration and scale, are still under review. Pilot it now on the right kind of work. On the wrong kind, the integration cost buys you less than a well-prompted frontier model would.





