Claude Code’s “/goals” separates the agent that executes from the agent that decides when it’s done



A code migration agent finishes and the pipeline appears green. But several pieces were never compiled, and it took days to catch the problem. This is not a model failure; it is an agent deciding it is done before it actually is.

Many enterprises are now finding that production AI agent pipelines fail not because of the models’ abilities, but because the model behind the agent decides to stop too early. LangChain, Google, and OpenAI each offer ways to keep tasks from ending prematurely, although they often rely on separate scoring systems. The newest approach comes from Anthropic: /goals in Claude Code, which formally separates performing a task from evaluating it.

Coding agents work in a loop: they read files, execute commands, edit the code, and then check whether the task has been completed.

Claude Code’s /goals adds a second layer to that loop. Once the user sets a goal, Claude keeps iterating, but after each step an evaluator model steps in to review the work and decide whether the goal has been achieved.
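In rough terms, the pattern looks like the sketch below. This is a conceptual illustration only, not Anthropic’s implementation: run_agent_step and evaluate_goal are hypothetical stand-ins for the worker and evaluator model calls, and the message format is invented for the example.

```python
from dataclasses import dataclass

# Conceptual sketch of a worker loop gated by a separate evaluator model.
# Nothing here reflects Claude Code's actual internals.

@dataclass
class StepResult:
    summary: str
    claims_done: bool

@dataclass
class Verdict:
    met: bool
    evidence: str = ""
    reason: str = ""

def run_agent_step(transcript: list[dict]) -> StepResult:
    """Placeholder for the worker model: read files, run commands, edit code."""
    return StepResult(summary="(one worker step)", claims_done=True)

def evaluate_goal(goal: str, transcript: list[dict]) -> Verdict:
    """Placeholder for a smaller evaluator model that only answers met / not met."""
    return Verdict(met=False, reason="tests have not been run yet")

def run_goal_loop(goal: str, max_steps: int = 50) -> list[dict]:
    transcript: list[dict] = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = run_agent_step(transcript)
        transcript.append({"role": "assistant", "content": step.summary})
        # The evaluator only runs when the worker believes it is finished.
        if step.claims_done:
            verdict = evaluate_goal(goal, transcript)
            if verdict.met:
                transcript.append({"role": "system", "content": f"Goal met: {verdict.evidence}"})
                break
            # Push the objection back into the conversation and keep working.
            transcript.append({"role": "system", "content": f"Goal not met: {verdict.reason}"})
    return transcript
```

The key design point is that the worker’s own “I’m done” claim is never sufficient on its own; it only triggers the check.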

Separating the two models

All three vendors’ orchestration platforms have identified the same obstacle, but their approaches differ. OpenAI leaves the loop alone and lets the model decide when it is done, though users can wire in their own evaluators. Standalone evaluation is possible in LangGraph and Google’s Agent Development Kit, but developers must define a critic node, write termination logic, and configure observability themselves.
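For a sense of what that do-it-yourself wiring involves, here is a minimal LangGraph-style sketch: a stubbed worker node, a stubbed critic node, and a conditional edge that either loops back or terminates. The node bodies are placeholders; in a real graph each would call a model, and the critic would inspect actual build or test output.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    work_log: list[str]
    done: bool

def worker(state: AgentState) -> AgentState:
    # Placeholder for the model doing the work.
    state["work_log"].append("attempted one step of the task")
    return state

def critic(state: AgentState) -> AgentState:
    # Placeholder termination logic; replace with a real check,
    # e.g. run the test suite and read the exit code.
    state["done"] = len(state["work_log"]) >= 3
    return state

def should_continue(state: AgentState) -> str:
    return "finish" if state["done"] else "loop"

graph = StateGraph(AgentState)
graph.add_node("worker", worker)
graph.add_node("critic", critic)
graph.set_entry_point("worker")
graph.add_edge("worker", "critic")
graph.add_conditional_edges("critic", should_continue, {"loop": "worker", "finish": END})

app = graph.compile()
result = app.invoke({"task": "migrate module", "work_log": [], "done": False})
```

The pattern is the same as /goals; the difference is that every piece of it is the developer’s responsibility to build and maintain.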

Claude Code’s /goals makes the independent evaluator the default, regardless of how long the user wants the agent to run. The developer states the completion condition in plain language, for example: /goal all tests in test/auth pass and the lint step is clean. Claude Code then runs, and the evaluation model, which defaults to Haiku, checks that condition each time the agent tries to finish. If the condition is not met, the agent keeps working. If it is met, the evaluator inserts the verified condition into the agent’s conversation transcript and clears the goal. Because the evaluator only ever makes one of two decisions, the smaller Haiku model is well suited to the job.
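One reason a small model is enough is that the evaluation collapses to a single binary question rather than redoing the work. The sketch below illustrates that idea; the prompt wording and parsing are assumptions made for illustration, not Anthropic’s actual evaluator.

```python
# Illustrative only: a binary "is the goal met?" check that a small model can answer.

def build_evaluator_prompt(goal: str, transcript_tail: str) -> str:
    return (
        "You are checking whether a coding agent has met its goal.\n"
        f"Goal: {goal}\n"
        f"Recent agent output:\n{transcript_tail}\n"
        "Answer with exactly one word: MET or NOT_MET."
    )

def parse_verdict(model_reply: str) -> bool:
    # Only two outcomes exist, so parsing stays trivial and cheap to run at every stop attempt.
    return model_reply.strip().upper().startswith("MET")
```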

Claude Code makes this possible by separating the model that attempts to complete the task from the evaluator model that verifies the task is actually complete. This prevents the agent from confusing what has already been done with what still needs to be done. With this method, Anthropic noted, there is no need for a third-party monitoring platform (although businesses are free to keep using one alongside Claude Code), no need for a dedicated log, and less reliance on post-mortem reconstruction.

Competitors such as Google’s ADK support similar evaluation patterns. Google ADK offers a LoopAgent, but developers must architect the evaluation and termination logic themselves.

In its documentation, Anthropic said the most successful goal conditions typically have three parts (see the sketch after this list):

  • One measurable end state: test result, build exit code, file count, empty queue

  • Specified validation: how Claude should prove it, e.g. “npm test exits 0” or “git status is clean”

  • Important restrictions: anything that should not change along the way, e.g. “no test files changed”
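Those three ingredients map directly onto checks that can be verified deterministically. The sketch below shows the kind of verification the examples above imply; the check_goal helper and the HEAD~1 diff are illustrative assumptions, not part of Claude Code.

```python
import subprocess

def check_goal() -> bool:
    # Measurable end state with specified validation: the test suite exits 0.
    tests = subprocess.run(["npm", "test"], capture_output=True, text=True)
    if tests.returncode != 0:
        return False

    # Specified validation: the working tree is clean.
    status = subprocess.run(["git", "status", "--porcelain"],
                            capture_output=True, text=True)
    if status.stdout.strip():
        return False

    # Important restriction: no test files were modified along the way
    # (here, checked against the previous commit for illustration).
    diff = subprocess.run(["git", "diff", "--name-only", "HEAD~1"],
                          capture_output=True, text=True)
    changed = diff.stdout.splitlines()
    return not any(path.startswith("test/") for path in changed)
```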

Reliability in the loop

For businesses that already manage expanding tool stacks, the appeal is a native evaluator that doesn’t add another system to maintain.

It’s part of a broader trend in the agentic space as long-running and self-learning agents become more of a reality. Evaluator models, verification systems, and other independent judgment layers are beginning to appear in reasoning systems and, in some cases, in coding agents such as Devin or SWE-agent.

Sean Brownell, Sprinklr’s director of solutions, told VentureBeat in an email that there is interest in this kind of loop where the worker and the referee are separate, but he feels there is nothing unique about Anthropic’s approach.

"Yes, the loop works. Separating the builder and the judge is good design, because in principle, you can’t rely on the model to judge its own homework. The model doing the work is the worst judge of whether the work will be done or not." Brownell said. "That being said, Anthropic isn’t the first to hit the market. The most interesting story here is that two of the world’s largest artificial intelligence labs sent the same command a few days apart, but each of them came to completely different conclusions about who to declare “done”."

Brownell said the loop works best "for deterministic work with a verifiable end state such as migration, fixing broken test suites, eliminating backlog," but for more nuanced tasks or those that require design judgment, it’s more important to have a person make that decision.

Bringing this evaluator/worker division down to the agent-loop level shows that companies like Anthropic are moving agents and orchestration toward more auditable, observable systems.


