Everything changed when Claude changed: AI blast radius management in production



Our system did one thing, and it did it well: it turned natural language queries into API calls.

Users were analysts, account managers and operations executives. They knew what data they needed, but assembling it manually meant pulling it out of four dashboards, two BI tools, and a Salesforce report builder. With our system, they wrote the query in plain English. A request such as "Generate a city-by-city report on January-March 2026 sales volume for the Northeast region" was converted into an API call that the system could act on:

json

{
  "description": "The user requested the sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "01-01-2026",
    "end_date": "31-03-2026",
    "region": "northeast"
  }
}

The rest of the pipeline was conventional engineering. The system routed the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several native services), applied a JSON query generated by the large language model (LLM) to filter and shape the response, and delivered the result via email, as a Drive document, or rendered as a graph in a browser.

By mid-2025, the system was generating several hundred reports per month. These reports were consumed by management and analysts and distributed to external stakeholders. It had become the standard way most teams retrieved ad-hoc data.

The contract between the LLM and the rest of the system was a JSON object structured as in the example above.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and then to 4.0 without incident. By the time Sonnet 4.5 shipped, we were satisfied with the stability and predictability of LLMs in solving what we believed to be a simple problem. Model upgrades had come to feel as routine as bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of queries, the model started folding the post_body content into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the query payload, and that field came back empty. The API call was made without a date range or region filter. Depending on the specific API called, the backend either returned the sales volume for all time and all regions, or returned a 500 error.

Second, the model began responding with clarifying questions. This was new. Previous versions always made a best-effort interpretation of an underspecified query and returned a structured object. Sonnet 4.5, being more cautious, would sometimes answer with a question instead. Our system had no way to handle that. It was built on the assumption that each model call would result in an API call. There was no human-in-the-loop component and no state to hold a partially specified query. Downstream systems broke in various ways as a result.
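Both failure modes can be detected before anything is routed to a backend. The following is a minimal sketch of such a guard, not the system described in this article: `guard_model_output`, `GuardResult`, and the `REQUIRED_FILTERS` tuple are hypothetical names, and the assumption that every endpoint needs a date range and a region is made only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: filters the sales-volume endpoint cannot safely run without.
REQUIRED_FILTERS = ("start_date", "end_date", "region")


@dataclass
class GuardResult:
    ok: bool
    reason: str
    call: Optional[dict] = None


def guard_model_output(raw: str,
                       required_filters: Sequence[str] = REQUIRED_FILTERS) -> GuardResult:
    """Classify raw model output before it is turned into an API call."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Free text, such as a clarifying question, is not a usable call.
        return GuardResult(False, "not_json")
    if not isinstance(parsed, dict):
        return GuardResult(False, "not_an_object")
    if not isinstance(parsed.get("api_call"), str):
        return GuardResult(False, "missing_api_call")
    body = parsed.get("post_body")
    if not isinstance(body, dict) or not body:
        # The filters never made it into post_body.
        return GuardResult(False, "empty_post_body")
    missing = [key for key in required_filters if not body.get(key)]
    if missing:
        return GuardResult(False, "missing_filters:" + ",".join(missing))
    return GuardResult(True, "ok", parsed)
```

A caller would route the request only when `ok` is true and fail loudly otherwise, so that a missing filter never silently becomes an unfiltered query.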

We rolled back to 4.0. This was more difficult than it should have been: between the 4.0 and 4.5 deployments, our team had added new API integrations that were qualified only against 4.5. Rolling back the model meant each of them had to be requalified against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the impact of a change. When you upgrade a driver or library, you read the release notes to see whether breaking changes are expected. Unit tests pin down the behavior that must not change. All of this relies on one property: the system being modified is deterministic enough that its behavior can be predicted, or at least sampled closely enough to give you confidence. Blast radius is limited by construction.

LLM-backed systems break this assumption. The component that generates your output is not under your control. You cannot diff a model upgrade from 4.0 to 4.5. It is a wholesale replacement of the function your system depends on.

We call this an infinite blast radius: a change whose downstream effects cannot be enumerated in advance, because both the input domain (natural language) and the failure modes (anything the model might do differently) are unbounded.

The anatomy of failure

A postmortem showed that our prompt had never been fully specified. We told the model to return a JSON object with three fields. We described what each field was for. We did not explicitly state that the description must be a natural language string and must not contain serialized contents of other fields.

Previous versions of the model inferred this constraint from context. Sonnet 4.5, aiming to be more "useful" in its formatting choices, decided that asking a clarifying question or including the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. But it violated the assumptions on which our system was built.

The error was not in the model. The mistake was our assumption that the model would keep filling the gaps in our specification the way it always had. Three successful upgrades had trained us to believe that those gaps were safe.

Structured-output modes and tool-use APIs would have caught this particular failure at the schema level. We did not use them, for engineering reasons beyond the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot express that a clarifying question must never reach a system with no clarification path, or that a date range must never silently default to all time. Schemas solve the easier half of the problem.

Evals-first architecture

The discipline that closes this gap is to treat the eval suite, not the prompt, as the formal system specification. The prompt is one implementation of the specification. The model is one interpreter. The evals are the specification itself, and any model or prompt change is valid if and only if it passes them.

In practice, an eval is a triple: an input, a property that the output must satisfy, and an evaluation function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

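python

def test_description_contains_no_serialized_payload(response):
    # Compare case-insensitively so "POST_BODY" or "Curl" are caught too.
    desc = response["description"].lower()
    # Tokens that suggest structured content leaked into the prose field.
    forbidden = ("curl", "post_body", "{", "http://", "https://")
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"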

Several hundred such properties become the gate: some written by hand for known-important invariants, some created as regression tests from real production traffic, and some evaluated by an LLM-as-judge for fuzzy qualities like tone. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before merging.
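The gate itself can be small. Below is a minimal sketch, assuming each eval is stored as the triple described above (an input, a property, and an evaluation function) and that a candidate model plus prompt can be wrapped as a function from a query string to a parsed response. `EvalCase`, `run_gate`, the two check functions, and the two-case `SUITE` are hypothetical names and contents chosen for illustration, not the authors' tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    name: str
    query: str                      # the input
    prop: str                       # the property, stated in words
    check: Callable[[dict], bool]   # the evaluation function


def no_payload_in_description(response: dict) -> bool:
    desc = str(response.get("description", "")).lower()
    return not any(token in desc for token in ("curl", "post_body", "{"))


def filters_present(response: dict) -> bool:
    body = response.get("post_body") or {}
    return isinstance(body, dict) and all(
        body.get(key) for key in ("start_date", "end_date", "region")
    )


QUERY = ("Generate a city-by-city report on January-March 2026 "
         "sales volume for the Northeast region")

SUITE = [
    EvalCase("description_has_no_payload", QUERY,
             "description is plain prose", no_payload_in_description),
    EvalCase("filters_reach_post_body", QUERY,
             "post_body carries the date range and region", filters_present),
]


def run_gate(candidate: Callable[[str], dict],
             suite: List[EvalCase]) -> Tuple[bool, List[str]]:
    """Run every case against the candidate; return (green, failed case names)."""
    failures = []
    for case in suite:
        try:
            passed = case.check(candidate(case.query))
        except Exception:
            # A crash or an unparseable response counts as a failed eval.
            passed = False
        if not passed:
            failures.append(case.name)
    return (not failures, failures)
```

In a merge check, `candidate` would wrap the new model version or prompt, and the change would be blocked whenever the first element of the result is false.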

Evals are expensive to build and maintain. They drift as your product changes. LLM-judged evals introduce their own variance. And the suite can only catch the failure modes you thought to specify; you cannot eval your way to safety against a category of failure you never imagined. We learned this lesson the hard way: no one on our team had ever written an assertion like "the description field should not contain a curl command," because no one had imagined that the model would put one there.

Evals are not a silver bullet. They let you bound the blast radius of a change in the only way available when the underlying function is a black box: by tightly specifying the input/output behavior you really care about, and refusing to ship when that behavior moves.

The road ahead

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what “coverage” means in natural language input spaces. CI/CD systems are not designed to handle probabilistic test results. As agents take on more autonomous work (writing code, transferring money, planning infrastructure changes), the gap between "the model passed our smoke tests" and "we know what this system will do in production" will become the central engineering challenge of the next few years.
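One common workaround for probabilistic results is to sample the same check several times and gate on the pass rate instead of on a single run. The sketch below assumes that approach; `pass_rate`, `passes_gate`, and the default values for `trials` and `min_pass_rate` are hypothetical and would need tuning for any real suite.

```python
from typing import Callable


def pass_rate(sample_once: Callable[[], bool], trials: int) -> float:
    """Call a nondeterministic check `trials` times; return the fraction that passed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if sample_once())
    return passes / trials


def passes_gate(sample_once: Callable[[], bool],
                trials: int = 20,
                min_pass_rate: float = 0.95) -> bool:
    """Gate on the observed pass rate rather than on one pass/fail result."""
    return pass_rate(sample_once, trials) >= min_pass_rate
```

Here `sample_once` would make one fresh model call and apply one eval to the result. The threshold turns a flaky boolean into a number a pipeline can compare, at the cost of more model calls per check.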

The teams that close this gap will be the ones that stop treating evals as quality assurance and start treating them as the actual specification of what their systems are.

Vijay Sagar Gullapalli is an artificial intelligence engineer at Adopt AI and an inventor with a USPTO patent.

Sarat Mahavratayajula is a senior software engineer at Sherwin-Williams.


