Karpathy’s March of Nines shows why 90% AI reliability is nowhere near enough



“When you get a demo and something works 90% of the time, it’s only the first nine.” — Andrej Karpathy

The “March of Nines” frames a common production reality: a solid demo gets you to roughly 90% reliability, and each additional nine tends to require a comparable amount of engineering effort. For enterprise teams, the distance between “usually works” and “works as reliable software” determines adoption.

The compounding math behind the March of Nines

“Every nine is the same amount of work.” — Andrej Karpathy

Agent workflows fail in compound ways. A typical enterprise flow might include intent analysis, context retrieval, planning, one or more tool invocations, verification, formatting, and an audit record. If a workflow has n steps and each step succeeds with probability p, end-to-end success is approximately p^n.
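The compounding is easy to verify directly; a quick sketch:

```python
def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed."""
    return p ** n

# 10 steps at 90% per-step reliability leaves only ~35% end-to-end
print(round(end_to_end_success(0.90, 10), 4))
```

At 99% per-step reliability the same ten steps still fail nearly one time in ten, which is why single-step accuracy numbers flatter agent systems.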

In a 10-step business process, every step’s failure probability compounds. And unless you harden shared dependencies, correlated outages (auth, rate limits, connectors) make things worse than the independent-failure model suggests.

| Per-step success (p) | 10-step success (p^10) | Workflow failure rate | At 10 workflows/day | What this means in practice |
| --- | --- | --- | --- | --- |
| 90.00% | 34.87% | 65.13% | ~6.5 failures/day | Prototype territory: most workflows break |
| 99.00% | 90.44% | 9.56% | ~1 failure/day | Good for demos, but failures are still frequent in real use |
| 99.90% | 99.00% | 1.00% | ~1 every 10 days | Still feels unreliable, because misses remain visible |
| 99.99% | 99.90% | 0.10% | ~1 every 3.3 months | This is where it starts to feel like reliable enterprise-grade software |

Define reliability as a measurable SLO

“It makes more sense to spend a little more time being more specific in your description.” — Andrej Karpathy

Teams reach the higher nines by turning reliability into measurable goals, then investing in controls that reduce variance. Start with a small set of SLIs that covers both model behavior and the surrounding system:

  • Workflow completion rate (success or clear progress).

  • Tool-call success rate and timeout rate, with strict validation on inputs and outputs.

  • Schema-valid output rate for structured responses (JSON/function arguments).

  • Policy compliance rate (PII, secrets, and safety restrictions).

  • p95 end-to-end latency and cost per workflow.

  • Fallback rate (safer model, cached data, or human review).

Set SLO targets by workflow tier (low/medium/high impact) and manage an error budget so rollouts stay under control.
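As a sketch of how an error budget falls out of an SLO target (the class, field names, and numbers here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    target: float        # e.g. 0.99 workflow completion rate
    window_events: int   # events observed in the current window
    failures: int        # failed events in the same window

    @property
    def error_budget(self) -> float:
        # total failures the window may absorb before the SLO is breached
        return (1.0 - self.target) * self.window_events

    @property
    def budget_remaining(self) -> float:
        return self.error_budget - self.failures

slo = SLO("workflow_completion", target=0.99, window_events=10_000, failures=62)
print(round(slo.budget_remaining))  # failures the window can still absorb
```

When the remaining budget approaches zero, freeze risky changes and spend engineering time on reliability instead of features.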

Nine moves that reliably add nines

1) Bound autonomy with an explicit workflow graph

Reliability increases when the system has bounded states and deterministic handling for retries, timeouts, and terminal results.

  • Model calls sit inside a state machine, or DAG, where each node defines the allowed tools, maximum attempts, and success predicate.

  • Persist state with idempotency keys so that retries are safe and debuggable.
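One way to make retries safe is to derive a deterministic idempotency key from the step name and its inputs, so a retried step maps back to work already committed. A sketch, with an in-memory dict standing in for a durable store:

```python
import hashlib
import json

def idempotency_key(step: str, payload: dict) -> str:
    """Same step + same inputs -> same key, across retries and restarts."""
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{step}:{blob}".encode()).hexdigest()[:16]

_committed: dict[str, dict] = {}  # stand-in for a durable store

def run_idempotent(step: str, payload: dict, do_fn):
    key = idempotency_key(step, payload)
    if key in _committed:
        # retry after a crash: return the recorded result, do no new work
        return _committed[key]
    result = do_fn(payload)
    _committed[key] = result
    return result
```

Because the key is derived from canonicalized inputs, a retry storm cannot double-book a meeting or double-charge a customer.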

2) Enforce contracts at every boundary

Most production failures start as interface slippage: malformed JSON, missing fields, incorrect units, or invented identifiers.

  • Use JSON Schema/protobuf for every structured output and validate server side before executing any tool.

  • Prefer typed numbers and canonical identifiers, and normalize time (ISO-8601 with time zone) and units (SI).
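A minimal illustration of a boundary contract, using hypothetical fields and plain-Python checks in place of a full JSON Schema validator:

```python
from datetime import datetime, timezone

# illustrative contract for one tool's input
REQUIRED = {"order_id": str, "amount": float, "due": str}

def validate_and_normalize(out: dict) -> dict:
    """Server-side check before any tool executes."""
    for field, typ in REQUIRED.items():
        if field not in out:
            raise ValueError(f"missing field: {field}")
        if not isinstance(out[field], typ):
            raise ValueError(f"{field}: expected {typ.__name__}")
    # normalize time to ISO-8601 in UTC; reject naive timestamps
    due = datetime.fromisoformat(out["due"])
    if due.tzinfo is None:
        raise ValueError("due: timestamp must carry a time zone")
    out["due"] = due.astimezone(timezone.utc).isoformat()
    return out
```

The key property is that validation happens server-side, before execution; a model that emits a plausible but malformed payload is stopped at the boundary instead of inside a downstream system.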

3) Layer validators: syntax, semantics, business rules

Schema validation catches formatting errors. Semantic and business-rule checks prevent plausible-looking answers that break the system.

  • Semantic checks: referential integrity, numeric bounds, permission checks, and deterministic joins on IDs when available.

  • Business rules: approvals for write actions, data-residency restrictions, and tenant-level limits.
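The three layers can be composed as a short validator chain that fails fast, cheapest check first. The reference data and limit below are invented for illustration:

```python
KNOWN_ACCOUNTS = {"acct_1", "acct_2"}  # hypothetical reference data
MAX_REFUND = 500.0                     # hypothetical business limit

def check_syntax(out):
    if not isinstance(out.get("amount"), (int, float)):
        raise ValueError("amount must be numeric")

def check_semantics(out):
    if out.get("account") not in KNOWN_ACCOUNTS:
        raise ValueError("unknown account id")  # referential integrity

def check_business(out):
    if out["amount"] > MAX_REFUND:
        raise ValueError("refund exceeds limit; needs approval")

def validate(out):
    # cheapest layer first; any failure stops execution before side effects
    for layer in (check_syntax, check_semantics, check_business):
        layer(out)
    return True
```

A refund of 900 to a known account passes syntax and semantics but is stopped by the business rule, which is exactly the class of plausible-but-wrong output schemas cannot catch.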

4) Route on risk using uncertainty signals

High-impact actions deserve higher assurance. Risk-based routing turns uncertainty into a product feature.

  • Use confidence signals (classifiers, self-consistency checks, or a second-model verifier) to drive routing.

  • Gate risky steps behind stronger models, additional validation, or human approval.
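A hypothetical router along these lines, where the thresholds and path names are placeholders to be tuned per workflow:

```python
def route(confidence: float, impact: str) -> str:
    """Map (confidence, impact) to an assurance path.

    Illustrative policy: high-impact actions always get a human,
    low-confidence outputs get rechecked by a stronger model.
    """
    if impact == "high" or confidence < 0.5:
        return "human_review"
    if confidence < 0.9:
        return "strong_model_recheck"
    return "auto_execute"
```

The effect is that autonomy becomes a function of both how sure the system is and how much damage a mistake could do, rather than a global setting.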

5) Engineer tool invocations like distributed-systems calls

Connectors and dependencies often dominate failure rates in agent systems.

  • Apply per-tool timeouts, backoff with jitter, circuit breakers, and concurrency limits.

  • Version tool schemas and validate tool responses to avoid silent breakage when APIs change.
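Sketches of two of these mechanisms: full-jitter exponential backoff and a simple consecutive-failure circuit breaker (threshold and base values are illustrative):

```python
import random

def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)).

    Randomizing the delay spreads retries out so a flaky dependency
    is not hammered by synchronized retry storms.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; callers fail fast while open."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold
```

A production breaker would also add a half-open probe state and per-tool metrics, but the core idea is the same: stop sending traffic into a dependency that is already failing.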

6) Make retrieval predictable and observable

Retrieval quality bounds how grounded your application can be. Treat it as a versioned data product with its own SLIs.

  • Track retrieval latency, document freshness, and hit rate on tagged queries.

  • Ship index changes behind canaries, so you catch regressions before users do.

  • Implement least-privilege access and redaction at the retrieval layer to reduce the risk of leaks.
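Hit rate on tagged queries can be computed directly from retrieved-versus-relevant document sets; a minimal sketch, with document IDs invented for illustration:

```python
def hit_rate(retrieved: list, relevant: list) -> float:
    """Fraction of tagged queries where at least one relevant doc was retrieved."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if set(got) & set(want))
    return hits / len(relevant)

# two tagged queries: the first finds its relevant doc, the second does not
print(hit_rate([["d1", "d2"], ["d7"]], [["d2"], ["d9"]]))
```

Tracked per index version, a drop in this number on a canary is an early, cheap signal that an index change degraded grounding before any user-visible failure appears.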

7) Build a production evaluation pipeline

The next nines depend on finding rare failures quickly and preventing regressions.

  • Curate an incident-driven golden set from production traffic and run it on every change.

  • Enable shadow mode and A/B canaries with automatic rollback on SLI regressions.
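A golden-set runner can be as simple as replaying curated cases against the current system and failing the build when regressions exceed budget. A sketch, with a hypothetical case format:

```python
def run_golden_set(cases, system_fn, max_regressions=0):
    """Replay curated cases on every change.

    Returns (passed, failed_ids); the build fails when regressions
    exceed the allowed budget. Each case is a hypothetical dict with
    "id", "input", and "expected" keys.
    """
    failed = [c["id"] for c in cases if system_fn(c["input"]) != c["expected"]]
    return len(failed) <= max_regressions, failed
```

In practice the comparison is usually a rubric or model-graded check rather than strict equality, but the discipline is the same: every incident becomes a case, and no change ships without replaying the set.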

8) Invest in observability and fast incident response

When failures are rare, the speed of diagnosis and recovery becomes the limiting factor.

  • Emit traces/spans at each step, store tool I/O with redacted prompts and strong access controls, and categorize every failure into a taxonomy.

  • Use runbooks and “safe mode” switches (disable risky tools, swap models, require human approval) for quick mitigation.
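One way to wire a safe-mode switch is as a config overlay applied at request time, so flipping it requires a runbook action rather than a deploy. The policy values below are illustrative:

```python
# illustrative safe-mode policy, flipped by a runbook rather than a deploy
SAFE_MODE = {
    "disable_tools": {"payments", "email"},
    "force_model": "small-strict",
}

def effective_config(base: dict, incident: bool) -> dict:
    """Overlay the safe-mode policy on a tenant's base config during an incident."""
    cfg = dict(base)
    if incident:
        cfg["tools"] = [t for t in cfg["tools"] if t not in SAFE_MODE["disable_tools"]]
        cfg["model"] = SAFE_MODE["force_model"]
        cfg["require_approval"] = True  # humans back in the loop during incidents
    return cfg
```

Because the overlay is pure data, it is easy to test, audit, and revert, which matters when it is exercised at 3 a.m.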

9) Ship an autonomy slider with deterministic fallbacks

Imperfect systems need control, and production software needs a safe way to earn autonomy over time. Treat autonomy as a dial with a safe default path, not a binary switch.

  • Default to read-only or undoable actions; require explicit approval (or approval workflows) for writes and irreversible operations.

  • Create deterministic fallbacks: retrieval-only responses, cached responses, rule-based handlers, or escalation to human review when confidence is low.

  • Expose per-tenant safe modes: disable risky tools/connectors, force a stronger model, lower temperatures, and pause autonomy during incidents.

  • Design resumable commits: show in-progress status and a plan/diff, and let a reviewer confirm and resume the exact step with an idempotency key.
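The autonomy ladder itself can be a small deterministic gate; the risk classes and levels below are illustrative:

```python
# hypothetical ladder: autonomy is earned per tenant/workflow, not set globally
REQUIRED_LEVEL = {"read": 0, "undoable_write": 1, "irreversible": 2}

def allowed(action_kind: str, autonomy_level: int) -> bool:
    """An action runs unattended only if the earned level covers its risk class."""
    return autonomy_level >= REQUIRED_LEVEL[action_kind]
```

Anything not allowed falls through to the approval path, so widening autonomy is a one-line config change backed by the workflow's SLO history rather than a code change.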

Implementation sketch: a bounded step wrapper

A small wrapper around each model/tool step turns flakiness into policy-driven control: strict validation, limited retries, timeouts, telemetry, and explicit fallbacks.

def run_step(name, try_fn, validate_fn, *, max_attempts=3, timeout_s=15):
    # trace all attempts under a single span
    span = start_span(name)
    mode = "default"
    for attempt in range(1, max_attempts + 1):
        try:
            # a deadline so one slow dependency cannot stall the whole workflow
            with deadline(timeout_s):
                out = try_fn(mode=mode)
            # gate: schema + semantics + business invariants
            validate_fn(out)
            # success path
            metric("step_success", name, attempt=attempt)
            return out
        except (TimeoutError, UpstreamError) as e:
            # transient failure: back off with jitter to avoid retry storms
            span.log({"attempt": attempt, "error": str(e)})
            sleep(jittered_backoff(attempt))
        except ValidationError as e:
            # bad output: retry in "safer" mode (lower temperature / stricter prompt)
            span.log({"attempt": attempt, "error": str(e)})
            mode = "safer"
    # fallback: keep the system safe when retries run out
    metric("step_fallback", name)
    raise StepFailed(f"{name} failed")

Why enterprises insist on the next nines

Reliability gaps become business risk. McKinsey’s 2025 global survey reports that 51% of organizations using AI have experienced at least one negative outcome, and nearly one-third report outcomes related to AI inaccuracy. These results increase the demand for stronger measurement, guardrails, and operational controls.

Closing checklist

  • Pick one high-value workflow, set its completion SLO, and instrument terminal status codes.

  • Add contracts + validators around each model output and tool input/output.

  • Treat connectors and retrieval as first-class reliability work (timeouts, circuit breakers, canaries).

  • Route high-impact actions through higher-assurance paths (verification or approval).

  • Turn every incident into a regression test in your golden set.

Nines come from disciplined engineering: constrained workflows, strict interfaces, hardened dependencies, and fast operational learning loops.

Nikhil Mungel has spent over 15 years building distributed systems and AI teams at SaaS companies.


