If you've inherited a load testing setup for an AI agent, you probably have dashboards showing p99 latency, requests per second, and error rates. Those numbers look good. They're also almost entirely irrelevant to the reliability risks your agent actually carries.
Load testing was designed for stateless services. Send N requests, measure how the system degrades. The implicit assumption is that 'correct' is binary and verifiable at the HTTP level — a 200 is a success, a 500 is a failure. That assumption breaks completely for AI agents.
The three gaps load testing can't cover
Gap 1: Correctness at scale isn't the same as correctness under failure
Load tests probe a system under high throughput. AI agent failures happen under partial degradation — a tool that returns a slightly different schema, a downstream service that times out after 3 seconds instead of 1, a status field that contains 'COMPLETED' instead of 'completed'. These are not load conditions. They are fault conditions, and they require a different testing model.
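Fault conditions like these can be reproduced deterministically by mutating a recorded tool response before handing it back to the agent. A minimal sketch (the field names and mutation labels are hypothetical, not from any particular framework):

```python
import copy

def mutate_response(response: dict, mutation: str) -> dict:
    """Apply one fault condition to a recorded tool response."""
    mutated = copy.deepcopy(response)
    if mutation == "rename_field":
        # Schema drift: the tool renamed 'ticket_id' to 'ticketId'.
        mutated["ticketId"] = mutated.pop("ticket_id")
    elif mutation == "status_casing":
        # 'COMPLETED' instead of 'completed'.
        mutated["status"] = mutated["status"].upper()
    elif mutation == "slow_timeout":
        # Simulate a 3s response where the agent expects 1s.
        mutated["_latency_ms"] = 3000
    return mutated

recorded = {"ticket_id": "T-1001", "status": "completed"}
print(mutate_response(recorded, "status_casing"))
# {'ticket_id': 'T-1001', 'status': 'COMPLETED'}
```

Each mutated response is one fault condition, replayed at normal request volume. No load generator required.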
Gap 2: An agent can succeed at HTTP and fail at the outcome level
Suppose your support agent calls a ticket system, receives a 200 with a response body that has a renamed field, doesn't notice the rename, and closes the ticket as resolved. Every metric in your load test dashboard is green. The customer's issue is still open. The agent produced a plausible-looking response that was operationally wrong.
This is the category of failure that load testing structurally cannot detect. It requires testing what the agent does — not just whether it responds.
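Catching this class of failure means asserting on the outcome, not the HTTP exchange: check the system of record against what the agent claims it did. A sketch, with a hypothetical in-memory ticket store standing in for the real one:

```python
def outcome_passed(agent_claim: dict, ticket_store: dict) -> bool:
    """An agent run passes only if its claim matches the system of record,
    regardless of what HTTP statuses it saw along the way."""
    ticket = ticket_store[agent_claim["ticket_id"]]
    if agent_claim["action"] == "closed_as_resolved":
        return ticket["status"] == "resolved"
    return True

# The agent saw a 200 and claimed resolution, but the ticket is still open.
store = {"T-1001": {"status": "open"}}
claim = {"ticket_id": "T-1001", "action": "closed_as_resolved"}
print(outcome_passed(claim, store))  # False: green at HTTP, red at the outcome
```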
Gap 3: Reasoning errors are invisible to infrastructure monitors
When an agent reasons incorrectly — chains tool calls in the wrong order, acts on a field that was supposed to be ignored, approves an action it should have rejected — it doesn't appear in your error rate. It appears in your business logic, days or weeks later, in ways that are hard to trace back to a specific deployment.
What mutation-based replay testing covers instead
Replay testing starts from a different premise: instead of asking 'can the system handle N requests?', it asks 'does the agent handle degraded conditions correctly?'
- Schema mutations: rename a field the agent depends on, change a type, remove a key. Does the agent gracefully handle the change or does it silently produce wrong output?
- Fault injection: force a timeout, return a 429, inject an error response. Does the agent retry appropriately, escalate, or hallucinate a response?
- Value corruption: corrupt a numeric value, inject boundary conditions. Does the agent catch the anomaly or act on it?
- Policy gates: does the agent attempt irreversible actions without the required approval tokens? Does it call tools it shouldn't in this context?
These are the conditions that produce real incidents. Recording them as a pass rate — '87% of replay attempts passed under fault injection' — gives a meaningful signal that load test results do not.
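Put together, a replay run is a loop over recorded traces and mutations, with an outcome-level check deciding pass or fail. A schematic harness, where the agent and the checker are stand-in stubs you would replace with your own:

```python
def replay(traces, mutations, run_agent, outcome_ok):
    """Replay every trace under every mutation; return pass rate and failures."""
    passed, total, failures = 0, 0, []
    for trace in traces:
        for m in mutations:
            total += 1
            result = run_agent(trace, m)      # agent under test
            if outcome_ok(trace, result):     # outcome-level check, not HTTP
                passed += 1
            else:
                failures.append((trace["id"], m))
    return passed / total, failures

# Stub agent that mishandles the field rename but survives the other faults.
def stub_agent(trace, mutation):
    return {"ok": mutation != "rename_field"}

def stub_check(trace, result):
    return result["ok"]

traces = [{"id": f"t{i}"} for i in range(10)]
mutations = ["rename_field", "inject_429", "corrupt_value", "drop_approval_token"]
rate, failures = replay(traces, mutations, stub_agent, stub_check)
print(f"{rate:.0%} passed")  # 75% passed: every trace fails the rename mutation
```

The failure list is as important as the rate: it tells you which mutation class breaks the agent, which is exactly the trace-back that Gap 3 makes so hard in production.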
The right model: use both, for different things
Load testing is still valuable for agents that have high request volume and infrastructure bottlenecks. The point isn't to replace it — it's to recognize that load testing covers a completely different failure surface.
- Load testing → infrastructure reliability, throughput, latency under traffic
- Mutation replay testing → correctness under failure, policy compliance, schema resilience
The gap most teams have isn't that their load tests aren't thorough enough. It's that they don't have mutation replay coverage at all. That's the surface where AI agent incidents actually originate.
Start with the traces you already have. Replay them under fault injection. The results will tell you things about your agent that months of load testing never could.
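If your traces are already logged, getting started is mostly plumbing. A sketch, assuming one JSON object per line (JSONL) with request and response fields — adapt the shape to whatever your logging actually emits:

```python
import io
import json

def load_traces(fp):
    """Each JSONL line is one recorded tool interaction: request + response."""
    return [json.loads(line) for line in fp if line.strip()]

def inject_429(trace):
    """Fault injection: swap the recorded response for a rate-limit error."""
    faulted = dict(trace)
    faulted["response"] = {"error": "rate_limited", "http_status": 429}
    return faulted

log = io.StringIO(
    '{"tool": "tickets.close", "request": {"id": "T-1"}, '
    '"response": {"status": "completed"}}\n'
)
for trace in load_traces(log):
    replayed = inject_429(trace)
    # Hand `replayed` to the agent; score the run with an outcome-level check.
    print(replayed["response"]["http_status"])  # 429
```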
