The Netflix Chaos Monkey story is well known. Less well known: almost every engineering team can recite it, and almost none applies the idea to its AI agents. In 2026, that gap is getting expensive.
Traditional software breaks in predictable ways. A service is either up or down, a function either throws or returns. AI agents break differently. They can appear to succeed — complete HTTP handshakes, return valid JSON — while producing outcomes that are subtly or catastrophically wrong. A payment gets double-processed. A support ticket gets closed without resolution. A code review approves a security regression.
What makes AI agents different
Three properties of tool-using agents create failure modes that traditional testing misses entirely.
- Non-determinism. The same input, run twice, can produce different tool call sequences, different outputs, and different downstream effects. Unit tests that pass today may not characterize behavior under slightly different conditions tomorrow.
- External tool composition. Agents don't execute in isolation — they call real APIs: Stripe, Jira, Salesforce, your internal services. Any of those tools can time out, change schema, return unexpected status codes, or return plausible-looking data that is nonetheless wrong.
- Silent failure. Unlike a 500 error, an agent that reasons incorrectly rarely announces itself. It completes. It returns a result. The failure only surfaces later, in production, in a bug report, or in a customer complaint.
Chaos engineering for software addressed the first category of failure: what happens when a dependency goes down? Chaos engineering for AI agents needs to address all three: non-determinism, partial tool failure, and reasoning-level errors that look like success.
What chaos engineering means for agents
Applied to AI agents, chaos engineering has three phases that map closely to how reliability teams already think about distributed systems.
1. Record real traces
The starting point is a faithful record of what your agent actually does in production. Not a mock — a real trace of model calls, tool calls, arguments, and results. This is your ground truth. It captures the specific tools your agent uses, the schema it depends on, and the reasoning patterns it applies.
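A minimal sketch of what a trace recorder can look like. All names here (`ToolCall`, `Trace`, `record`) are illustrative, not the API of any particular tool: the point is simply that every tool invocation is captured with its arguments, result, and latency, and can be serialized for later replay.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict, field

@dataclass
class ToolCall:
    tool: str        # name of the tool the agent invoked
    args: dict       # arguments it passed
    result: dict     # what the tool returned
    latency_ms: float

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    calls: list = field(default_factory=list)

    def record(self, tool: str, args: dict, fn):
        """Invoke the real tool, but keep a faithful record of the call."""
        start = time.monotonic()
        result = fn(**args)
        self.calls.append(
            ToolCall(tool, args, result, (time.monotonic() - start) * 1000)
        )
        return result

    def dump(self, path: str):
        """One JSON object per line, so traces are easy to diff and replay."""
        with open(path, "w") as f:
            for call in self.calls:
                f.write(json.dumps(asdict(call)) + "\n")
```

The key design choice is that `record` wraps the real call rather than mocking it: the trace is ground truth precisely because the tool actually ran.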
2. Replay under controlled fault injection
With a trace, you can replay it in a controlled environment and inject specific failure conditions: rename a field your agent depends on, force a timeout on a payment tool, return a corrupted status. This is mutation testing — you are testing whether your agent handles degraded conditions correctly, not whether it works when everything is perfect.
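A sketch of how replay with fault injection can work, under the assumption that a trace is a list of recorded calls like the ones above. The mutation here is the field rename mentioned in the text; `agent_step` stands in for however your agent consumes a tool result — its name and signature are illustrative.

```python
import copy

def rename_field(result: dict, old: str, new: str) -> dict:
    """Mutation: simulate a tool whose schema changed and renamed a field."""
    mutated = copy.deepcopy(result)
    if old in mutated:
        mutated[new] = mutated.pop(old)
    return mutated

def replay(trace_calls: list, mutations: list, agent_step) -> list:
    """Feed each recorded tool result back to the agent, applying any
    mutation registered for that tool, and collect the agent's outcomes."""
    outcomes = []
    for call in trace_calls:
        result = call["result"]
        for target_tool, mutate in mutations:
            if call["tool"] == target_tool:
                result = mutate(result)
        outcomes.append(agent_step(call["tool"], result))
    return outcomes
```

Because mutations are applied to recorded results rather than live calls, the same degraded condition can be replayed deterministically, as many times as needed, without touching production systems.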
3. Enforce policies and gate deployments
The output of replay is a pass rate and a set of policy violations. An agent that passes 95% of mutations is more reliable than one that has only been tested against a happy path. Blocking a deployment when the pass rate drops below a threshold is the same discipline as blocking on failing unit tests — it just applies to a different failure surface.
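The gating step reduces to a few lines in CI. This is a generic sketch, not any vendor's interface: the threshold value and the `(mutation_name, passed)` result shape are assumptions you would adapt to your own replay output.

```python
import sys

PASS_RATE_THRESHOLD = 0.95  # assumption: a team-chosen bar, like a coverage gate

def gate(results: list) -> int:
    """results: list of (mutation_name, passed) pairs from a replay run.
    Returns a process exit code: 0 to allow the deploy, 1 to block it."""
    passed = sum(1 for _, ok in results if ok)
    rate = passed / len(results)
    if rate < PASS_RATE_THRESHOLD:
        print(f"FAIL: pass rate {rate:.0%} below {PASS_RATE_THRESHOLD:.0%}")
        for name, ok in results:
            if not ok:
                print(f"  violated: {name}")
        return 1
    print(f"PASS: pass rate {rate:.0%}")
    return 0

if __name__ == "__main__":
    # Hypothetical results; in practice these come from the replay run.
    sys.exit(gate([("timeout_on_payment_tool", True),
                   ("rename_status_field", True)]))
```

Wiring the exit code into the pipeline makes the policy non-optional, exactly as a failing unit test would.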
Why now
Tool-using agents are no longer experimental. They are running in production, handling customer-facing workflows, making decisions with real financial and operational consequences. The teams building them are borrowing reliability practices from distributed systems engineering and finding that most of them transfer — but the tooling hasn't caught up.
A unit test suite that achieves 90% coverage doesn't cover the behavior of your agent when Stripe returns a 429 with a different error schema than documented. A load test doesn't tell you whether your agent produces correct outcomes under partial failure. Chaos replay does.
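That Stripe scenario is itself expressible as a single mutation. The response body below is hypothetical — the point is that it deliberately deviates from the documented error schema, so a replay run reveals whether the agent retries sensibly or misreads the response as something else.

```python
def stripe_429_undocumented_schema(_recorded_result: dict) -> dict:
    """Mutation: replace a recorded payment-tool response with a rate-limit
    error whose body does not match the documented schema (hypothetical)."""
    return {
        "http_status": 429,
        # Documented schema nests details under an "error" object; this
        # body does not, and retry_after arrives as a string, not an int.
        "message": "Rate limit exceeded",
        "retry_after": "2",
    }
```

An agent that passes this mutation either backs off and retries or surfaces the rate limit explicitly; an agent that fails it is the silent-failure case from earlier.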
The question isn't whether your agent will encounter unexpected tool behavior in production. It will. The question is whether you find out in testing or in an incident.
Sepurux is built for exactly this. Instrument your agent in minutes, replay traces under fault injection, enforce policies before merge. The chaos monkey for AI.
