Enterprise systems are quietly facing a new class of production incidents—one that existing chaos engineering frameworks were never designed to capture. Autonomous AI agents, deployed to detect and remediate anomalies, are initiating actions that resemble controlled chaos experiments. Yet, these events often evade postmortem scrutiny because they don’t fit traditional incident templates. The result? A growing blind spot in enterprise risk management.
As organizations ramp up AI agent deployments, the disconnect between autonomous systems and chaos engineering is becoming impossible to ignore. Recent industry data underscores the scale of this exposure. A 2025 PwC survey found that 79% of organizations now operate AI agents in production, with 96% planning further expansion. Meanwhile, Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, or inadequate risk controls. What these figures overlook are the unclassified incidents quietly unfolding beneath them, where agents act without the human judgment that has traditionally governed chaos engineering.
The missing human judgment in autonomous systems
Traditional chaos engineering relies on a critical safeguard: a human operator making real-time assessments before injecting controlled stress into a system. This judgment call—whether to proceed with an experiment—is absent when an AI agent takes autonomous action. The agent acts on its training data and immediate context, but lacks the broader situational awareness required to evaluate long-term consequences.
Consider a remediation agent that detects elevated latency in a microservice and restarts the cluster. The agent’s response is technically sound within its limited scope, but the downstream effects may be catastrophic. If three other services are handling peak traffic, connection pools are near capacity, and a database is rebuilding indexes, the restart could trigger a cascading failure. The agent’s action, while intended as a fix, becomes the chaos event itself.
This failure mode reveals a structural flaw in how enterprises govern resilience. Most chaos engineering programs—though mature in many organizations—assume human oversight. They rely on game days, blast radius controls, and SLO-gated experiments where a practitioner evaluates whether the system can absorb the perturbation. Autonomous agents bypass this entire process, leaving no opportunity for real-time risk assessment.
Absorb capacity: The blind spot in enterprise resilience
At the core of this issue is the lack of a shared framework for absorb capacity—the real-time measure of how much additional stress a system can tolerate before breaching its service-level objectives (SLOs). Traditional chaos engineering treats absorb capacity as an implicit concept, managed through static thresholds or post-incident analysis. AI agents, however, operate without any mechanism to account for this resource.
To address this gap, a new model called the resilience budget is emerging. This approach treats absorb capacity as a dynamic, consumable resource—one that must be continuously recalculated based on live system signals. Unlike static thresholds that trigger only after a limit is crossed, a resilience budget integrates multiple real-time inputs to provide an up-to-date assessment of a system’s ability to absorb stress.
The resilience budget draws on four key signal classes:
- SLO burn rate: The primary driver, as it quantifies the gap between current system behavior and committed service levels. Even if CPU utilization appears stable, a burn rate five times higher than expected means the resilience budget is effectively exhausted.
- Resource utilization trends: Patterns in memory, CPU, and I/O usage that indicate whether a system is approaching saturation, regardless of absolute thresholds.
- Dependency state: The health and load of downstream services or databases that could be affected by an agent’s actions.
- Traffic patterns: The volume and distribution of requests, which may reveal whether a system is already operating near its limits.
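The burn rate in the first item can be made concrete with a little arithmetic. The sketch below uses the common SRE definition (observed error ratio divided by the error ratio the SLO allows); the function name and the sample numbers are illustrative assumptions, not figures from any specific system.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would last exactly the SLO window;
    a value of 5.0 means it is being spent five times faster than allowed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    return error_ratio / error_budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO, 0.5% of requests failing.
    # The result is approximately 5, the "five times higher" case above.
    print(round(burn_rate(error_ratio=0.005, slo_target=0.999), 2))
```

A service in this state can show unremarkable CPU and memory graphs while still having no room left for an agent-initiated restart.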
By combining these signals, a resilience budget provides a nuanced view of absorb capacity—one that could, in theory, be integrated into an agent’s decision-making process. This would allow agents to pause or modify actions when system stress is high, reducing the risk of unintended cascades.
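One way such a combination might look is sketched below. This is a minimal illustration, not an established formula: the `Signals` fields, the `max_burn_rate` cutoff of 5.0, the ten-minute projection horizon, and the choice to let the weakest signal bound the result are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical snapshot of the four signal classes."""
    burn_rate: float          # SLO burn rate; 1.0 means exactly on budget
    utilization: float        # highest of CPU, memory, I/O, in the range 0..1
    utilization_slope: float  # change in utilization per minute
    dependency_health: float  # 0.0 = dependencies degraded, 1.0 = healthy
    traffic_ratio: float      # current load divided by tested peak load


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def resilience_budget(s: Signals, max_burn_rate: float = 5.0,
                      horizon_min: float = 10.0) -> float:
    """Return remaining absorb capacity in [0, 1]; 0.0 means exhausted."""
    # Burn rate is the primary driver: at max_burn_rate the budget is zero,
    # no matter how healthy the other signals look.
    burn_headroom = clamp(1.0 - s.burn_rate / max_burn_rate)
    # Project utilization forward so a rising trend counts before any
    # absolute threshold is crossed. Falling trends are ignored.
    projected = s.utilization + max(0.0, s.utilization_slope) * horizon_min
    resource_headroom = clamp(1.0 - projected)
    dependency_headroom = clamp(s.dependency_health)
    traffic_headroom = clamp(1.0 - s.traffic_ratio)
    # The weakest signal bounds the budget.
    return min(burn_headroom, resource_headroom,
               dependency_headroom, traffic_headroom)
```

Taking the minimum rather than a weighted average is a deliberately conservative choice: one exhausted signal cannot be offset by healthy readings elsewhere. A team with different risk tolerance might weight the signals instead.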
The path forward: Integrating agents into chaos engineering
The solution isn’t to abandon autonomous agents but to rethink how they interact with chaos engineering. Organizations must treat agents as potential chaos injectors and design governance frameworks that account for their actions. This requires three critical shifts:
- Unified incident classification: Postmortems must classify agent-initiated actions as a distinct cause category, and name them as the primary cause where they are, even when the immediate symptom resembles a traditional failure.
- Real-time risk modeling: Tools like resilience budgets should be embedded into agent decision-making pipelines, providing live absorb capacity assessments before actions are taken.
- Cross-team collaboration: Infrastructure, SRE, and agent development teams must align on shared frameworks for evaluating agent actions, ensuring that remediation strategies account for broader system state.
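The first two shifts could be wired together roughly as follows. This is a sketch under stated assumptions: `guarded_action`, `AgentActionRecord`, the `initiator` tag, and the idea of assigning each action a `cost` against the budget are hypothetical names and conventions invented for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class AgentActionRecord:
    """Audit entry that lets a postmortem identify agent-initiated actions."""
    agent: str
    action: str
    target: str
    budget_at_decision: float
    cost: float
    executed: bool = False
    initiator: str = "autonomous_agent"  # classification tag (assumed name)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def guarded_action(agent: str, action: str, target: str, cost: float,
                   read_budget: Callable[[], float],
                   execute: Callable[[], None],
                   audit_log: List[AgentActionRecord]) -> AgentActionRecord:
    """Run execute() only if the live budget covers the action's cost."""
    budget = read_budget()  # live absorb-capacity reading, in the range 0..1
    record = AgentActionRecord(agent, action, target, budget, cost)
    # Log before acting so the attempt is recorded even if execute() raises.
    audit_log.append(record)
    if budget >= cost:
        execute()
        record.executed = True
    return record


if __name__ == "__main__":
    log: List[AgentActionRecord] = []
    restarted: List[str] = []
    # Hypothetical case: budget is low, so the restart is deferred.
    guarded_action("latency-remediator", "restart", "checkout-service",
                   cost=0.3, read_budget=lambda: 0.1,
                   execute=lambda: restarted.append("checkout-service"),
                   audit_log=log)
    print(log[0].executed, restarted)
```

Because every attempt is logged with the budget reading at decision time, a later postmortem can tell whether the agent acted with headroom or was blocked, rather than inferring it from symptoms.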
The stakes are high. The AI Incident Database reports a 21% rise in AI-related incidents from 2024 to 2025, but this likely understates the true scale, as most organizations lack the mechanisms to classify agent-driven failures. Without proactive measures, the next wave of major production incidents may emerge not from external threats, but from the very agents deployed to keep production systems healthy.
The question isn’t whether enterprises will adopt AI agents. It’s whether they can afford to leave chaos engineering unchanged once those agents are running in production.