Chaos Engineering in the Age of AI: Surfacing Hidden Complexity
“The complexity of things — the things within things — just seems to be endless. I mean nothing is easy, nothing is simple.” — Alice Munro
The rise of AI in software development presents a fascinating paradox. While AI tools make it easier than ever to generate complex systems rapidly, they also make it harder to understand how these systems actually work.
This challenge isn’t entirely new — forty years ago, Lisanne Bainbridge warned us about it in her paper “Ironies of Automation.” She argued that the more we automate systems, the more we need human expertise to handle their inevitable failures. AI is now accelerating this irony to unprecedented levels.
Consider this recent case where a developer built a Python project entirely using AI tools. The project quickly grew to over 30 files, but as complexity increased, things started to break down. The AI couldn’t effectively optimize the code or fix bugs because it had lost track of the system’s underlying structure.
This illustrates what’s becoming known as the “70% problem” in AI-assisted development: AI can quickly generate the first 70% of a system, but the remaining 30% often becomes a struggle, as unforeseen complexity and bugs surface in the AI-generated output.
This represents a fundamental shift in how complexity emerges in software systems. Traditionally, complexity grows gradually, allowing developers to build intuition about potential failure points over time.
With AI-generated systems, that complexity is present from day one, but our understanding isn’t. Even more concerning, these systems are particularly prone to accumulating hidden technical debt: issues that compound silently over time and surface as system-level failures rather than isolated code-level problems.
One proposed solution is the emergence of “meta-operators,” AI agents designed to monitor and manage other AI systems’ outputs and behaviors. This might seem elegant: fighting complexity with more AI. But meta-operators, however powerful, can’t replace the fundamental need for deep system understanding. They might actually amplify Bainbridge’s paradox, making failures harder to debug by adding yet another layer of complexity. Each layer of AI abstraction makes human expertise simultaneously more critical and more difficult to maintain. When failures cascade through multiple layers of AI systems, someone still needs to understand how it all works together.
To address these layers of hidden complexity, we need a systematic approach to understanding how our systems actually behave. This is why chaos engineering is becoming increasingly relevant.
Chaos engineering is a systematic practice of deliberately subjecting a system to disruptive events in a risk-mitigated way, closely monitoring its response, and implementing the improvements it reveals. Instead of leaving these events to chance, chaos engineering lets engineers orchestrate experiments in controlled environments, typically during periods of low traffic and with engineering support on hand for rapid mitigation.
Think of chaos engineering as a compression algorithm for experience. Instead of waiting years to encounter various failure modes naturally, we can proactively surface them through controlled experiments. By running these experiments during planned conditions, not at 3 AM on a weekend, teams can discover issues before they become critical failures.
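To make that loop concrete, here is a minimal sketch of a chaos experiment in Python using only the standard library. Everything in it is a hypothetical stand-in invented for illustration: the recommendation service, the fallback behavior, and the 95% steady-state threshold. The shape, though, is the essential one: state a steady-state hypothesis, inject a fault, observe, and abort if the hypothesis is violated.

```python
import random
import time

# Hypothetical application code: falls back to a cached default when
# the dependency misbehaves. This fallback is the behavior the
# experiment is meant to verify.
def fetch_recommendations(user_id, dependency):
    try:
        return dependency(user_id)
    except TimeoutError:
        return {"source": "fallback-cache", "items": []}

# Stand-in for a real downstream service call.
def recommendation_service(user_id):
    return {"source": "live", "items": [user_id]}

# Fault injection: make a fraction of calls hang briefly and then
# time out, simulating a degraded dependency.
def inject_timeouts(func, failure_rate=0.3, delay_s=0.05):
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            time.sleep(delay_s)
            raise TimeoutError("injected dependency timeout")
        return func(*args, **kwargs)
    return wrapper

# Steady-state hypothesis: at least 95% of requests still receive a
# response (live or fallback) while faults are being injected.
def run_experiment(requests=200, abort_below=0.95):
    faulty_dependency = inject_timeouts(recommendation_service)
    served = 0
    for user_id in range(1, requests + 1):
        try:
            fetch_recommendations(user_id, faulty_dependency)
            served += 1
        except Exception:
            pass  # an unhandled failure weakens the hypothesis
        # Abort early if the steady state is clearly violated.
        if user_id >= 20 and served / user_id < abort_below:
            print(f"Aborted: only {served}/{user_id} requests served")
            return False
    print(f"Hypothesis held: {served}/{requests} requests served")
    return True

if __name__ == "__main__":
    run_experiment()
```

In a real system the dependency would be a network call and the fault would be injected at the infrastructure level, via a service mesh or a fault-injection service rather than a Python wrapper, but the loop is the same: hypothesize, inject, observe, and either abort or learn.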
With AI-generated systems, chaos engineering reveals patterns and dependencies that even the generating AI didn’t account for. By systematically testing failure modes, we can map the actual behavior of these systems against their intended design, often surfacing surprising disconnects between the two and critical gaps in our assumptions.
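As a toy illustration of that mapping, the hypothetical sketch below probes an AI-generated helper against the behavior its docstring claims. The function, the edge cases, and the “parses any price string” claim are all invented for this example; the point is the shape of the exercise: enumerate the assumptions, attack them with inputs, and record where claimed and actual behavior diverge.

```python
# Hypothetical AI-generated helper (a simplified stand-in). Its
# docstring claims more than the code delivers, which is exactly the
# kind of gap a failure-mode sweep makes visible.
def parse_price(text):
    """Parse any price string and return its value as a float."""
    return float(text.replace("$", "").replace(",", ""))

EDGE_CASES = ["$19.99", "1,299", "", "free", "  $5 ", "19.99 USD", None]

# Run the function over adversarial inputs and record which ones
# violate its documented behavior.
def map_failure_modes(func, inputs):
    report = []
    for value in inputs:
        try:
            report.append((value, "ok", func(value)))
        except Exception as exc:
            report.append((value, "FAILED", f"{type(exc).__name__}: {exc}"))
    return report

if __name__ == "__main__":
    for value, status, detail in map_failure_modes(parse_price, EDGE_CASES):
        print(f"{value!r:>12} -> {status}: {detail}")
```

Every FAILED row is a small, concrete disconnect between intended and actual functionality, the same kind of gap that chaos experiments surface at the system level.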
Going forward, we must find the right balance between AI-generated complexity and human understanding. While AI enables us to build more sophisticated systems faster than ever, and AI meta-operators might help manage them, we cannot escape the fundamental need to understand them deeply.
The challenge ahead isn’t just about building systems with AI but also about maintaining the human expertise needed to understand and repair them when they inevitably fail. As Bainbridge warned, the irony of automation is that it simultaneously makes this expertise more critical and more difficult to maintain. Adding layers of AI meta-operators may help manage day-to-day operations, but it also deepens this paradox.
As we build more AI-generated systems and layer AI operators to manage them, chaos engineering becomes essential, not just as a resilience-focused tool but as our window into understanding these increasingly abstract systems. It helps us maintain the human expertise that Bainbridge recognized as crucial forty years ago, and that becomes even more vital as AI continues to transform how we build and operate software.
- Adrian