Beyond Technical Solutions: What Software Resilience Can Learn from Aviation Safety

Resilience Bites

2 min read · Apr 27, 2025

Investigations into aviation accidents have shown that human error is a leading factor in air operator incidents and accidents.

Many problems encountered by flight crews have very little to do with the technical aspects of flying.

Instead, they’re associated with poor group decision-making, ineffective communication, inadequate leadership, and poor task or resource management.

For decades, aviation safety focused almost exclusively on technical training and mechanical reliability. Yet accidents still happened.

The industry eventually recognized that no matter how well-engineered the aircraft, human factors remained the critical variable.

This led to the development of training programs that focus on communication, decision-making, and teamwork rather than just technical proficiency.

Sound familiar to anyone working in software?

We invest tons of resources into:
- Redundant systems
- Automated failover mechanisms
- Self-healing infrastructure
- High-availability architectures
- Automated recovery
- Complex monitoring

Yet, so many outages eventually come down to human factors:
- Engineers who notice issues but don’t raise them because “no one would have listened anyway”
- Teams that normalize deviance because “that timeout spike always happens”
- Communication breakdowns during incident response
- Organizational silos that prevent learning
- Alert fatigue causing teams to ignore critical warnings among the noise
- Knowledge that lives only in certain people’s heads

The most resilient organizations understand that resilience isn’t just about preventing technical failures; it’s about building adaptive capacity within teams.

It’s about creating environments where:
- Engineers feel safe to raise concerns
- Teams practice responding to failures through GameDays
- Learning is valued and prioritized
- Communication remains clear under pressure
- Operational experience is valued and retained
- Leadership acknowledges uncertainty rather than demanding false precision
- People are encouraged to question assumptions and challenge the status quo
- Different perspectives are actively sought during design and incident review
- People are empowered to make decisions during incidents without excessive escalation

Aviation transformed its safety record not just by building more reliable aircraft (though that helped), but by recognizing that humans are both the most vulnerable and the most adaptive part of the system.

Perhaps it’s time for software organizations to take these lessons seriously.

— —

If you struggle to improve resilience in your organization, contact me. I’ve spent the better part of a decade helping organizations transform their approach to system resilience and chaos engineering.

Written by Adrian Hornsby

I help software organizations improve resilience and achieve operational excellence | Former Principal Engineer at AWS | Follow for posts on resilience
