Towards continuous resilience
How to anticipate, monitor, respond, and continuously learn from failure.
I want to express my gratitude to my colleagues and friends Ricardo Sueiras, Isabel Huerga Ayza, Matt Fitzgerald, Antonio Valle, Aaron Schwam, and Won Huh for their valuable feedback.
Let me ask you a couple of questions:
Do you remember your first outage?
How smart and confident did you feel back then?
I clearly remember mine. I was very nervous and sweating a lot. I started making mistakes I usually wouldn’t. It was a disaster. I had no idea what I was doing.
It has always been a mystery to me why no one ever trained me to recover from outages. Not school. Not my employers. No one.
Instead, and I think I can speak for many of us, we learn the hard way when failure happens in production.
Have you ever wondered why outages are so scary, by the way?
One obvious reason is, of course, the cost.