Testing Spot Instance interruptions with AWS Fault Injection Simulator
Chaos engineering on AWS
“Without data, you’re just another person with an opinion.”
― W. Edwards Deming
A couple of years ago, I wrote about Operational Excellence (OE) and discussed the three interconnected elements needed to successfully operate the technology we build. First, you need great tools. Second, you need complete processes. Third, and arguably most important, you need the right culture.
I mentioned that OE resembles a habit, a philosophy, a mindset — one that embraces problem-solving, one that values continuous improvement, and one that aims to exceed goals consistently. It’s a way to anticipate, address, and effectively respond to issues. And, for Amazon, it also means doing all of that at a significant scale, where significant can mean thousands of people and millions of servers across the globe.
Above all, a culture focused on Operational Excellence means that you don’t speculate. You don’t speculate about the security, the performance, the resilience, or the health of your service, or anything else for that matter. You use data. Data alone lets you understand, and verify, what happens to your application when the environment in which it operates…