Engineering Resilience: Lessons from Amazon Search’s Chaos Engineering Journey
Discover how Amazon Search strengthened its systems through practical resilience engineering, moving from traditional load tests to large-scale experiments and reaching key milestones along the way.
Authors: Gorik Van Steenberge, Adrian Hornsby, Milosz Kosmider and Takieddine Sbiai
Originally published here.
Our previous blog post gave an overview of how Amazon Search combines technology and culture to empower its builder teams and keep the platform resilient through chaos engineering. In this follow-up, we address questions and feedback we received on that post. We introduce our Search Resilience team and trace our progression from running load tests in the production environment to adopting chaos engineering and conducting numerous large-scale experiments, covering the practical aspects and key milestones along the way.
The first Search GameDay in production
Amazon Search owns the product search pages for Amazon mobile apps and websites worldwide. Serving search pages depends on a complex distributed system consisting of dozens of critical services: from the actual information retrieval search engine to…