Engineering Resilience: Lessons from Amazon Search’s Chaos Engineering Journey

Discover how Amazon Search enhanced its system resilience through practical resilience engineering, transitioning from traditional load tests to large-scale experiments and achieving key milestones along the way.

15 min read · Nov 27, 2023

Authors: Gorik Van Steenberge, Adrian Hornsby, Milosz Kosmider and Takieddine Sbiai

Originally published here.

Our previous blog post provided an overview of how Amazon Search combines technology and culture to empower its builder teams, ensuring platform resilience through chaos engineering. In this follow-up, we will address questions and feedback received from the previous post. We will discuss our Search Resilience team, detailing our progression from running load tests in the production environment to adopting chaos engineering and conducting numerous large-scale experiments. We’ll also explore the journey that brought us to this point and delve into the practical aspects and key milestones of this implementation.

The first Search GameDay in production

Amazon Search owns the product search pages for Amazon mobile apps and websites worldwide. Serving search pages depends on a complex distributed system consisting of dozens of critical services: from the actual information retrieval search engine to…

Written by Adrian Hornsby

I help software organizations improve resilience and achieve operational excellence | Former Principal Engineer at AWS | Follow for posts on resilience