Engineering Resilience: Lessons from Amazon Search’s Chaos Engineering Journey

Discover how Amazon Search enhanced its system resilience through practical resilience engineering, transitioning from traditional load tests to large-scale experiments and achieving key milestones along the way.

Authors: Gorik Van Steenberge, Adrian Hornsby, Milosz Kosmider and Takieddine Sbiai

Originally published here.

Our previous blog post provided an overview of how Amazon Search combines technology and culture to empower its builder teams, ensuring platform resilience through chaos engineering. In this follow-up, we address questions and feedback we received on that post. We introduce our Search Resilience team and detail our progression from running load tests in the production environment to adopting chaos engineering and conducting numerous large-scale experiments. Along the way, we cover the practical aspects and key milestones of this implementation.

The first Search GameDay in production

Amazon Search owns the product search pages for Amazon mobile apps and websites worldwide. Serving search pages depends on a complex distributed system consisting of dozens of critical services, ranging from the actual information retrieval search engine to the rendering of search pages with product images, pricing, delivery, and related information.

Figure 1. Amazon Search returns over 1,000 results for “chaos engineering”

There are dozens of teams building product search, but today we will talk about our Search Resilience team, which improves the resilience of Amazon Search by running chaos experiments in production at scale and by driving resilience initiatives. The Resilience team is part of a larger Operational Excellence organization within Search, whose vision is to make it effortless for service owners to run their services in production so they can focus on their primary mission: improving the customer experience by providing the most relevant search results as fast as possible.

Many years ago, our team was scaling up Search services to prepare for a sales event with a large projected…
