Make your business more resilient in the digital age

Oct 17, 2019

The guest post below, written by my good friend Ricardo Sueiras, principal evangelist at AWS, discusses resiliency and chaos engineering in the enterprise world. I am sure you will enjoy it!

-Adrian

“You don’t choose the moment, the moment chooses you! You only choose how prepared you are when it does.”
Fire Chief Mike Burtch

It was a crisp autumn morning, the first properly cold morning that I could remember for some time, as I stood assembled with colleagues at a gathering point following a fire alarm. Given the number of people to be accounted for and marshaled to a place of safety, it all went smoothly, the result of countless training sessions, tests and drills. Mike Burtch’s famous quote captures the value of that training and those drills.

Huddling for warmth in the cold air, my colleagues and I started discussing recent stories of high-profile businesses that had been impacted by major system or application failures. Whether it’s the loss of mobile phone service following an outage at a major telco [4], having your travel plans severely disrupted when an airport computer system fails, or the difficulty many customers have had accessing their financial institutions [3], such failures are happening with increasing frequency across sectors and industries.

Why — with all the business continuity planning these days — is this happening?

As many businesses introduce digital initiatives that customers grow to depend on, any downtime can have a potentially damaging impact on the bottom line. Recent figures underscore this point:

98% of organizations say a single hour of downtime costs over $100,000
81% of respondents indicated that 60 minutes of downtime costs their business over $300,000
33% of enterprises reported that one hour of downtime costs their firms $1–5 million [2]

The CEO of a major airline recently shared that technological failures that stranded its customers in May 2017 cost the company £80 million. Downtime is becoming a major KPI for engineering teams and, increasingly, a boardroom concern.

“Everything fails all the time” — Werner Vogels, CTO at Amazon.

Delivering business value on a more frequent and consistent basis requires the business to learn and gain new capabilities. As architecture evolves to take advantage of new technology and innovations that provide those capabilities, systems are becoming more complex.

As change velocity increases and systems become more distributed, how do you keep things running and performing optimally so that you don’t impact your customers or your business?

Cloud technology, such as AWS Cloud, enables organizations to architect and build globally resilient architectures that support the rate of change that digital businesses expect. The biggest obstacles you may find on your journey to creating more robust systems are often not technological, but cultural and political.

For example, there are deep cultural challenges within engineering communities that vary between industries and geographies. Some teams don’t have the flexibility to simulate disasters on a system because real-life disasters are occurring so rapidly that they might be spending all of their time triaging rather than trying to get ahead of the situation. Testing can also be political. Finding the points of failure in a system might force deep conversations about a particular software architecture and its robustness in the face of tough situations. A particular team might be deeply invested in a specific technical roadmap (e.g. microservices) that tests show may not be as resilient to failures as they originally predicted.

How can organizations create more resilient architectures and invest in areas that address this, such as chaos engineering? Business leaders should aim to create a culture of operational excellence and build in mechanisms such as the correction of errors or fitness functions that embrace a data-driven, non-emotional approach to engineering.

How do you ensure your technical staff have the opportunity to learn and skill up, your architects begin adopting new design patterns, and your management understands and buys into new concepts? People and culture are key to success. Adrian sums this up nicely:

“One important thing to realize early on, is that building resilient architecture isn’t all about software. It starts at the infrastructure layer, progresses to the network and data, influences application design and extends to people and culture.”

The executive board plays a critical role in addressing the cultural and political challenges of instilling change. The business impact of not investing sufficiently in areas such as resilient application architecture, and in new competencies such as chaos engineering and operational excellence, is clear. In this blog post, I’ll share why this is important not just for the CIO and the operational parts of the business, but also for the executive board.

Regulatory compliance

At the beginning of this post I shared some real-world examples of businesses in regulated industries that have caused significant disruption to their customers. Whether it’s finance, healthcare, communications or transportation, the failure of these businesses to make their applications resilient has prompted concern from regulators who have said they will look into [6] the outages customers experienced. Service disruptions will inevitably happen, but the ability to recover without customers being affected is vital for financial stability.

Executive takeaway: architecting and building using cloud native principles can mitigate the impact of new regulations on the business, allowing it to respond quickly to those changes.

Building more resilient digital business

The way to build a resilient digital business is to take a ground-up approach to resiliency and to test and practice it on an ongoing basis. A key element of planning should be taking steps to limit blast radius and collateral damage so that when you do have a problem, it can be contained and managed.

Most products or services are tested during the development and deployment phases, but you need to think beyond traditional testing and expand how you test for “unknown unknowns.” I remember years ago when we were testing core infrastructure resiliency we would use a “low fidelity” test of unplugging a cable from one server or connecting it to a different port to see what happened. On many occasions our systems did not react the way we expected, so it was back to the drawing board. However, it was also one less thing that would cause our customers pain.

This is where the discipline of building resilient architecture, together with principles such as chaos engineering to test and validate those architectures and understand blast radius, can help IT better understand and mitigate the impact of different types of failure.
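To make the idea of a controlled experiment with a limited blast radius more concrete, here is a minimal sketch in Python. It is purely illustrative and runs against a simulated fleet rather than real infrastructure: the `Instance` class, the `success_rate` and `run_experiment` functions, the 95% threshold and the single client retry are all assumptions made for this example, not part of any AWS service or chaos engineering tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A simulated server (hypothetical stand-in for a real instance)."""
    name: str
    healthy: bool = True


def success_rate(fleet, rng, requests=1000, retries=1):
    """Simulate client requests: each attempt lands on a random instance,
    and a failed attempt is retried up to `retries` times."""
    succeeded = 0
    for _ in range(requests):
        for _ in range(retries + 1):
            if rng.choice(fleet).healthy:
                succeeded += 1
                break
    return succeeded / requests


def run_experiment(fleet, blast_radius, threshold=0.95, seed=0):
    """Fail a bounded fraction of the fleet, then check whether the
    steady-state hypothesis (success rate >= threshold) still holds."""
    rng = random.Random(seed)

    # 1. Verify the steady state before touching anything.
    baseline = success_rate(fleet, rng)
    if baseline < threshold:
        return {"status": "skipped", "baseline": baseline, "injected": 0}

    # 2. Limit the blast radius: only a fraction of healthy instances fail.
    healthy = [instance for instance in fleet if instance.healthy]
    count = min(len(healthy), max(1, round(len(fleet) * blast_radius)))
    targets = rng.sample(healthy, count)
    for instance in targets:
        instance.healthy = False

    # 3. Observe, and always roll back, even if the measurement raises.
    try:
        observed = success_rate(fleet, rng)
    finally:
        for instance in targets:
            instance.healthy = True

    status = "hypothesis held" if observed >= threshold else "weakness found"
    return {"status": status, "baseline": baseline,
            "observed": observed, "injected": count}


if __name__ == "__main__":
    fleet = [Instance(f"web-{i}") for i in range(20)]
    print(run_experiment(fleet, blast_radius=0.1))  # small, contained failure
    print(run_experiment(fleet, blast_radius=0.5))  # much larger failure
```

In this toy model, losing 10% of the fleet is absorbed by the retry, while losing half of it is not. The structure matters more than the numbers: check the steady state first, bound how much can fail, and always roll back.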

Executive takeaway: building resilience will reduce customer-impacting incidents, and the potential for more widespread brand and financial damage.

Better customer experience

While your business is creating better customer experiences through the use of digital technologies such as the cloud (above the line), ensuring that experience is not impacted by availability (below the line) is critical. Adrian Hornsby says: “Chaos engineering is not about breaking things randomly without a purpose. It’s about breaking things in a controlled environment through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.” Ultimately, what this translates to is reducing disruption for your customers when failures or unexpected events occur in your environment.

Executive takeaway: increase your net promoter scores by creating trust in your products and services.

Talent retention and acquisition

According to the 2018 PwC CEO survey [5], over 70% of CEOs are concerned about the shortage of digital skills in their industry and 50% said it was somewhat or very difficult to attract digital talent. Investing in your existing talent is a great way to retain them and build the skills necessary to address the other areas mentioned above. Mechanisms such as Game Days, which you can think of as fire drills, help everyone practice working under pressure, but in a competitive situation rather than a life-or-death one. If this is done regularly, when (and it will always be when) things do happen, your digital immune system will be able to respond more effectively. Sharpening your team’s operational skills in this way will help them focus more on prevention than on fixing issues. Moreover, it will test your monitoring and alerting systems, a task rarely undertaken in real life.

Executive takeaway: provide opportunities for your people to reach their full potential, then ensure you retain them. This creates a ripple effect, making your business a place that great people want to be part of.

Strengthening your security posture

Attacks against a business’s cyber defenses represent one of the clearest definitions of “chaos”. Should an attacker be successful, it could lead to significant business damage. Security should be your number one priority, so how do you ensure that as a business, you are able to deal with the unexpected or the unplanned?

As systems become more decentralized and take advantage of evolving cloud native architectures, this task is even more important to bake in from the start. Businesses need to learn new skills, such as chaos engineering, that will harden their organizational immune system. Consider how other professions approach this: for the firefighters mentioned at the beginning of this post, or medical staff working in emergency response units, training regimens are designed to improve their ability to respond with more predictable outcomes when required.

Applying chaos engineering techniques will help change the mental model from a focus on a specific (sometimes narrow) attack vector to the overall design and vulnerability of a system.

Executive takeaway: raise the bar in how you are able to respond to and protect your business from known and unknown security vulnerabilities.

Putting “chaos” in the “C” of CEO

“By failing to prepare, you are preparing to fail.” — Benjamin Franklin

Given these examples of high-profile disruptions, businesses can no longer afford to stand still. They must evolve their operational and engineering practices and architect their systems with operational excellence in mind.

It’s important that the executive function understands that this is not just a technology issue but a broader business issue, and one that has a significant impact on people and culture. It’s about building confidence in the products and services you deliver to your customers and ensuring that you don’t become yet another case study on the business impact of a technology failure.

-Ricardo

References:

[1] https://www.usenix.org/blog/gameday-creating-resiliency-through-destruction (quote from Jesse Robbins)
[2] https://www.evolveip.net/blog/idc-statistics-financial-impact-unplanned-downtime/ and https://www.randgroup.com/insights/cost-of-business-downtime/
[3] Banking outages article in The Independent
[4] O2 outage
[5] PwC CEO Survey 2018: https://www.pwc.com/gx/en/ceo-survey/2018/deep-dives/ceo-survey-financial-services-talent-report-web.pdf
[6] https://www.theguardian.com/money/2018/jul/05/banking-outages-should-be-limited-to-two-days-say-regulators
[7] https://www.theregister.co.uk/2018/11/23/treasury_committee_probes_banking_it_failures/