I’d like to start this article by inviting you to join us at the AWS Resiliency and Chaos Engineering Online Series, on October 27-28, 2020 from 11:00am to 2:00pm AEST each day.
This event is packed with great insights on how to build resilient applications in the cloud, with presentations from Canva and Gremlin; a keynote from Adrian Cockcroft, who helped pioneer Chaos Engineering at Netflix, and who now works with AWS; and many other sessions including a couple from myself about immutable architectures and practicing chaos engineering.
Check out the complete agenda online and register to attend!
From Gameday to Chaos Engineering
In the early 2000s, Jesse Robbins, whose official title was Master of Disaster at Amazon, created and led a program called GameDay, a program inspired by his experience training as a firefighter. GameDay was designed to test, train and prepare Amazon systems, software, and people to respond to a disaster by purposely injecting failures into critical systems.
When Netflix started migrating to the AWS Cloud in 2008, they wanted to enforce very strong architectural guidelines. One of those guidelines was that micro-services should be stateless and auto-scaled, meaning any instance can be terminated and replaced automatically without any loss of state. To that end, they developed and deployed a tool called Chaos Monkey on AWS that helps applications tolerate random instance failures.
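The core of the Chaos Monkey idea can be sketched in a few lines. This is a minimal, hypothetical illustration (not Netflix's actual implementation): pick a random instance from a group and mark it for termination, optionally gated by a probability so the monkey only strikes some of the time.

```python
import random


def pick_victim(instance_ids, probability=1.0):
    """Chaos Monkey-style selection: randomly choose one instance to
    terminate from a group. Returns None if the group is empty or the
    probability gate doesn't trigger this round.

    `instance_ids` is a hypothetical list of instance identifiers; in a
    real setup the termination itself would go through your cloud API.
    """
    if not instance_ids or random.random() > probability:
        return None
    return random.choice(instance_ids)
```

If your instances really are stateless and auto-scaled, terminating the returned victim should be a non-event: the group replaces it and no state is lost.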
As the collection of tools to enforce different architectural guidelines in their environment grew, Netflix formalised and documented these techniques, and published their famous Chaos Engineering book which I highly recommend to anyone interested in the topic.
Former Netflix cloud architect, now VP Cloud Architecture Strategy at AWS, the aforementioned Adrian Cockcroft, presents this definition of chaos engineering - “Chaos Engineering is an experiment to ensure that the impact of failures is mitigated.”
Indeed, we know that failures happen all the time, and that they shouldn’t impact customers if they are mitigated correctly — so essentially chaos engineering’s main purpose is to uncover failures that are not being mitigated.
The Phases of Chaos Engineering
It’s very important to understand that chaos engineering is NOT about letting monkeys loose or allowing them to break things randomly without a purpose. Chaos engineering is about injecting failures in a controlled environment, through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.
To do that, you have to follow a well-defined, formalised process that will take you from understanding the steady state of the system you are dealing with, to articulating a hypothesis, and finally, verifying and learning from experiments in order to improve the resilience of the system itself.
While it is common to focus on the experiment part of chaos engineering (the failure injection), the before and after are equally, if not more, important, since this is when understanding and learning happen.
1 — Steady State
One of the most important parts of chaos engineering is to first understand the behavior of the system in normal conditions - its steady state.
The key here is not to focus on the internal attributes of the system (CPU, memory, etc.) but to instead look for measurable output that ties together operational metrics and customer experience.
It goes without saying that if you can’t properly measure your system, you can’t monitor drifts in the steady state, or even find one. Invest in measuring everything, from the infrastructure, the network, the application, but also the user experience.
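As a sketch of what a measurable steady state might look like, the hypothetical helper below reduces a window of request logs to two customer-facing metrics: success rate and p99 latency. The `(status_code, latency_ms)` tuple format is an assumption for illustration, not a real log schema.

```python
from statistics import quantiles


def steady_state(requests):
    """Summarise a window of requests into steady-state metrics that tie
    operational data to customer experience.

    `requests` is a list of (status_code, latency_ms) tuples -- a
    hypothetical log format used here for illustration only.
    """
    total = len(requests)
    ok = sum(1 for status, _ in requests if status < 500)
    latencies = sorted(ms for _, ms in requests)
    # quantiles() needs at least two data points; fall back for tiny windows
    p99 = quantiles(latencies, n=100)[-1] if len(latencies) > 1 else latencies[0]
    return {"success_rate": ok / total, "p99_ms": p99}
```

A drift in either number during an experiment is your signal that the injected failure is not being mitigated.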
2 — Hypothesis
Once you’ve nailed your steady state, you can start forming your hypotheses. Example hypotheses are:
- What if this recommendation engine stops?
- What if this load balancer breaks?
- What if caching fails?
- What if latency increases by 300ms?
- What if the master database stops?
Start small and don’t make it more complex than necessary.
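To make one of these hypotheses concrete, here is a minimal sketch of how the "What if latency increases by 300ms?" experiment could be injected at the application level: a decorator that delays every call to a wrapped function. The function names are hypothetical; real tooling would inject the delay at the network or proxy layer.

```python
import functools
import time


def inject_latency(delay_s=0.3):
    """Decorator that adds a fixed delay before the wrapped call,
    simulating the 'latency increases by 300ms' hypothesis."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(delay_s)  # the injected failure
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a downstream call with `@inject_latency()` lets the team observe whether timeouts, retries, and the user experience behave the way everyone predicted on their piece of paper.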
When forming the hypothesis, bring the entire team around the table — the product owner, the technical product manager, backend and frontend developers, designers, architects, etc.
Ask everyone to write their own answer to the “What if …?” on a piece of paper, in private. What you’ll see is that most of the time, everyone will have a different hypothetical answer, and often, some of the team will not have thought about the hypothesis at all.
Spend time understanding how the system should behave during the “What if…?”
Experiment on parts of the system you believe are resilient, not the ones you know will break — after all, that’s the whole point of the experiment.
3 — Design and run the experiment
Start small, and slowly build confidence within your team and your organisation. People will tell you “real production traffic is the only way to reliably capture the system behaviour”. Listen, smile and keep doing what you’re doing — slowly.
Remember “The Tortoise and the Hare” story: slow and steady always wins the race.
Building confidence is key to success. The worst thing you can do is to start chaos engineering in production and fail miserably.
One of the most important things during the experiment phase is understanding the potential blast radius of the experiment and the failure you’re injecting — and minimise it.
Pick one hypothesis. Scope your experiment carefully. Identify the relevant metrics to measure. Notify the organisation. Run your experiment.
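One common way to keep the blast radius small is to scope the fault to a tiny, stable slice of traffic. The hypothetical helper below hashes a request or user id into 100 buckets, so the same id always gets the same treatment and only `percent`% of traffic is ever affected:

```python
import hashlib


def in_blast_radius(request_id, percent=1):
    """Deterministically decide whether a given request falls inside the
    experiment's blast radius.

    Hashing (rather than random sampling) keeps the affected population
    stable across requests, which makes the experiment's impact easier
    to measure and to roll back.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```

Start at 1%, watch your steady-state metrics, and only widen the bucket once the team is confident the failure is mitigated.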
4 — Learn and verify
In order to learn and verify, you need to measure. As stated previously, invest in measuring everything! Then, quantify the results and always start with measuring the time to detect.
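Time to detect is simple to compute but easy to forget to record. A minimal sketch, assuming you timestamp both the injection and the first alarm:

```python
from datetime import datetime, timedelta


def time_to_detect(injected_at, first_alarm_at):
    """Time between injecting the failure and the first alarm firing --
    the first number to quantify after every experiment."""
    ttd = first_alarm_at - injected_at
    if ttd < timedelta(0):
        raise ValueError("alarm fired before injection -- check your timestamps")
    return ttd
```

If this number surprises you, that is itself a finding: your monitoring saw the failure later than you assumed it would.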
Do a postmortem for every experiment — every single one of them. At AWS we invest a lot of time making sure every postmortem is a deep-dive on the issues found to understand the reason(s) why failure happened and to prevent a similar failure in the future.
One of the most important guidelines for writing a good postmortem is to be blameless and avoid identifying individuals by name. This is often challenging in an environment that doesn’t encourage such behavior and doesn’t embrace failure.
5 — Improve and fix it!
The most important lesson here is to prioritise fixing the findings of your chaos experiments over developing new features. Get upper management to enforce that process and buy into the idea that fixing current issues is more important than continuing the development of new features.
The benefits of chaos engineering
The benefits are multiple, but I’ll outline a few I think are the most important:
- Build confidence in your application to withstand turbulent conditions and uncover the unknowns in your system and fix them before they happen in production at 3am.
- Validate your monitoring and observability. By injecting failure into the system, you will often find blind spots in your monitoring. Remember that being able to observe, measure, and alarm on failures before they get out of control is critical.
- Improve skills by practicing handling the unexpected, and thus in turn improve recovery time.
- Finally, a successful chaos engineering practice always generates a lot more changes than anticipated, and these are mostly cultural. Probably the most important of these changes is a natural evolution towards a “non-blaming” culture: the “Why did you do that?” turns into a “How can we avoid doing that in the future?” — resulting in happier and more efficient, empowered, engaged and successful teams. And that’s gold!
I’ll leave it there but for a deeper, more expanded version of this article, click here.
Register for the AWS Resiliency and Chaos Engineering Online Series
If this topic interests you, don’t miss our 2-day webinar, the AWS Resiliency and Chaos Engineering Online Series, fast approaching on October 27-28, 2020 from 11:00am to 2:00pm AEST each day.
Register here and we’ll see you online!