Amazon Web Services (AWS) has published a post-mortem of its Easter cloud outage that provides the IT industry a unique opportunity to study what technologies the world's largest provider of cloud computing uses to provide resilient services.
The announcement also provided a detailed description of how the AWS cloud is designed, what went wrong during the outage, and what the industry can learn to better prepare for similar events in the future.
The trigger for the outage - referred to on Amazon's service status page only as a "network event" - was a mistake made during a scheduled upgrade of capacity on the primary network for Amazon's Elastic Block Storage (EBS) service, which underpins many other AWS services.
The mistake caused all traffic that would normally use the primary, high-capacity network to instead use a second, lower-capacity network designed for reliable communications and overflow capacity. The secondary network was quickly overwhelmed, which triggered a cascade of other issues, resulting in a "re-mirroring storm".
EBS consists of clusters of storage nodes, connected in a peer-to-peer fashion, with each node storing a replica of EBS data "volumes". These volumes are used for data read and write operations.
EBS clusters are grouped into Availability Zones where each Zone contains, it appears, a single EBS cluster. Availability Zones are further grouped into geographical Regions, where each Region operates independently.
Each cluster also runs a variety of services that are collectively referred to as the "control plane", which is distributed across Availability Zones in the Region to provide availability and fault tolerance.
This technical description of EBS clusters resembles the architecture of IBM's XIV storage array, where portions of data are replicated to other physical locations, and if a node or disk fails, new replicas are built to ensure no data is lost. Google's GFS, which replicates data "chunks" across servers, is another example of the same technique.
When a node fails, and re-mirroring is required, EBS uses the control plane to find a suitable target node to replicate data to. During the outage, a large number of EBS nodes assumed replica destinations had failed because they lost network connectivity, causing a storm of control messages.
As previously described on iTnews, the control plane essentially experienced a distributed denial-of-service attack.
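The re-mirroring interaction described above can be sketched in a few lines of Python. This is purely illustrative: the class names, capacities, and selection logic are assumptions for the sake of the example, not Amazon's actual implementation. The point it shows is that every node which believes its replica has failed issues its own request, so a mass loss of connectivity translates directly into control-plane load.

```python
# Hypothetical sketch of EBS-style re-mirroring. All names and logic are
# illustrative assumptions, not Amazon's real design.

class ControlPlane:
    """Answers re-mirror requests by finding a node with spare capacity."""
    def __init__(self):
        self.requests_handled = 0  # proxy for control-plane load

    def find_replica_target(self, nodes, exclude):
        self.requests_handled += 1
        for node in nodes:
            if node is not exclude and node.has_capacity():
                return node
        return None  # cluster "stuck": nowhere left to re-mirror into

class StorageNode:
    def __init__(self, name, capacity=2):
        self.name = name
        self.capacity = capacity
        self.volumes = []

    def has_capacity(self):
        return len(self.volumes) < self.capacity

def remirror(control_plane, nodes, volume, failed_peer):
    """A node that lost contact with its replica asks for a new target."""
    target = control_plane.find_replica_target(nodes, exclude=failed_peer)
    if target is not None:
        target.volumes.append(volume)
    return target

# When many nodes lose connectivity at once, each issues its own request:
cp = ControlPlane()
nodes = [StorageNode(f"node-{i}") for i in range(4)]
for v in ("vol-a", "vol-b", "vol-c"):
    remirror(cp, nodes, v, failed_peer=nodes[0])
print(cp.requests_handled)  # -> 3: one control-plane request per "failure"
```

In the real outage this effect was multiplied across a large share of a cluster at once, which is what turned an ordinary recovery mechanism into the "re-mirroring storm".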
Other AWS services, such as the Relational Database Service (RDS) and Elastic Compute Cloud (EC2) instances, rely on the EBS control plane to, for example, know where the primary, writeable copy of a data volume is located. This was particularly challenging for RDS instances, which often make use of numerous EBS volumes; if any EBS volume was affected, the RDS instance would also be affected.
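The RDS exposure can be made concrete with some simple probability arithmetic. Assuming (purely for illustration) that each EBS volume is independently affected with probability p - the real failures were correlated, so this understates the risk - an instance striped across N volumes is impaired if any one of them is:

```python
# Illustrative arithmetic only: assumes independent volume failures,
# which the actual outage did not exhibit. p_volume is a made-up figure.
def p_instance_affected(p_volume: float, n_volumes: int) -> float:
    """Probability that at least one of n_volumes volumes is affected."""
    return 1 - (1 - p_volume) ** n_volumes

print(round(p_instance_affected(0.1, 1), 3))  # one volume: 0.1
print(round(p_instance_affected(0.1, 8), 3))  # eight volumes: 0.57
```

Even under this optimistic independence assumption, striping across more volumes sharply increases the chance that an RDS instance sees at least one impaired volume.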
Amazon's outage demonstrates that if an application requires superior uptime, relying on a single AWS Region places it at risk during a control plane outage.
In response to the outage, Amazon has committed to making it easier for customers to take advantage of multiple Availability Zones, and to better isolating the control plane from issues in any one Availability Zone. Still, for maximum resilience, application designers should consider making use of multiple Regions, though this will no doubt add complexity and cost.
Amazon has used a variety of standard industry techniques (clustered servers, redundant networks, separation of data and control), coupled with some clever application programming, to provide cloud services that are unparalleled in their stability and performance.
And yet, as we have seen, perfection eludes us, as human error can expose design flaws that have widespread and costly consequences.
However, by providing this detailed description of how AWS, and EBS in particular, works, Amazon has provided a tremendous opportunity to ensure that IT architects design applications to cater for the possible failure modes of the infrastructure on which they run.