Analysis: Lifting the covers off Amazon's cloud outage

 

Post-mortem reveals some of the magic under the hood.

Amazon Web Services (AWS) has published a post-mortem of its Easter cloud outage, giving the IT industry a rare opportunity to study the technologies the world's largest cloud computing provider uses to deliver resilient services.

The post-mortem was published to provide a detailed breakdown of what caused major outages on April 21 that took down many popular websites, including Reddit and Foursquare.

The announcement also provided a detailed description of how the AWS cloud is designed, what went wrong during the outage, and what the industry can learn to better prepare for similar events in the future.

The trigger for the outage - referred to on Amazon's service status page only as a "network event" - was a mistake made during a scheduled upgrade of capacity on the primary network for Amazon's Elastic Block Store (EBS) service, which underpins AWS.

The mistake caused all traffic that would normally use the primary, high-capacity network to instead use a second, lower-capacity network designed for reliable communications and overflow capacity. The secondary network was quickly overwhelmed, which triggered a cascade of other issues, resulting in a "re-mirroring storm".

EBS consists of clusters of storage nodes, connected in a peer-to-peer fashion, with each node storing a replica of EBS data "volumes". These volumes are used for data read and write operations.

EBS clusters are grouped into Availability Zones where each Zone contains, it appears, a single EBS cluster. Availability Zones are further grouped into geographical Regions, where each Region operates independently.

Each cluster also runs a variety of services that are collectively referred to as the "control plane", which is distributed across Availability Zones in the Region to provide availability and fault tolerance.
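
The layering described above can be pictured as a small data model. This is only an illustrative sketch: the class names, the node and volume names, and the one-cluster-per-Zone layout (which the article itself only infers) are assumptions, not Amazon's actual design.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class StorageNode:
    """A storage node holding replicas of some volumes (hypothetical model)."""
    name: str
    volumes: Set[str] = field(default_factory=set)


@dataclass
class AvailabilityZone:
    """A Zone, modelled here as a single cluster of storage nodes."""
    name: str
    nodes: List[StorageNode] = field(default_factory=list)


@dataclass
class Region:
    """A Region groups several Zones and operates independently of other Regions."""
    name: str
    zones: List[AvailabilityZone] = field(default_factory=list)

    def replica_locations(self, volume: str) -> List[Tuple[str, str]]:
        """Return (zone, node) pairs for every node holding a replica of the volume."""
        return [
            (zone.name, node.name)
            for zone in self.zones
            for node in zone.nodes
            if volume in node.volumes
        ]


# Made-up example data: vol-1 is mirrored on two nodes in the same Zone.
region = Region(
    name="region-a",
    zones=[
        AvailabilityZone("zone-1", [
            StorageNode("node-1", {"vol-1"}),
            StorageNode("node-2", {"vol-1", "vol-2"}),
        ]),
        AvailabilityZone("zone-2", [
            StorageNode("node-3", {"vol-3"}),
        ]),
    ],
)

print(region.replica_locations("vol-1"))
```

In this picture the control plane would be the service that answers questions like `replica_locations` for the whole Region, which is why trouble in one Zone can be felt in the others.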

This technical description of EBS clusters resembles the architecture of IBM's XIV storage array, where portions of data are replicated to other physical locations, and if a node or disk fails, new replicas are built to ensure no data is lost. The "chunks" used by the Google File System (GFS) are another example of the same technique.

When a node fails, and re-mirroring is required, EBS uses the control plane to find a suitable target node to replicate data to. During the outage, a large number of EBS nodes assumed replica destinations had failed because they lost network connectivity, causing a storm of control messages.
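
A toy model helps show why this turns into a storm. The sketch below is not Amazon's implementation: the node count, the control plane capacity and the rule that every stranded node asks again on every tick are all invented for illustration.

```python
def simulate_storm(stranded, capacity, ticks):
    """Model stranded nodes repeatedly asking the control plane for a new replica target.

    stranded: nodes that believe their replica has failed (assumed figure)
    capacity: requests the control plane can answer per tick (assumed figure)
    Returns one (requests, served, still_waiting) tuple per tick.
    """
    waiting = stranded
    history = []
    for _ in range(ticks):
        requests = waiting                # every waiting node asks again this tick
        served = min(requests, capacity)  # the control plane can only answer so many
        waiting = requests - served
        history.append((requests, served, waiting))
    return history


history = simulate_storm(stranded=1000, capacity=100, ticks=5)
for tick, (requests, served, waiting) in enumerate(history, start=1):
    print(f"tick {tick}: {requests} requests, {served} served, {waiting} still waiting")
```

Even in this simplified form, the control plane receives many times its capacity in requests on every tick, and the backlog drains only slowly. That is the overload pattern the post-mortem describes.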

As previously described on iTnews, the control plane essentially experienced a distributed denial-of-service attack.

Other AWS services, such as the Relational Database Service (RDS) and Elastic Compute Cloud (EC2) instances, rely on the EBS control plane to, for example, know where the primary, writeable copy of a data volume is located. This was particularly challenging for RDS instances, which often make use of numerous EBS volumes; if any EBS volume was affected, the RDS instance would also be affected.
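
The RDS exposure can be put in rough numbers. The sketch below assumes, purely for illustration, that each volume is affected independently with the same probability; in a real outage failures are correlated, so treat the result as an intuition aid rather than a prediction. The function name and the figures are hypothetical.

```python
def instance_impact_probability(p_volume, n_volumes):
    """Chance that at least one of n independent volumes is affected.

    p_volume: assumed probability that any single volume is affected
    n_volumes: number of volumes the instance depends on
    """
    if not 0.0 <= p_volume <= 1.0:
        raise ValueError("p_volume must be between 0 and 1")
    if n_volumes < 0:
        raise ValueError("n_volumes must not be negative")
    # The instance is unaffected only if every one of its volumes is unaffected.
    return 1.0 - (1.0 - p_volume) ** n_volumes


single = instance_impact_probability(0.10, 1)
striped = instance_impact_probability(0.10, 5)
print(f"1 volume: {single:.2f}, 5 volumes: {striped:.2f}")
```

With these made-up inputs, an instance spread across five volumes is roughly four times as likely to be hit as an instance using one.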

Lessons learned

Amazon's outage demonstrates that if an application requires superior uptime, relying on a single AWS Region places it at risk during a control plane outage.

In response to the outage, Amazon has committed to making it easier for customers to take advantage of multiple Availability Zones, and to better isolate the control plane from issues in any one Availability Zone. Still, for maximum resilience, application designers should consider making use of multiple Regions, though this will no doubt add complexity and cost.
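
At the application level, using multiple Regions often comes down to trying an alternative when the preferred one fails. The following is a minimal sketch of that idea, not an AWS API: the Region names, the `request` callable and the exception class are all hypothetical, and a real design would also have to replicate data between Regions.

```python
from typing import Callable, Dict, Sequence


class AllRegionsUnavailable(Exception):
    """Raised when every configured Region has failed (hypothetical helper)."""


def call_with_region_failover(regions: Sequence[str],
                              request: Callable[[str], str]) -> str:
    """Try each Region in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next Region.
            errors[region] = exc
    raise AllRegionsUnavailable(f"no Region responded: {sorted(errors)}")


def fake_request(region: str) -> str:
    """Stand-in for a real service call; region-a simulates an outage."""
    if region == "region-a":
        raise ConnectionError("control plane unavailable")
    return f"served from {region}"


result = call_with_region_failover(["region-a", "region-b"], fake_request)
print(result)
```

The extra complexity and cost mentioned above shows up even here: the caller must know about every Region, and each Region must be able to serve the request on its own.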

Amazon has used a variety of standard industry techniques (clustered servers, redundant networks, separation of data and control), coupled with some clever application programming, to provide cloud services that are unparalleled in their stability and performance.

And yet, as we have seen, perfection eludes us, as human error can expose design flaws that have widespread and costly consequences. 

However, by providing this detailed description of how AWS, and EBS in particular, works, Amazon has provided a tremendous opportunity to ensure that IT architects design applications to cater for the possible failure modes of the infrastructure on which they run.

Copyright © iTnews.com.au . All rights reserved.


Storage guru, Justin Warren.
Comments: 1
realitybites
May 3, 2011 3:29 PM
Any increase in complexity relates directly to an exponential increase in probability of failure.

Follow the "KISS" principle.