iTnews

Analysis: Lifting the covers off Amazon's cloud outage

By Justin Warren
May 3 2011 6:43AM

Post-mortem reveals some of the magic under the hood.

Amazon Web Services (AWS) has published a post-mortem of its Easter cloud outage that gives the IT industry a rare opportunity to study the technologies the world's largest provider of cloud computing uses to deliver resilient services.

The post-mortem was published to provide a detailed breakdown of what caused major outages on April 21 that took down many popular websites, including Reddit and Foursquare.

The announcement also provided a detailed description of how the AWS cloud is designed, what went wrong during the outage, and what the industry can learn to better prepare for similar events in the future.

The trigger for the outage, referred to on Amazon's service status page only as a "network event", was a mistake made during a scheduled capacity upgrade of the primary network for Amazon's Elastic Block Store (EBS) service, which underpins other AWS services.

The mistake caused all traffic that would normally use the primary, high-capacity network to instead use a second, lower-capacity network designed for reliable communications and overflow capacity. The secondary network was quickly overwhelmed, which triggered a cascade of other issues, resulting in a "re-mirroring storm".
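
As a rough illustration of why the secondary network was overwhelmed, the toy model below tracks how undelivered traffic accumulates when the offered load exceeds a link's capacity. The function name and all figures are hypothetical; Amazon did not publish the capacities of either network.

```python
def backlog_over_time(offered_per_tick, capacity_per_tick, ticks):
    """Return the queued (undelivered) traffic after each tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        # Traffic that cannot be sent this tick carries over to the next one.
        backlog = max(0, backlog + offered_per_tick - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical units: the primary network carries the load comfortably.
print(backlog_over_time(offered_per_tick=80, capacity_per_tick=100, ticks=5))
# The same load on a much smaller secondary network piles up every tick.
print(backlog_over_time(offered_per_tick=80, capacity_per_tick=10, ticks=5))
```

In this sketch the backlog on the smaller link grows without bound for as long as the load persists, which is the sense in which a low-capacity network is "overwhelmed" rather than merely slowed.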

EBS consists of clusters of storage nodes, connected in a peer-to-peer fashion, with each node storing replicas of EBS data "volumes". These volumes are used for data read and write operations.

EBS clusters are grouped into Availability Zones where each Zone contains, it appears, a single EBS cluster. Availability Zones are further grouped into geographical Regions, where each Region operates independently.

Each cluster also runs a variety of services that are collectively referred to as the "control plane", which is distributed across Availability Zones in the Region to provide availability and fault tolerance.

This technical description of EBS clusters resembles the architecture of IBM's XIV storage array, where portions of data are replicated to other physical locations, and if a node or disk fails, new replicas are built to ensure no data is lost. Google's GFS, which replicates "chunks" of data, is another example of the same technique.

When a node fails, and re-mirroring is required, EBS uses the control plane to find a suitable target node to replicate data to. During the outage, a large number of EBS nodes assumed replica destinations had failed because they lost network connectivity, causing a storm of control messages.
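
The difference between an ordinary node failure and the storm described above can be sketched with a small model. It assumes two copies of each volume and one control-plane request per holder for each peer it cannot reach; the replica count, node names and function name are hypothetical and are not taken from Amazon's post-mortem.

```python
def count_remirror_requests(replicas, isolated):
    """Count control-plane requests raised by replica holders.

    replicas: dict mapping a volume id to the list of nodes holding a copy.
    isolated: set of nodes that have lost network connectivity.
    A holder cannot reach a peer if either of the two is isolated, and it
    then asks the control plane for a new node to re-mirror to.
    """
    requests = 0
    for nodes in replicas.values():
        for holder in nodes:
            for peer in nodes:
                if peer != holder and (holder in isolated or peer in isolated):
                    requests += 1
    return requests


# Hypothetical layout: three volumes, each mirrored on two nodes.
replicas = {
    "vol-1": ["n1", "n2"],
    "vol-2": ["n2", "n3"],
    "vol-3": ["n3", "n4"],
}

# One node drops out: only the volume it shares generates requests.
print(count_remirror_requests(replicas, {"n4"}))
# Every node loses connectivity: every volume generates requests at once.
print(count_remirror_requests(replicas, {"n1", "n2", "n3", "n4"}))
```

Even in this tiny model, a cluster-wide loss of connectivity turns a handful of requests into one from every replica of every volume, all aimed at the same control plane.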

As previously described on iTnews, the control plane essentially experienced a distributed denial-of-service attack.

Other AWS services, such as the Relational Database Service (RDS) and Elastic Compute Cloud (EC2) instances, rely on the EBS control plane to, for example, know where the primary, writeable copy of a data volume is located. This was particularly challenging for RDS instances, which often make use of numerous EBS volumes; if any EBS volume was affected, the RDS instance would also be affected.
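
The exposure of an instance that spans several volumes can be estimated with simple arithmetic. The sketch below assumes each volume is affected independently with the same probability, which is a simplification (failures during the outage were clearly correlated), and the figures are hypothetical.

```python
def instance_impact_probability(volume_failure_prob, volume_count):
    """Probability that at least one of an instance's volumes is affected.

    Assumes volumes fail independently with the same probability.
    """
    return 1 - (1 - volume_failure_prob) ** volume_count


# Hypothetical 5% chance that any given volume is affected.
for count in (1, 4, 8):
    print(count, round(instance_impact_probability(0.05, count), 3))
```

Under these assumptions, the chance of an instance being hit rises steadily with the number of volumes it depends on, which is why multi-volume RDS instances were more exposed than a workload on a single volume.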

Lessons learned

Amazon's outage demonstrates that if an application requires superior uptime, relying on a single AWS Region places it at risk during a control plane outage.

In response to the outage, Amazon has committed to making it easier for customers to take advantage of multiple Availability Zones, and to better isolate the control plane from issues in any one Availability Zone. Still, for maximum resilience, application designers should consider making use of multiple Regions, though this will no doubt add complexity and cost.
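
At the application level, using multiple Regions can be as simple in principle as trying a second Region when the first fails. The sketch below is a minimal illustration only: the function, the region names and the `request` callable are hypothetical and do not correspond to any AWS API, and keeping data in sync between Regions, where much of the added complexity and cost lies, is not shown.

```python
def call_with_region_failover(regions, request):
    """Try each region in order; return (region, result) from the first success.

    regions: ordered list of region names, most preferred first.
    request: callable taking a region name and returning a result,
             raising an exception if that region cannot serve it.
    """
    errors = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # a real client would catch narrower errors
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")


def fake_request(region):
    """Stand-in for a real service call; 'region-a' is simulated as down."""
    if region == "region-a":
        raise ConnectionError("control plane unavailable")
    return "served by " + region


print(call_with_region_failover(["region-a", "region-b"], fake_request))
```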

Amazon has used a variety of standard industry techniques (clustered servers, redundant networks, separation of data and control), coupled with some clever application programming, to provide cloud services that are unparalleled in their stability and performance.

And yet, as we have seen, perfection eludes us: human error can expose design flaws that have widespread and costly consequences.

However, by providing this detailed description of how AWS, and EBS in particular, works, Amazon has provided a tremendous opportunity to ensure that IT architects design applications to cater for the possible failure modes of the infrastructure on which they run.

Copyright © iTnews.com.au. All rights reserved.
Tags: amazon, amazon web services, aws, cloud computing, cloud storage, ebs, public cloud, storage
