A large-scale outage affecting Amazon Web Services' Elastic Compute Cloud (EC2) over the Easter break has highlighted one of the many risks associated with running thousands of applications from large clusters of virtualised servers.

The outage - which began Thursday last week and was still affecting a number of Amazon customers as late as Tuesday - took out popular web services including Foursquare, FormSpring, Heroku, HootSuite, Quora and Reddit.
It also impacted many IT service providers that use Amazon infrastructure as part of the services they deliver to end users, such as cloud management software vendor RightScale.
The outage began Thursday at 6pm (Sydney time), when customers hosting applications on Amazon's cloud compute service (EC2) in the US-EAST-1 region (Northern Virginia) began experiencing connectivity problems, high latency and elevated error rates.
On any normal day, Amazon’s Elastic Block Store (EBS) - in effect a giant storage area network - dynamically allocates volumes of storage capacity to the thousands of physical servers hosting Amazon EC2 virtual server instances and to applications using Amazon’s Relational Database Service (RDS).
According to Amazon's status updates, a mysterious network event caused the software monitoring the EC2 network to incorrectly calculate that there was insufficient redundancy available to meet the needs of these server and database instances.
The software automatically attempted to redistribute resources around the network to compensate – in effect a mass re-mirroring of storage volumes that flooded the network with traffic.
In the chaos that ensued, Amazon ran out of spare capacity in the affected availability zone of its US-EAST-1 region – with services failing faster than Amazon engineers could re-provision them.
Six hours after the unexplained network event, Amazon's technicians reported to customers that EBS-backed instances in the US-EAST-1 region were "failing at a high rate."
"Effectively, [Amazon's high availability software] launched a Denial-of-Service attack on their own infrastructure," explained Matt Moor, technical architect at a Sydney-based Amazon customer Bulletproof Networks, which fortunately relies primarily on providing services from its own infrastructure in Australia.
The outage initially affected multiple Amazon ‘availability zones’ (data centres) but by Friday morning, the company reported that most server instances across its compute cloud were again operational - with the exception of applications hosted in a single availability zone within US-EAST-1.
Customers with applications hosted in this zone suffered outages well into the weekend as Amazon engineers struggled to bring the large volume of services back online.
"The work we're doing to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone is taking considerably more time than we anticipated," Amazon's engineers reported on the company's status page.
Instances had to be re-provisioned slowly, the company reported, "in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions."
By Monday, Amazon reported that most instances were operational, advising those customers still experiencing issues to "stop and restart your instance in order to restore connectivity."
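For customers managing their instances programmatically, that stop-and-restart advice amounts to a pair of API calls. The sketch below is a minimal illustration using the modern boto3 Python SDK (which did not exist at the time of the outage); the instance ID and region are placeholders, not details from Amazon's advisory.

```python
# Minimal sketch: stop and restart an EBS-backed EC2 instance, assuming boto3
# and valid AWS credentials. Instance ID and region are hypothetical.
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

ec2 = boto3.client("ec2", region_name=REGION)

# Stop the instance and wait until it has fully stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

# Start it again; an EBS-backed instance comes back with its root volume
# reattached, typically on different underlying hardware.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

print("Instance restarted:", INSTANCE_ID)
```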
The cloud compute operator said it was contacting those customers hosted in the affected zone whose services could not immediately be restored.
Customers were particularly critical of Amazon’s refusal to reveal what percentage of customers or precisely which availability zones were back online at any stage of the outage.
Amazon Web Services staff - usually quite chatty on social networks - did not respond to any questions or requests for help during the outage.
At the time of writing, Amazon had yet to publish a post-mortem explaining the network event that caused the system to malfunction.
The last message on Amazon's service status page said that once customers were "fully back up and running" the company would “post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesn’t happen again.”
Thorsten von Eicken, CTO and founder of RightScale, described it as the “worst outage in cloud computing history” and a “wake-up call” to the industry – suggesting that the “ripple effects” from the initial failure through to other Amazon services “should not have happened.”
Public cloud customers would be wise to review backup and restore processes and ensure they straddle multiple geographies and services, von Eicken advised in a blog post.
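What “straddling multiple geographies” looks like in practice varies, but one common pattern today is copying EBS snapshots into a second region. The following is a minimal sketch, assuming boto3 and placeholder volume and region names; cross-region snapshot copy reflects current AWS tooling rather than what customers had available during the outage.

```python
# Minimal sketch: snapshot an EBS volume and copy the snapshot to a second
# region, assuming boto3 and valid credentials. IDs and regions are placeholders.
import boto3

SOURCE_REGION = "us-east-1"
BACKUP_REGION = "us-west-2"
VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID

source = boto3.client("ec2", region_name=SOURCE_REGION)
backup = boto3.client("ec2", region_name=BACKUP_REGION)

# Snapshot the volume in the source region and wait for it to complete.
snap = source.create_snapshot(VolumeId=VOLUME_ID,
                              Description="off-region backup")
source.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Copy the completed snapshot into the backup region, so a failure confined
# to one region does not take the backup down with it.
copy = backup.copy_snapshot(SourceRegion=SOURCE_REGION,
                            SourceSnapshotId=snap["SnapshotId"],
                            Description="copy of " + snap["SnapshotId"])
print("Backup snapshot in", BACKUP_REGION, ":", copy["SnapshotId"])
```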
Amazon customers had found themselves victims of vendor lock-in, he said, as its database-as-a-service product does not allow customers to fail over to third-party providers.
“The obvious failure here is compounded by the fact that Amazon has made it difficult for users to backup their databases outside of RDS, leaving them no choice but to wait for someone at Amazon to work on their database. This lock-in is one reason many of our customers prefer to use our MySQL master-slave setup or to architect their own,” von Eicken said.
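One way to keep a copy of a database outside RDS, along the lines von Eicken suggests, is a periodic logical dump pushed to storage in another region. The sketch below is only an illustration, assuming boto3, a local mysqldump binary, and placeholder endpoint, user, database and bucket names; it is not RightScale's or Amazon's recommended procedure.

```python
# Minimal sketch: take a logical dump of a MySQL database (such as one hosted
# on RDS) and push it to an S3 bucket in another region. All names are placeholders.
import subprocess
import boto3

DB_HOST = "mydb.example.us-east-1.rds.amazonaws.com"  # hypothetical endpoint
DB_USER = "backup_user"
DB_NAME = "appdb"
DUMP_FILE = "/tmp/appdb.sql"
BUCKET = "my-offsite-backups"   # hypothetical bucket in another region
BACKUP_REGION = "us-west-2"

# mysqldump connects over the network, so it works against a managed endpoint
# the same way it does against a self-managed MySQL server. Credentials are
# assumed to come from an option file rather than the command line.
with open(DUMP_FILE, "w") as out:
    subprocess.run(
        ["mysqldump", "--host", DB_HOST, "--user", DB_USER,
         "--single-transaction", DB_NAME],
        stdout=out, check=True)

# Upload the dump to a bucket outside the affected region.
s3 = boto3.client("s3", region_name=BACKUP_REGION)
s3.upload_file(DUMP_FILE, BUCKET, "appdb.sql")
```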
Lorenzo Modesto, COO at Bulletproof Networks, said the outage reinforced fears within the Australian business community about the use of large-scale public compute clouds.
"Ironically, the outage was sparked by the very high-availability features built into the services themselves," he noted.
"The additional complexity Amazon has architected into their services to deliver the availability required by mission critical applications has introduced an exceptionally long and complex outage, which they're still struggling with after several days."
Telcos building their own compute clouds should take note, Modesto said, that a "cookie cutter approach" to delivering services won't suffice for the enterprise, which requires closer collaboration between hosting providers and application developers.