Growing pains: Amazon EC2 suffers huge outage

 
Described as the "worst outage in cloud computing history".

A large-scale outage affecting Amazon Web Services' Elastic Compute Cloud (EC2) over the Easter break has highlighted one of the many risks associated with running thousands of applications from large clusters of virtualised servers.

The outage - which began on Thursday last week and was still affecting a number of Amazon customers as late as Tuesday - took out popular web services including Foursquare, FormSpring, Heroku, HootSuite, Quora and Reddit.

It also affected many IT service providers that use Amazon as part of their total solution for end users, such as cloud management software vendor RightScale.

The outage began Thursday at 6pm (Sydney time), when customers hosting applications on Amazon's cloud compute service (EC2) at the US-EAST-1 data centre (in Virginia) began experiencing connectivity problems, increased latency and elevated error rates.

On any normal day Amazon’s EBS - a giant storage area network - dynamically distributes small volumes of storage capacity to thousands of the physical servers hosting Amazon EC2 virtual server instances and to applications using Amazon’s Relational Database Service (RDS).

According to Amazon's status updates, a mysterious network event caused the software that monitors the EC2 network to incorrectly calculate that there was insufficient redundancy available to meet the needs of these server and database instances.

The software automatically attempted to move resources around the network to adjust – in effect a mass re-mirroring of storage that flooded the network with traffic.

In the chaos that ensued, Amazon ran out of capacity in its US-EAST-1 Availability Zone – with services failing faster than Amazon engineers could re-provision them.

Six hours after the unexplained network event, Amazon's technicians reported to customers that EBS-backed instances in the US-EAST-1 region were "failing at a high rate."

"Effectively, [Amazon's high availability software] launched a Denial-of-Service attack on their own infrastructure," explained Matt Moor, technical architect at Sydney-based Amazon customer Bulletproof Networks, which fortunately relies primarily on providing services from its own infrastructure in Australia.

The outage initially affected multiple Amazon ‘availability zones’ (data centres) but by Friday morning, the company reported that most server instances across its compute cloud were again operational - with the exception of any applications hosted in the US-EAST-1 Availability Zone.

Customers with applications hosted in this zone suffered outages well into the weekend as Amazon engineers struggled to bring the large volume of services back online.

"The work we're doing to enable customers to be able to launch EBS backed instances and create, delete, attach and detach EBS volumes in the affected Availability Zone is taking considerably more time than we anticipated," Amazon's engineers reported on the company's status page.

Instances had to be re-provisioned slowly, the company reported, "in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions."

By Monday, Amazon reported that most instances were operational, advising those customers still experiencing issues to "stop and restart your instance in order to restore connectivity."


Copyright © iTnews.com.au. All rights reserved.


Comments: 3
BaysNet
Apr 27, 2011 1:19 PM
The takeaway here is to look for the interoperability and portability features of your cloud provider so YOU can take back control of your apps and data when and how you want. Elastic EC2 may be, but if you stretch it too far it does break!
Lloyd
Apr 27, 2011 5:08 PM
I followed this 80+ hour sorry saga over Easter (e.g. via the AWS Service Health Dashboard), and also the responses of AMZN clients such as dotcloud.com. I know what it's like to have your server down and sympathise with the clients and the guys at AMZN trying to understand and fix the problem.

Before we even know what the cause is, we have people looking at SLA agreements (e.g. Lydia Leong at Gartner, who pointed out that the EC2 guarantee does not apply to component system (e.g. EBS and RDS) failure).

And then, to put an Australian perspective on this, I looked at the reassuring advice in the DSD "Cloud Computing Considerations" (page 6), which suggests clients ensure that "The Service Level Agreement (SLA) GUARANTEES adequate system availability". Uh .. huh ...

The responses of clients were frustration and acceptance of the realisation of a worst scenario risk. Bottom line is to recognise that the whole idea of the internet (from the DARPANet onwards) was to ensure that you don't have all your eggs in one basket. Use multiple Availability Zones, and - if your architecture can afford it - multiple suppliers.

Our business has multiple services with a mix of "close to the metal" (for production) and virtual (for backup and staging) environments, and we use virtual environments (including Rackspace cloud) for development. It's all about performance and risk management.
DJ
May 2, 2011 6:50 PM
Too many nuffies pushing the "cloud world" as the new way of outsourcing.

When you move to the cloud you put your systems in the hands of someone else - beware.

Maintain backups and your own copies of anything you outsource.... or pay the price when the "cloud" rains and takes your business systems offline.

If it sounds too good to be true.....