Amazon's botched backup causes cloud chaos

 

Admits software bug also caused it to delete storage snapshots.

A series of failures in the power infrastructure supporting Amazon’s Elastic Compute Cloud (EC2) data centre in Ireland caused a rolling European service outage last week.

It was Amazon's first major cloud outage since the one in May, which was described at the time as the worst in cloud computing history.

Amazon’s backup generators failed to kick in after its energy supplier suffered a massive power outage on the night of Sunday, August 7, according to a post-incident report.

Power was lost to almost all EC2 instances in Amazon’s European Western zone, as well as RDS database instances, 58 per cent of its Elastic Block Store (EBS) volumes and the EC2 networking gear that connected the zone to the internet.

The failures were traced to Amazon’s Programmable Logic Controllers, which were meant to switch electrical load to backup generators. 

The PLCs failed after detecting “a large ground fault”, similar to what happens when an electrical wire becomes earthed.

The PLCs activated some backup generators but left the burden of running Amazon’s cloud largely on UPS batteries, which “quickly drained”, Amazon said.

Amazon had since added more redundancy and isolation to its PLCs “so they are insulated from other failures” and was preparing a “cold, environmentally isolated” backup PLC that it planned to deploy soon. 

The power outage was initially thought to have been caused by a lightning strike but Amazon’s supplier ruled that out and was still investigating the root cause of the initial power failure.

The timing and severity of the outage hampered Amazon’s efforts to recover Elastic Block Store (EBS) volumes, which are normally replicated across a network of nodes that serve read and write requests to EC2 instances, Amazon explained.

“There were delays as this was nighttime in Dublin and the logistics of trucking required mobilising transportation some distance from the data centre,” the company said.

When the outage hit, affected EBS nodes attempted to mirror across available nodes within the region, but many nodes were unable to complete the process before running out of capacity, causing “a number” of customers’ volumes to become “stuck”.

“We ran out of spare capacity before all of the volumes were able to successfully re-mirror,” Amazon explained.

“We brought in additional labor to get more onsite capacity online and trucked in servers from another Availability Zone in the Region.”
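The capacity problem Amazon describes can be illustrated with a minimal sketch. It is not Amazon's re-mirroring logic: the `Volume` type, the `re_mirror` function, the volume names and all sizes are hypothetical, and the placement is a simple first-come greedy pass over a single pool of spare space.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str      # hypothetical volume identifier
    size_gb: int   # space a new replica of this volume would need


def re_mirror(volumes, spare_capacity_gb):
    """Greedily place a new replica for each volume until spare space runs out.

    Returns (re_mirrored, stuck) as two lists of volume names.
    """
    re_mirrored, stuck = [], []
    for vol in volumes:
        if vol.size_gb <= spare_capacity_gb:
            spare_capacity_gb -= vol.size_gb  # reserve space for the new replica
            re_mirrored.append(vol.name)
        else:
            stuck.append(vol.name)  # no room left: volume waits for more capacity
    return re_mirrored, stuck


# Illustrative numbers only: 450 GB of volumes, 320 GB of spare capacity.
volumes = [Volume("vol-a", 100), Volume("vol-b", 200), Volume("vol-c", 150)]
done, stuck = re_mirror(volumes, spare_capacity_gb=320)
print("re-mirrored:", done)
print("stuck:", stuck)
```

In this toy model, the only way to clear the stuck list is to raise `spare_capacity_gb` and run the pass again, which corresponds to the extra servers Amazon says it trucked in.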

The worst case was for customers that concurrently lost power to EC2 instances and all nodes containing EBS volume replicas, which Amazon explained threatened to corrupt data if a volume was brought back with errors. 

In those cases, customers were forced to wait up to three days for Amazon to retrieve a snapshot of the node and transfer it to the Amazon Simple Storage Service (S3).

“While we provided best estimates for the long-lead recovery snapshots, we truly didn’t know how long that process was going to take or we would have shared it,” Amazon explained, defending the quality of its communications with customers throughout the outage. 

Amazon planned to give customers better communication tools in future, including visibility into the state of their own resources.

It expected to deliver the tools for customers to see whether instances or volumes had been impaired in the “next few months”.

It also promised to provide clearer instructions on what customers should do with the recovery snapshots.

“We sometimes assume a certain familiarity with these tools that we should not.”

Amazon has offered customers running applications in the affected zone a 10-day credit equal to 100 per cent of their usage of EBS volumes, EC2 instances and RDS instances.

Why Amazon accidentally deleted EBS snapshots

Amazon’s devastating outage occurred just days after one of its engineers accidentally deleted a batch of EBS volume snapshots.

Amazon has revealed that the deletion was due to a software bug affecting how Elastic Block Store (EBS) takes and deletes snapshots.

“On August 5th, the engineer running the snapshot deletion process checked the blocks flagged for analysis before running the actual deletion process in the EU West Region.

“The human checks in this process failed to detect the error and the deletion process was executed,” it said.

Customers that had EBS volume snapshots deleted will receive a 30-day credit for their EBS usage in the region, it said.
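Amazon's statement does not detail how its cleanup software works, so the following is only a hypothetical sketch of how a flag-then-delete run can go wrong and why a manual check might miss it. It assumes the flagging step works from a list of snapshot references and that the bug leaves that list incomplete. The function names, snapshot IDs and block IDs are all invented for illustration.

```python
def flag_unreferenced(all_blocks, snapshot_refs):
    """Return blocks that no known snapshot references (candidates for deletion)."""
    referenced = set()
    for blocks in snapshot_refs.values():
        referenced.update(blocks)
    return set(all_blocks) - referenced


def safe_delete(all_blocks, snapshot_refs, expected_snapshots):
    """Return blocks to delete, but only if every expected snapshot is listed."""
    missing = set(expected_snapshots) - set(snapshot_refs)
    if missing:
        # Hypothetical automated guard: refuse to run on an incomplete list.
        raise RuntimeError(f"reference list incomplete, missing: {sorted(missing)}")
    return flag_unreferenced(all_blocks, snapshot_refs)


all_blocks = {"b1", "b2", "b3", "b4"}
full_refs = {"snap-1": {"b1", "b2"}, "snap-2": {"b3"}}
partial_refs = {"snap-1": {"b1", "b2"}}  # snap-2 dropped by the assumed bug

print(sorted(flag_unreferenced(all_blocks, full_refs)))     # only b4 is unreferenced
print(sorted(flag_unreferenced(all_blocks, partial_refs)))  # b3 is wrongly flagged too
```

With the partial list, block `b3` looks like any other unreferenced block, so someone reviewing only the flagged set has nothing to indicate it is still in use. A guard such as `safe_delete`, which compares the list against an independent record of expected snapshots, is one possible automated check under these assumptions.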

Copyright © iTnews.com.au . All rights reserved.

