iTnews

Amazon's botched backup causes cloud chaos

By Liam Tung
Aug 17 2011 6:37AM

Admits software bug also caused it to delete storage snapshots.

A series of failures in the power infrastructure supporting Amazon’s Elastic Compute Cloud (EC2) data centre in Ireland caused a rolling European service outage last week.

It was Amazon's first major cloud outage since the one in May, which was described at the time as the worst in cloud computing history.

Amazon’s backup generators failed to kick in after its energy supplier suffered a massive power outage on the night of Sunday, August 7, according to a post-incident report.

Power was lost to almost all EC2 instances in Amazon’s European Western zone, along with RDS database instances, 58 per cent of its Elastic Block Storage (EBS) volumes and the EC2 networking gear that connected the zone to the internet.

The failures were traced to Amazon’s Programmable Logic Controllers, which were meant to switch electrical load to backup generators. 

The PLCs failed after detecting “a large ground fault”, similar to what happens when an electrical wire becomes earthed.

They activated some backup generators but left the burden of running its cloud largely on UPS batteries, which “quickly drained”, Amazon said.

Amazon had since added more redundancy and isolation to its PLCs “so they are insulated from other failures” and was preparing a “cold, environmentally isolated” backup PLC that it planned to deploy soon. 

The power outage was initially thought to have been caused by a lightning strike but Amazon’s supplier ruled that out and was still investigating the root cause of the initial power failure.

The timing and severity of the outage hampered Amazon’s efforts to recover elastic block storage (EBS) volumes, which are normally replicated across a network of nodes that serve read-and-write requests to EC2 instances, Amazon explained.

“There were delays as this was nighttime in Dublin and the logistics of trucking required mobilising transportation some distance from the data centre,” the company said.

When the outage hit, affected EBS nodes attempted to re-mirror across available nodes within the region, but many were unable to complete the process before spare capacity ran out, causing “a number” of customers’ volumes to become “stuck”.

“We ran out of spare capacity before all of the volumes were able to successfully re-mirror,” Amazon explained.

“We brought in additional labor to get more onsite capacity online and trucked in servers from another Availability Zone in the Region.”
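
As a rough illustration of the capacity problem Amazon describes, the following Python sketch models re-mirroring as a first-come, first-served allocation of spare capacity. The `Volume` class, the `remirror` function, the volume names and the sizes are all hypothetical; Amazon has not published how EBS actually places replicas, so this is a toy model under those stated assumptions rather than a description of the real system.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical stand-in for a volume that lost one of its replicas."""
    name: str
    size_gb: int


def remirror(volumes, spare_capacity_gb):
    """Place new replicas first-come, first-served until spare capacity runs out.

    Returns a tuple (remirrored, stuck) of lists of volume names.
    """
    remirrored, stuck = [], []
    remaining = spare_capacity_gb
    for vol in volumes:
        if vol.size_gb <= remaining:
            remaining -= vol.size_gb  # reserve space for the new replica
            remirrored.append(vol.name)
        else:
            stuck.append(vol.name)  # no room left: volume waits for more capacity
    return remirrored, stuck


# Illustrative numbers only; they do not come from Amazon's report.
volumes = [Volume("vol-a", 100), Volume("vol-b", 200), Volume("vol-c", 150)]
done, stuck = remirror(volumes, spare_capacity_gb=300)
print("re-mirrored:", done)
print("stuck:", stuck)
```

In this toy model, bringing more capacity online (as Amazon did by trucking in servers) corresponds to raising `spare_capacity_gb` and re-running the allocation for the volumes that were left stuck.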

The worst-affected customers were those that simultaneously lost power to their EC2 instances and to all nodes containing their EBS volume replicas, a situation that Amazon said threatened to corrupt data if a volume was brought back with errors.

In those cases, customers were forced to wait up to three days for Amazon to retrieve a snapshot of the node and transfer it to the Amazon Simple Storage Service (S3).

“While we provided best estimates for the long-lead recovery snapshots, we truly didn’t know how long that process was going to take or we would have shared it,” Amazon explained, defending the quality of its communications with customers throughout the outage. 

Amazon said it planned to improve its communications with customers in future, including giving them better visibility of the state of their own resources.

It expected to deliver the tools for customers to see whether instances or volumes had been impaired in the “next few months”.

It also promised to provide clearer instructions on what customers should do with the recovery snapshots.

“We sometimes assume a certain familiarity with these tools that we should not,” Amazon said.

Amazon has offered customers running applications in the affected zone a 10-day credit equal to 100 per cent of their usage of EBS volumes, EC2 instances and RDS instances.

Why Amazon accidentally deleted EBS snapshots

Amazon’s devastating outage occurred just days after one of its engineers accidentally deleted a batch of EBS volume snapshots.

Amazon has revealed that the deletion was due to a software bug in the way EBS takes and deletes snapshots.

“On August 5th, the engineer running the snapshot deletion process checked the blocks flagged for analysis before running the actual deletion process in the EU West Region.

“The human checks in this process failed to detect the error and the deletion process was executed,” it said.

Customers whose EBS volume snapshots were deleted will receive a 30-day credit for their EBS usage in the region, it said.

Copyright © iTnews.com.au . All rights reserved.