Amazon offers explanation for recent outages

 

Elastic load balancer bug caused flood of requests.

Failed generators and a lengthy server reboot process exacerbated outages at Amazon Web Services over the weekend.

A detailed explanation of the outage released by the cloud provider this week attempted to minimise the event — saying a mere seven percent of virtual machine instances in the Virginia availability region were affected by the incident. However, the company conceded the recovery time for those affected was extended by a bottleneck in its server reboot process.

The outage on June 29, caused primarily by a huge thunderstorm passing through Virginia, saw several large internet sites including Netflix, Instagram and Reddit fail, and led to widespread criticism of Amazon’s hosting service.

While backup equipment at most of the ten data centres in Amazon’s US East-1 region kicked in as designed — allowing the facility to cope with mains grid power fluctuations caused by the storm — the generators at one site failed to provide stable voltage.

As the generators failed, battery-backed uninterruptible power supplies (UPSs) took over at the data centre, keeping servers running. However, a second power outage roughly half an hour later meant that the already depleted UPSs started to drop off, and servers began shutting down.

Power was restored 25 minutes after the second outage, but Elastic Compute Cloud (EC2) instances and Elastic Block Store (EBS) volumes in the affected data centre were unavailable to customers.

For an hour, customers were unable to launch new EC2 instances or create EBS volumes as the control planes for these two resources keeled over during the power failure.

The Relational Database Service (RDS) was also hit by a bug that meant some multi-availability zone instances did not complete the failover process.

EBS volumes that were in use were hobbled when brought back online, and all activity on them paused so customers could verify data consistency.

It took several hours to recover some EBS volumes completely, a process Amazon said it was working to improve.

At the same time, Amazon’s Elastic Load Balancers (ELBs) encountered a previously unknown bug that caused a flood of requests and created a backlog in other AWS zones not affected by the storm.

Amazon said it would break load balancer processing into multiple queues in future, both to avoid a similar incident and to allow faster processing of time-sensitive actions.
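As a rough illustration of why separate queues help, the following Python sketch keeps urgent and routine requests in their own queues so a backlog of routine work cannot delay a time-sensitive action. It is not Amazon's implementation: the class name, the queue names and the request strings are all hypothetical.

```python
from collections import deque


class ControlPlaneQueues:
    """Hypothetical dispatcher with one FIFO queue per class of work."""

    def __init__(self):
        # Queue names are illustrative assumptions, not AWS terminology.
        self.queues = {"urgent": deque(), "routine": deque()}

    def submit(self, kind, request):
        # Each request joins the queue for its own class of work.
        self.queues[kind].append(request)

    def next_request(self):
        # Serve urgent work first so a routine backlog cannot hold it up.
        for kind in ("urgent", "routine"):
            if self.queues[kind]:
                return self.queues[kind].popleft()
        return None


queues = ControlPlaneQueues()
for i in range(3):
    queues.submit("routine", f"scale-update-{i}")  # backlog of routine work
queues.submit("urgent", "shift-traffic-away")      # arrives last

first = queues.next_request()
print(first)
```

With a single shared queue, the urgent request would have waited behind all three routine updates; here it is served first.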

It also intends to develop a backup DNS reweighting system that will quickly shift all load balancer traffic away from an availability zone in trouble.
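The idea behind DNS reweighting can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of Amazon's system: the function name, the zone names and the weights are hypothetical, and real DNS weighting involves record TTLs and health checks that are left out here.

```python
def reweight(weights, unhealthy_zone):
    """Return new DNS weights with the unhealthy zone set to zero.

    The remaining zones keep their relative proportions and sum to 1.0.
    """
    remaining = {z: w for z, w in weights.items() if z != unhealthy_zone}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no healthy zone left to receive traffic")
    new_weights = {z: w / total for z, w in remaining.items()}
    new_weights[unhealthy_zone] = 0.0  # no traffic goes to the troubled zone
    return new_weights


# Hypothetical zones and starting weights, for illustration only.
zone_weights = {"zone-a": 0.25, "zone-b": 0.25, "zone-c": 0.5}
shifted = reweight(zone_weights, "zone-a")
print(shifted)
```

In this example the share previously sent to `zone-a` is spread across `zone-b` and `zone-c` in proportion to their existing weights.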

Copyright © iTnews.com.au. All rights reserved.

