Failure in power redundancy triggered AWS Sydney outage

By Allie Coyne

Jun 9 2016 2:23PM

Post-mortem of weekend problems.

The weekend outage at an Amazon Web Services Sydney availability zone, which followed massive storms, was caused by a failure in the company's uninterruptible power supply (UPS) setup.

On Sunday afternoon some of Australia's biggest web properties had their systems sent offline following severe storms in the region.

EC2 instances and EBS volumes in one of AWS' availability zones became unreachable, and the power issues created flow-on problems for other services including Elasticsearch, APIs and internal DNS.

In a post-mortem of the event published today, AWS revealed a diesel rotary uninterruptible power supply (DRUPS) had failed to properly switch to its reserve power when the utility power fell over.

A DRUPS stores energy drawn from utility power and dips into this reserve to power the data centre when the main line goes out, while it waits for the generator to start.

But a failure at AWS' utility power provider over the weekend resulted in an "unusually long voltage sag", meaning a set of breakers that isolate the DRUPS from utility power didn't open fast enough and the DRUPS' reserve energy drained rapidly into the power grid.

The DRUPS then shut down, meaning the generators, which were in the process of starting up, were unable to complete the process.
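
The failure sequence comes down to an energy budget: the reserve has to last until the generator is running, and it drains far faster while the isolating breakers are still closed. The sketch below is a rough illustration only. The function name and every figure in it are hypothetical, as AWS did not publish ratings or timings.

```python
def generator_starts(stored_energy_kj, load_kw, grid_backfeed_kw,
                     breaker_open_s, generator_start_s):
    """Return True if the reserve energy lasts until the generator is up.

    All parameters are hypothetical; the post-mortem gives no real figures.
    Power in kW multiplied by time in seconds gives energy in kJ.
    """
    # While the isolating breaker is still closed, the reserve feeds both
    # the data centre load and the sagging utility grid.
    t_closed = min(breaker_open_s, generator_start_s)
    drained = (load_kw + grid_backfeed_kw) * t_closed
    # Once the breaker has opened, only the data centre load draws on it.
    t_open = generator_start_s - t_closed
    drained += load_kw * t_open
    return drained <= stored_energy_kj


# Breaker opens quickly: the reserve survives until the generator is running.
fast = generator_starts(20000, 1000, 5000, breaker_open_s=0.1, generator_start_s=10)
# Breaker opens slowly: the reserve drains into the grid before the generator is up.
slow = generator_starts(20000, 1000, 5000, breaker_open_s=4, generator_start_s=10)
print(fast, slow)
```

With these made-up numbers, a few extra seconds of breaker delay is enough to exhaust the reserve, which is the pattern AWS described.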

"DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection," AWS said.

Every AWS instance is served by two independent power delivery line-ups that provide access to utility power, UPS, and generator back-ups, AWS said. If either of these line-ups has power, the instance will remain online.

However, in the weekend's outage, the affected instances lost access to both primary and secondary power after several power delivery line-ups failed to transfer load to generators because of the DRUPS glitch, AWS said.
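
The redundancy rule AWS described, and the way it broke down, can be illustrated with a minimal sketch. The `LineUp` class and its fields are hypothetical simplifications for illustration, not AWS's actual design.

```python
from dataclasses import dataclass


@dataclass
class LineUp:
    """One power delivery line-up (hypothetical, simplified model)."""
    utility_ok: bool
    generator_running: bool

    def has_power(self) -> bool:
        # A line-up can deliver power from the utility feed or, once it has
        # started, from its generator.
        return self.utility_ok or self.generator_running


def instance_online(primary: LineUp, secondary: LineUp) -> bool:
    # The instance stays up as long as either line-up still has power.
    return primary.has_power() or secondary.has_power()


# Utility power is lost but one generator starts: the instance stays online.
one_generator = instance_online(LineUp(False, False), LineUp(False, True))
# Utility power is lost and neither generator starts: the instance goes down.
no_generator = instance_online(LineUp(False, False), LineUp(False, False))
print(one_generator, no_generator)
```

Two line-ups only protect an instance if they fail independently. In this outage the same DRUPS problem hit several line-ups at once.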

It also revealed a bug in its instance management software meant instances that weren't recovered by 7pm on Sunday experienced slower than expected restoration times. AWS technicians were forced to manually recover the remaining instances, which stretched into Monday morning.

Lost data

AWS indicated a "small number" of Elastic Block Store (EBS) customers may lose data as a result of the outage.

It put the figure at less than 0.01 percent of the instance volumes in this particular availability zone.

Hard drives failed in a "small number" of storage servers during the outage and did not automatically recover, the firm said.

"In cases where both of the replicas were hosted on failed servers, we were unable to automatically restore the volume," AWS said.

"After the initial wave of automated recovery, the EBS team focused on manually recovering as many damaged storage servers as possible. This is a slow process, which is why some volumes took much longer to return to service."

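The recovery condition AWS described for EBS volumes can be sketched as a simple check: a volume can be restored automatically only if at least one of its replicas sits on a server that did not fail. The function name and server IDs below are hypothetical and do not reflect EBS internals.

```python
def can_auto_restore(replica_servers, failed_servers):
    """Return True if at least one replica is on a server that did not fail."""
    return any(server not in failed_servers for server in replica_servers)


# Hypothetical server IDs, for illustration only.
failed = {"server-3", "server-7"}

# One replica survives on a healthy server: automatic recovery is possible.
one_healthy = can_auto_restore(["server-1", "server-3"], failed)
# Both replicas sit on failed servers: manual recovery is needed.
none_healthy = can_auto_restore(["server-3", "server-7"], failed)
print(one_healthy, none_healthy)
```
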
Where to from here

The cloud company pledged to improve its power configuration in the affected facility to ensure future power sags would not bring down its infrastructure.

It said it would add extra breakers so connections to degraded utility power could be broken more quickly, allowing generators time to activate before UPS power runs out.

"Additionally, we will be taking actions to improve our recovery systems," AWS wrote.

"The first is to fix the latent issue that led to our recovery systems not being able to automatically recover a subset of customer instances. That fix is already in testing, and will be deployed over the coming days. We will also be starting a program to regularly test our recovery processes on unoccupied, long-running hosts in our fleet."

Changes will also be made to its APIs to harden them against failure, AWS said, without giving specifics. It said it expected the changes to be implemented in Sydney next month.

"We apologise for any inconvenience this event caused. We know how critical our services are to our customers’ businesses," the firm wrote.

"We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services. "