Failure in power redundancy triggered AWS Sydney outage

Post-mortem of weekend problems.

The outage that hit an Amazon Web Services availability zone in Sydney over the weekend, following massive storms, was caused by a failure in the company's uninterruptible power supply (UPS) setup.

On Sunday afternoon some of Australia's biggest web properties had their systems sent offline following severe storms in the region.

EC2 instances and EBS volumes in one of AWS' availability zones became unreachable, and the power issues created flow-on problems for other services including Elasticsearch, APIs and internal DNS.

In a post-mortem of the event published today, AWS revealed a diesel rotary uninterruptible power supply (DRUPS) had failed to properly switch to its reserve power when the utility power fell over.

A DRUPS stores energy drawn from utility power and dips into this reserve to power the data centre when the main line goes out, while it waits for the generator to start.

But a failure at AWS' utility power provider over the weekend resulted in an "unusually long voltage sag". As a result, a set of breakers that isolate the DRUPS from utility power did not open fast enough, and the DRUPS' reserve energy drained rapidly into the power grid.

The DRUPS then shut down, so the generators, which were still starting up, could not complete the process.
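
As a rough illustration of why breaker timing matters, the race between the draining reserve and the starting generators can be sketched in Python. This is a toy model, not AWS's or any vendor's design: the function name, its parameters and every number below are hypothetical, chosen only to show the effect of a slow breaker.

```python
def reserve_outlasts_generator_start(
    reserve_kj: float,
    load_kw: float,
    backfeed_kw: float,
    breaker_open_s: float,
    generator_start_s: float,
) -> bool:
    """Return True if the stored energy lasts until the generators are online.

    All inputs are illustrative assumptions. Power in kW multiplied by
    time in seconds gives energy in kJ.
    """
    # Phase 1: the breaker is still closed, so the reserve feeds both the
    # data centre load and the sagging utility grid.
    phase1_s = min(breaker_open_s, generator_start_s)
    used_kj = (load_kw + backfeed_kw) * phase1_s

    # Phase 2: the breaker has opened, so the reserve feeds only the load.
    phase2_s = max(generator_start_s - breaker_open_s, 0.0)
    used_kj += load_kw * phase2_s

    return used_kj <= reserve_kj


# Hypothetical figures: the same reserve, load and generator start time,
# differing only in how quickly the breaker isolates the grid.
fast_breaker = reserve_outlasts_generator_start(20_000, 1_000, 5_000, 0.1, 10)
slow_breaker = reserve_outlasts_generator_start(20_000, 1_000, 5_000, 3.0, 10)
print(fast_breaker, slow_breaker)
```

With these made-up figures the fast breaker leaves enough reserve for the generators to take over, while the slow one exhausts it first, which is the kind of failure the post-mortem describes.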

"DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection," AWS said.

Every AWS instance is served by two independent power delivery line-ups that provide access to utility power, UPS, and generator back-ups, AWS said. If either of these lines has power, the instance will remain online.

However, in the weekend's outage, affected instances lost access to both primary and secondary power after several power delivery line-ups failed to transfer load to generators because of the DRUPS glitch, AWS said.
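
The redundancy rule AWS describes, and the way it broke down, can be expressed as a short Python sketch. The `LineUp` class and its fields are hypothetical simplifications for illustration; they are not drawn from any AWS system.

```python
from dataclasses import dataclass


@dataclass
class LineUp:
    """A simplified power delivery line-up (hypothetical model)."""
    utility_ok: bool
    generator_running: bool

    def has_power(self) -> bool:
        # A line-up delivers power from the utility or, failing that,
        # from its generator once the load has been transferred.
        return self.utility_ok or self.generator_running


def instance_online(primary: LineUp, secondary: LineUp) -> bool:
    # An instance stays up if either of its two line-ups has power.
    return primary.has_power() or secondary.has_power()


# Normal utility failure: generators start, so the instance stays online.
print(instance_online(LineUp(False, True), LineUp(False, True)))

# The weekend's scenario: utility is down and the load never transfers
# to generators on either line-up, so the instance loses power.
print(instance_online(LineUp(False, False), LineUp(False, False)))
```

The sketch shows why two line-ups do not help when they share a failure mode: if the same fault stops both from reaching generator power, the "either line" rule has nothing left to fall back on.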

It also revealed that a bug in its instance management software meant instances not recovered by 7pm on Sunday took longer than expected to restore. AWS technicians were forced to manually recover the remaining instances, a process that stretched into Monday morning.

Lost data

AWS indicated a "small number" of Elastic Block Store (EBS) customers may lose data as a result of the outage.

It put the figure at less than 0.01 percent of the instance volumes in this particular availability zone.

Hard drives failed in a "small number" of storage servers during the outage and did not automatically recover, the firm said.

"In cases where both of the replicas were hosted on failed servers, we were unable to automatically restore the volume," AWS said.

"After the initial wave of automated recovery, the EBS team focused on manually recovering as many damaged storage servers as possible. This is a slow process, which is why some volumes took much longer to return to service."
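
The replica condition AWS describes can be sketched in a few lines of Python. The function and the server names are hypothetical; the only detail taken from the post-mortem is that a volume could not be restored automatically when both of its replicas sat on failed servers.

```python
def auto_recoverable(replica_servers: set[str], failed_servers: set[str]) -> bool:
    # A volume can be restored automatically as long as at least one
    # of its replicas lives on a server that did not fail.
    return any(server not in failed_servers for server in replica_servers)


failed = {"storage-a", "storage-b"}  # hypothetical failed servers

# One replica survived on a healthy server: automatic recovery is possible.
print(auto_recoverable({"storage-a", "storage-c"}, failed))

# Both replicas were on failed servers: manual recovery is needed.
print(auto_recoverable({"storage-a", "storage-b"}, failed))
```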

Where to from here

The cloud company pledged to improve its power configuration in the affected facility to ensure future power sags would not bring down its infrastructure.

It said it would add extra breakers so connections to degraded utility power could be broken more quickly, allowing generators time to activate before UPS power runs out.

"Additionally, we will be taking actions to improve our recovery systems," AWS wrote.

"The first is to fix the latent issue that led to our recovery systems not being able to automatically recover a subset of customer instances. That fix is already in testing, and will be deployed over the coming days. We will also be starting a program to regularly test our recovery processes on unoccupied, long-running hosts in our fleet."

Changes will also be made to its APIs to harden them against failure, AWS said, without specifying what those changes would be. It said it expected the changes would be implemented in Sydney next month.

"We apologise for any inconvenience this event caused. We know how critical our services are to our customers’ businesses," the firm wrote.

"We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services."
