Coding typo caused massive AWS outage

By Allie Coyne

Mar 3 2017 8:37AM

Dev botched input during maintenance.

The large-scale outage that hit Amazon Web Services customers this week was caused by a staffer entering an incorrect input during maintenance that resulted in the removal of a large number of servers.

In its post-mortem of the incident published today, AWS revealed the bungle occurred during debugging of a problem in its S3 billing system.

An S3 team member was attempting to execute a command that would remove a small set of servers for one of the S3 subsystems used by the billing system.

Instead, the input was entered incorrectly, causing more servers than intended to be removed.

"The servers that were inadvertently removed supported two other S3 subsystems," AWS said.

One of the two affected subsystems, the index subsystem, manages the metadata and location information for all S3 objects in the Virginia data centre region.

"This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests," AWS said.

"The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects."

As a result AWS was forced to do a full restart of the affected systems, during which time S3 was unavailable.

Other services in the region that rely on S3 for storage - like the S3 dashboard, new EC2 instances, EBS volumes, and Lambda - were also impacted.

The problem was exacerbated by the fact that AWS had not done a full restart of the index and placement subsystems in its larger regions for many years.

It meant that even though the subsystems are designed to keep working with minimal customer impact when capacity fails, the restart process took more time than it should.

"S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."

AWS said it has modified its tooling so that it no longer allows "too much capacity to be removed too quickly".

Capacity can now only be removed slowly, and safeguards have been added to "prevent capacity from being removed when it will take any subsystem below its minimum required capacity level". AWS is similarly going through other operational tools to make sure the same safety checks are in place.
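
To illustrate the kind of safeguard AWS describes, the sketch below rejects a removal request that is either too large for a single step or would leave a subsystem below its minimum capacity. It is not AWS's tooling: the names Subsystem, check_removal and remove_capacity, the max_per_step limit and all server counts are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum required capacity level


def check_removal(subsystems, removals, max_per_step):
    """Validate a request mapping subsystem name -> servers to remove."""
    by_name = {s.name: s for s in subsystems}
    for name, count in removals.items():
        if name not in by_name:
            raise CapacityRemovalError(f"unknown subsystem: {name}")
        if count < 0:
            raise CapacityRemovalError(f"{name}: negative removal count")
        # Safeguard 1: capacity may only be removed slowly.
        if count > max_per_step:
            raise CapacityRemovalError(
                f"{name}: {count} servers exceeds the per-step limit of {max_per_step}"
            )
        # Safeguard 2: never drop below the minimum required capacity.
        remaining = by_name[name].active_servers - count
        if remaining < by_name[name].min_required:
            raise CapacityRemovalError(
                f"{name}: {remaining} servers would remain, "
                f"minimum is {by_name[name].min_required}"
            )


def remove_capacity(subsystems, removals, max_per_step=2):
    """Apply the removals only if every subsystem passes both checks."""
    check_removal(subsystems, removals, max_per_step)
    for s in subsystems:
        s.active_servers -= removals.get(s.name, 0)
```

Because the whole request is validated before any server count changes, a mistyped input that touches several subsystems fails as a unit rather than being partly applied.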

It apologised for the effect the outage had on its customers.

"While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses," the company wrote.

"We will do everything we can to learn from this event and use it to improve our availability even further."