Amazon Web Services has blamed the rapid growth of customers using a new feature in its DynamoDB database product for the recent global outage that disrupted services for some of the internet's biggest properties.
Early on Sunday, AWS suffered a failure that cut off access to sites including Netflix, Airbnb, Tinder and IMDb, among others, as well as its own instant video and book services, for a number of users.
At the time, it noted that issues with its DynamoDB database at its US-East data centre in Virginia were the cause.
The company today revealed the massive outage was caused by its overloaded internal metadata service in DynamoDB failing to answer queries from its storage systems within a particular time limit.
DynamoDB stores and maintains tables for customers, which are separated into partitions, each containing a portion of the table’s data. AWS spreads the partitions across many servers, and groups the partitions on a server into a membership.
These memberships are managed by DynamoDB's internal metadata service. The actual table data is held on storage servers, which every so often need to confirm they have the correct membership - at which point they query the metadata service to check their membership records are accurate.
In the early hours of Sunday, AWS said the metadata service was not responding to the storage servers within the specified timeframe, causing the servers to stop handling requests from customers.
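The behaviour AWS describes - storage servers taking themselves out of service when a membership check cannot be confirmed within a deadline - can be sketched roughly as follows. This is a simplified illustration only; the class, function and threshold names are hypothetical and not AWS code.

```python
import time

MEMBERSHIP_TIMEOUT = 2.0  # hypothetical deadline for a metadata reply, in seconds


class StorageServer:
    """Toy model of a storage server that must periodically confirm
    its partition membership with a metadata service."""

    def __init__(self, metadata_service):
        self.metadata = metadata_service
        self.serving = True  # whether customer requests are accepted

    def confirm_membership(self):
        start = time.monotonic()
        reply = self.metadata.get_membership()  # may be slow or fail under load
        elapsed = time.monotonic() - start
        if reply is None or elapsed > MEMBERSHIP_TIMEOUT:
            # Membership could not be confirmed in time: stop serving
            # customer requests rather than risk serving stale data.
            self.serving = False
        else:
            self.serving = True

    def handle_request(self, key):
        if not self.serving:
            raise RuntimeError("unavailable: membership unconfirmed")
        return f"value-for-{key}"
```

In this model, a slow or unresponsive metadata service flips every dependent storage server into the non-serving state at once, which matches the cascading unavailability the article describes.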
It blamed the delay on AWS customers using Global Secondary Indexes in their databases, which increase the size of the partition membership information for a table - meaning the DynamoDB metadata systems struggled to take in all the extra data.
With too many large requests hitting the metadata service at the same time, response times blew past the allowed limit, and the storage systems stopped handling customer requests for data.
The strain on the metadata service also meant AWS engineers were unable to add capacity to the service, as it locked them out of sending administrative commands.
They decided to pause requests to the metadata service so they could make changes to the service and relieve the load.
Once the service was able to respond to the engineers' administrative commands, they could add capacity, reactivate requests to the metadata service and allow the storage servers to start taking customer requests again.
Several hours later, AWS said, DynamoDB was mostly back in operation.
The engineers said that in order to avoid a similar crash recurring, they had significantly increased the capacity of the metadata service and were implementing stricter monitoring of performance dimensions such as membership size.
They are also reducing the rate at which storage servers request membership data, and lengthening the time allowed to process queries, the engineers wrote.
"Finally and longer term, we are segmenting the DynamoDB service so that it will have many instances of the metadata service each serving only portions of the storage server fleet. This will further contain the impact of software, performance/capacity, or infrastructure failures," they said.
"We apologise for the impact to affected customers.
"We know how critical this service is to customers, both because many use it for mission-critical operations and because AWS services also rely on it. For us, availability is the most important feature of DynamoDB, and we will do everything we can to learn from the event and to avoid a recurrence in the future."
However, the company again today reported increased API error rates for the DynamoDB service in the US-East data centre.
It did not detail the cause of the problem, but said it was addressing it by rolling out the remaining mitigations it had developed in response to the errors encountered on Sunday.
The problems lasted for several hours and appear to have been resolved.
"The mitigations deployed up to this point have stabilised the service and we will continue to rollout additional mitigations as described in the summary of events from earlier in the week," AWS said.