The Amazon Web Services outage in northern Virginia was caused by a software bug in an automated DNS management system that led one automated component to delete another’s work.
The cloud provider published an extensive post-incident report late on Friday, Australian time, shedding light on a disruption widely described as the biggest to hit internet infrastructure in more than a year.
The post-incident report notes that “there were three distinct periods of impact to customer applications”, though the initial problems with DynamoDB are likely to be of most interest.
The official root cause is attributed to “a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that … automation failed to repair.”
The race condition, a type of software bug, involved “an unlikely interaction” between two automated components of the same type in the DynamoDB DNS management architecture.
AWS said there are two distinct components in the architecture: a “DNS Planner, [which] … periodically creates a new DNS plan for each of the service’s endpoints”, and DNS Enactors that “pick up the latest plan” and systematically apply it to the endpoints.
“This process typically completes rapidly and does an effective job of keeping DNS state freshly updated,” AWS said.
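In outline, this is a planner/worker split: one component computes the desired DNS state and others apply it. The sketch below is purely illustrative, using invented names such as DnsPlan, DnsPlanner and DnsEnactor rather than anything from AWS’s actual implementation, but it captures the shape of the architecture the report describes.

    # Illustrative sketch only; names and structures are assumptions,
    # not AWS's actual code.
    from dataclasses import dataclass

    @dataclass
    class DnsPlan:
        generation: int                   # monotonically increasing plan version
        records: dict                     # endpoint name -> list of IP addresses

    class DnsPlanner:
        """Periodically produces a new DNS plan for each service endpoint."""
        def __init__(self):
            self.generation = 0

        def create_plan(self, records: dict) -> DnsPlan:
            self.generation += 1
            return DnsPlan(self.generation, records)

    class DnsEnactor:
        """Picks up the latest plan and applies it to DNS, endpoint by endpoint."""
        def __init__(self, dns_state: dict):
            self.dns_state = dns_state    # shared DNS state all Enactors write to

        def apply(self, plan: DnsPlan):
            for endpoint, ips in plan.records.items():
                self.dns_state[endpoint] = {"generation": plan.generation, "ips": ips}

In the happy path, each Enactor applies an up-to-date plan and the shared DNS state converges on the newest one, which is why AWS says the process normally keeps DNS state “freshly updated”.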
AWS said the work of different DNS Enactors can sometimes overlap, usually without problems.
But, in this instance, one DNS Enactor “experienced unusually high delays, needing to retry its update on several of the DNS endpoints” while another Enactor picked up a newer plan and “rapidly” applied it to endpoints.
“The timing of these events triggered the latent race condition,” AWS said.
“When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked [a] clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them,” AWS said.
“At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DynamoDB endpoint, overwriting the newer plan.
“The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
“As this plan was deleted, all IP addresses for the regional endpoint were immediately removed”.
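Laid out as a timeline, the failure mode is easier to see. The snippet below is a simplified reconstruction based only on the wording of the report; the data model, generation numbers and IP addresses are invented for illustration, but the sequence of events mirrors AWS’s description: a delayed Enactor overwrites a newer plan with a stale one, and the other Enactor’s clean-up then deletes that stale plan, leaving the endpoint with no addresses.

    # Simplified reconstruction; data model, thresholds and addresses are invented.
    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

    dns_state = {}     # endpoint -> {"generation": ..., "ips": [...]}
    plan_store = {}    # plan generation -> records kept around until cleaned up

    def enactor_apply(generation, ips):
        """An Enactor applying its plan; the last writer wins."""
        plan_store[generation] = ips
        dns_state[ENDPOINT] = {"generation": generation, "ips": ips}

    def enactor_cleanup(just_applied, stale_threshold=5):
        """Delete plans significantly older than the one just applied."""
        for gen in list(plan_store):
            if gen <= just_applied - stale_threshold:
                del plan_store[gen]
                # If that stale plan is what DNS is currently serving,
                # deleting it empties the endpoint's record set.
                if dns_state.get(ENDPOINT, {}).get("generation") == gen:
                    dns_state[ENDPOINT]["ips"] = []

    # 1. A faster Enactor applies the newest plan (generation 20)...
    enactor_apply(20, ["203.0.113.10", "203.0.113.11"])
    # 2. ...then the delayed Enactor finally finishes, overwriting it
    #    with its much older plan (generation 10).
    enactor_apply(10, ["198.51.100.5"])
    # 3. The faster Enactor's clean-up now deletes generation 10 as too old,
    #    even though it is the plan actually serving the endpoint.
    enactor_cleanup(just_applied=20)

    print(dns_state[ENDPOINT])   # {'generation': 10, 'ips': []} -- empty record

With the active plan deleted, the regional endpoint resolved to nothing, which is consistent with AWS’s statement that “all IP addresses for the regional endpoint were immediately removed”.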
AWS said that “manual operator intervention” was ultimately required to mitigate the incident.
As an immediate step, AWS said it has disabled both the “DNS Planner and the DNS Enactor automation worldwide”.
“In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans,” the cloud provider said.
The problems with DynamoDB in US-EAST-1 led to disruptions of other AWS cloud services that depend on it.
Problems with EC2 instances arose because a subsystem that depends on DynamoDB to function was unable to reach the service, and that failure had flow-on impacts.
Other AWS services that depend on DynamoDB also experienced issues during the incident.
