Google has attributed its network problems on Monday morning to a config change that was “incorrectly applied” to too many servers in multiple regions.
The problems were most keenly felt on the US east coast but began to spread to operations worldwide, including in Australia, where some users had problems accessing Gmail and Google Assistant.
While a thorough post-mortem is still being conducted by Google’s engineering teams, the company provided a preliminary explanation in a blog post.
“In essence, the root cause of [the] disruption was a configuration change that was intended for a small number of servers in a single region,” Google said.
“The configuration was incorrectly applied to a larger number of servers across several neighbouring regions, and it caused those regions to stop using more than half of their available network capacity.
“The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not.”
Recognising large-scale congestion, Google said its network then performed as intended, “correctly triaging the traffic overload and dropping larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows”.
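Google has not published the mechanism behind that triage, so the following is only a toy sketch of the behaviour it describes: when capacity shrinks, small latency-sensitive flows are admitted first and large, less latency-sensitive flows are dropped. The `Flow` class, the `triage` function, the flow names and the capacity figures are all hypothetical and chosen for illustration; they are not taken from Google's systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str
    size: float              # bandwidth demand, arbitrary units (hypothetical)
    latency_sensitive: bool


def triage(flows, capacity):
    """Admit flows in priority order until capacity runs out.

    Returns a tuple (admitted, dropped). Illustrative only; not Google's algorithm.
    """
    # Latency-sensitive flows sort first; within each group, smaller flows first.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.size))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.size <= capacity:
            admitted.append(flow)
            used += flow.size
        else:
            dropped.append(flow)
    return admitted, dropped


# Hypothetical traffic mix: total demand of 94 units fits in 100 units of capacity.
flows = [
    Flow("interactive-request", 1, True),
    Flow("video-call", 3, True),
    Flow("bulk-storage-sync", 40, False),
    Flow("batch-backup", 50, False),
]

full_capacity = 100
# Assumption: 40 percent of capacity remains, i.e. "more than half" was lost.
reduced_capacity = full_capacity * 0.4

admitted, dropped = triage(flows, reduced_capacity)
print("admitted:", [f.name for f in admitted])
print("dropped:", [f.name for f in dropped])
```

With these made-up numbers, everything fits at full capacity, but at the reduced capacity only the two small latency-sensitive flows are admitted and both bulk flows are dropped.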
Google said it detected the network problems “within seconds” but found it hard to correct the problem quickly due to the network congestion.
“The same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage,” the cloud giant said.
“The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelise restoration efforts.”
Google suggested that, overall, the effect of the problems was unevenly felt, but that it was still motivated to make its systems better.
“YouTube measured a 2.5 percent drop of views for one hour, while Google Cloud Storage measured a 30 percent reduction in traffic,” it said.
“Approximately one percent of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email.
“When we fall short, as happened [Monday Australian time], it motivates us to learn as much as we can, and to make Google’s services even better, even faster, and even more reliable.”
A preliminary internal post-mortem was already circulating within Google, though employees of the company suggested it offered neither more detail nor a more succinct explanation of the outage than today’s blog post.