When Gmail and Google Drive browned out last week, it drew a lot of attention to a big Google SNAFU. But what went unremarked was that Google has actually had a horror fortnight, with errors aplenty across multiple services.
The not-very-much-fun kicked off on March 5th with an incident that caused virtual machine connectivity issues across the northern hemisphere.
Then came the cloud storage outage, all four hours of it.
Google has published root cause analyses for a few of the outages, and they reveal that the incidents were mostly the company’s own fault.
The cloud storage outage was caused by “a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually lead [sic] to a cascading failure.”
The Cloud Console crash was explained as “a code change in the most recent release of the quota system introduced a bug, causing a fallback to significantly smaller, default quota limits, resulting in user requests being denied.”
The App Engine problem looks to have been caused by the storage incident.
Google publishes verbose incident reports, unlike some of its rivals, and is brave to do so given that its market share trails those of AWS and Microsoft. But perhaps its willingness to be so frank about its failures shows why it’s not as well regarded by buyers.