Google has attributed an hour-long outage of its Docs service last Wednesday to a service upgrade designed to improve real-time collaboration.
"We feel your pain and are very sorry," Alan Warren, Google’s engineering director, advised in a blog post Friday, explaining why the “majority” of Docs customers were unable to access document lists, documents, drawings and Apps Scripts between 2:02PM to 3:18PM Pacific Daylight Time on Wednesday 7 September.
While the outage officially only lasted an hour, according to Google's Apps Status Dashboard, users began reporting problems late Tuesday evening.
No Docs data was lost in the incident, according to Google’s incident report [PDF], although some edits made immediately before the outage may not have been saved.
Google's attempt to improve collaboration features of Docs lists exposed a memory management bug that affected the “look up” machines used to monitor and execute modifications to a Google Doc.
The update “placed additional load on the service that manages the distribution of Docs processing”, and the bug “accelerated and compounded” that load.
“[T]he lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart,” said Warren.
The bug’s impact, measured by the rate at which Google’s servers failed to look up documents, escalated “sharply” within a minute of the company’s monitoring systems picking up the fault.
“The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, we doubled the capacity of the lookup service to mitigate the impact of the memory management bug,” said Warren.
The scale of Google's outage was overshadowed by yet another outage of Microsoft's Office 365 and Hotmail services last Friday, believed to have been caused by a power failure in Mexico.