Google sorry for widespread Docs outage

Real time bug drains Google memory.

Google has attributed an hour-long outage of its Docs service last Wednesday to a service upgrade designed to improve real-time collaboration.

“We feel your pain and are very sorry,” Alan Warren, Google’s engineering director, said in a blog post on Friday, explaining why the “majority” of Docs customers were unable to access document lists, documents, drawings and Apps Scripts between 2:02PM and 3:18PM Pacific Daylight Time on Wednesday 7 September.

While the outage officially only lasted an hour, according to Google's Apps Status Dashboard, users began reporting problems late Tuesday evening.

No Docs data was lost in the incident, according to Google’s incident report [PDF], although some edits made immediately prior to the outage may not have been saved.

Google's attempt to improve collaboration features of Docs lists exposed a memory management bug that affected the “look up” machines used to monitor and execute modifications to a Google Doc. 

The update “placed additional load on the service that manages the distribution of Docs processing” but the bug “accelerated and compounded” the load. 

“[T]he lookup machines didn’t recycle their memory properly after each lookup, causing them to eventually run out of memory and restart,” said Warren. 

The bug’s impact, measured by the rate at which Google’s servers failed to look up documents, escalated “sharply” within a minute of Google’s monitoring systems picking up the fault.

“The engineering teams diagnosed the problem, determined that it was correlated with the feature change, and started rolling it back 23 minutes after the first alert. In parallel, we doubled the capacity of the lookup service to mitigate the impact of the memory management bug,” said Warren. 

The scale of Google's outage was overshadowed by yet another outage of Microsoft's Office 365 and Hotmail last Friday, believed to have been caused by a power failure in Mexico.

Copyright © iTnews.com.au . All rights reserved.