Human error caused Microsoft Azure outage

By Juha Saarinen

Dec 18 2014 6:11AM

Post-mortem analysis identifies gaps in deployment processes.

Mistakes by engineers and gaps in the enforcement of deployment policies were behind the worldwide outage of Microsoft's Azure cloud platform in November this year, according to a detailed mea culpa analysis by the software giant.

The outage left customers in multiple regions unable to connect for several hours to services including Azure storage, virtual machines, websites, Active Directory and the management portal.

A final root cause analysis (RCA) published by Azure team member Jason Zander said the software change was intended to improve performance and reduce processor utilisation of the table storage front-ends.

Initial testing showed the fix did indeed improve performance. When the software change was deployed in the Azure production environment, however, things went wrong in two areas.

An unnamed engineer assumed that because the fix had already been "flighted" on a portion of the Azure production infrastructure, enabling it across the rest of the cloud platform would be low risk.

Microsoft's standard policy is to incrementally deploy changes across small slices of the production environment, but the configuration tooling did not adequately enforce this. The company will from now on enforce that policy in the deployment platform itself.

A second mistake led to the software change being wrongly enabled on Azure Blob (binary large object) storage front-ends when it had only been tested against table storage front-ends.
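
The two safeguards described above, incremental slices and enabling a change only where it was tested, are the kind of rule a deployment tool can check before anything ships. The following is a minimal illustrative sketch, not Microsoft's tooling: the names `RolloutStage`, `PolicyViolation` and `validate_plan`, and the 25-point step limit, are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStage:
    """One step of a deployment plan (hypothetical structure)."""
    service: str   # front-end type, e.g. "table" or "blob"
    percent: int   # share of production covered after this stage


class PolicyViolation(Exception):
    """Raised when a plan breaks the staged-rollout policy."""


def validate_plan(stages, tested_services, max_step=25):
    """Reject plans that skip incremental slices or touch untested services.

    max_step is an assumed cap, in percentage points, on how much of
    production a single stage may add. The article gives no real figure.
    """
    covered = {}  # service -> percent already covered by earlier stages
    for stage in stages:
        if stage.service not in tested_services:
            raise PolicyViolation(
                f"change was never tested against {stage.service!r} front-ends"
            )
        previous = covered.get(stage.service, 0)
        if stage.percent <= previous:
            raise PolicyViolation("each stage must widen the rollout")
        if stage.percent - previous > max_step:
            raise PolicyViolation(
                f"stage jumps from {previous}% to {stage.percent}% in one step"
            )
        covered[stage.service] = stage.percent
    return True


# A gradual plan for table storage passes the check.
gradual = [RolloutStage("table", p) for p in (5, 25, 50, 75, 100)]
validate_plan(gradual, tested_services={"table"})
```

Under these assumptions, a plan that goes straight from a 5 percent flight to 100 percent, or one that includes a `"blob"` stage when only `"table"` was tested, raises `PolicyViolation` instead of being deployed.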

This exposed a bug that caused some Blob storage front-ends to get stuck in infinite loops and stop responding to requests, Zander wrote.

After the software change had been rolled back, some virtual machines on Azure required manual recovery. This was due to disk mount time-out errors during boot up, in some cases caused by high load on the storage service during the recovery phase.

Other Windows VMs that were provisioned and created during the storage service interruption failed during setup. Furthermore, a network programming error left a small percentage of VMs inaccessible for remote management through their public internet protocol (IP) addresses.

The company has deployed fixes on Azure to prevent the VM service from being interrupted in this manner in the future.

Microsoft also chastised itself for poor communications during the outage, saying the Azure service health dashboard was slow to update and displayed incorrect information, and that the company's official support was slow to respond.

Other channels of communication, such as tweets by the @Azure account and the Azure blogs, were also insufficient, leaving customers without enough information during the interruption, Zander wrote.
