Mistakes by engineers and gaps in the enforcement of deployment policies were behind the worldwide outage of Microsoft's Azure cloud platform in November this year, according to a detailed mea culpa analysis by the software giant.
The outage saw customers in multiple regions being unable to connect to several services such as Azure storage, virtual machines, websites, Active Directory and the management portal for several hours.
A final root cause analysis (RCA) published by Azure team member Jason Zander said the intention was to deploy a software change to improve performance and reduce processor utilisation of the storage table front-ends system.
Initial testing showed the fix did indeed bump up performance. But when the software change was deployed in the Azure production environment, however, things went wrong in two areas.
An unnamed engineer assumed that because the fix had already been "flighted" on a portion of the Azure production infrastructure, to enable it across the rest of the cloud platform would be low risk.
Microsoft's standard policy is to incrementally deploy changes across small slices of the production environment, but the configuration tooling did not adequately enforce this. The company will from now on enforce that policy in the deployment platform itself.
A second mistake led to the software change being wrongly enabled on Azure Blob (binary large object) storage front-ends when it had only been tested against table storage front-ends.
This exposed a bug that caused some Blob storage front-ends being stuck in infinite loops, and ceasing to respond to requests, Zander wrote.
After the software change had been rolled back, some virtual machines on Azure required manual recovery. This was due to disk mount time-out errors during boot up, in some cases caused by high load on the storage service during the recovery phase.
Other Windows VMs provisioned and created when the storage service interruption took place failed in setup. Furthermore, a network programming error led to a small percentage of VMs being inaccessible for remote management through the public internet protocol address.
The company has deployed fixes on Azure to prevent the VM service from being interrupted in this manner in the future.
Microsoft also chastised itself for poor communications during the outage, saying there were delays to displaying and wrong information presented on the Azure service health dashboard as well as slow response from the company's official support.
Channels of communication such as tweets by the @Azure account and the Azure blogs were also insufficient, leaving customers with not enough information during the interruptoion, Zander wrote.