Microsoft has offered some of its cloud customers a one-month, 33 percent discount in apology for a 34-hour outage to Windows Azure on February 29.
The software giant failed to account for the leap day in code it used to create encryption certificates for the cloud service, causing systems to fail from midnight, Greenwich Mean Time (11am AEDT).
The one-year transfer certificates, which allow customers' virtual applications to communicate with the Windows Azure platform, failed when the system rejected new certificates with an expiry date set for February 29, 2013 as “invalid”.
As a result, all Azure Compute, Access Control, Service Bus and Caching customers would receive a 33 percent discount for the next billing month, regardless of whether their service was affected.
“Microsoft recognizes that this outage had a significant impact on many of our customers,” Microsoft’s corporate vice president for cloud, Bill Laing, explained in a blog post on Friday.
“We will continue to spend time to fully understand all of the issues outlined above and over the coming days and weeks we will take steps to address and mitigate the issues to improve our service.”
What went wrong
Microsoft’s systems assumed a hardware issue was at fault when the certificate creation process failed three times on February 29, prompting the transfer of workloads from the failed server to other servers.
The bug worked its way through Windows Azure clusters until engineers identified the bug, more than two-and-a-half hours later.
By 12.23pm AEDT on March 1, Microsoft had applied a software fix to most of its server clusters. But the fix failed to repair seven clusters, resulting in a “secondary outage” that lasted until 9.15am the following day.
The issue affected customers of Windows Azure Compute, Access Control Service, Windows Azure Service Bus, SQL Azure Portal and Data Sync Services.
Windows Azure Storage and SQL Azure were not affected, Laing wrote.