A lack of detail on the root cause of Friday’s Microsoft Office365 outage has even the strongest advocates of cloud computing concerned the vendor isn’t up to the task of securing online services.
The outage, which Microsoft claims affected customers for only around four hours, took out Office365, Hotmail and SkyDrive services globally.
Microsoft has had four days to provide a post-incident report, but has offered only the briefest of statements to explain what went wrong.
“On Thursday, September 8th at approximately 8 p.m. PDT, Microsoft became aware of a Domain Name Service (DNS) problem causing service degradation for multiple cloud-based services.
A tool that helps balance network traffic was being updated, and for a currently unknown reason, the update did not work correctly. As a result, the configuration was corrupted, which caused service disruption.
Service restoration began at approximately 10:30 p.m. PDT, with full service restoration completed at approximately 11:30 p.m. PDT. We are continuing to review the incident.”
Microsoft’s statement is nowhere near as detailed as the post-incident report Amazon Web Services published after it suffered an outage in April.
Missing is information on why global services were affected – despite Microsoft’s promise of regional availability zones – and what steps it will take to ensure the incident is not repeated.
IT engineers discussing the outage with iTnews said it is perfectly feasible that Microsoft technicians did indeed break the load distribution system at a central location, from where the service is distributed globally.
But this doesn't explain why Microsoft's first response was to attribute the outage to a power failure in a post that was pulled within an hour.
In the vacuum of information around the outage, one hacking group has been in contact with SC Magazine Australia claiming responsibility for deleting Microsoft’s DNS records. The group is yet to provide the publication any concrete evidence (such as logs) of its involvement.
Microsoft MVP Wayne Small, owner of small business server resource SBSFAQ.com, said it was nonetheless of great concern that Microsoft’s own DNS (Domain Name System) records – an essential element of its online services – could have been corrupted or deleted.
“DNS is the root of the internet – we rely on it to resolve domain names to IP addresses,” Small said. “It is an intrinsic part of the design of DNS that it should still work if a single server goes down.
“It could be that, as Microsoft says, an update corrupted these DNS records. But it could just as well be some mischievous attacker deleting them.
“If somebody out there is able to kill DNS records, we better watch out. I would prefer to think Microsoft screwed up when updating their tool.”
Justin Warren, managing director at PivotNine, said it was hard to read much into the outage without an intimate knowledge of Microsoft’s architecture.
“Perhaps Microsoft’s infrastructure is not as distributed as it should be,” he said.
But Warren questioned why a hacking group would attack DNS when a DDoS attack on the service itself would be so much easier and equally effective.
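To illustrate the redundancy Small refers to, here is a minimal Python sketch. It is not a description of Microsoft's systems: the server names, hostname and address are invented for illustration, and the name servers are simulated in memory rather than queried over the network. It shows that a lookup survives one server going down, but not the same record being removed from every server.

```python
from typing import Dict, Optional, Sequence

# Hypothetical stand-in for a set of name servers:
# server name -> its records (hostname -> IP), or None if the server is down.
Servers = Dict[str, Optional[Dict[str, str]]]


class ServerDown(Exception):
    """Raised when a simulated name server cannot be reached."""


def query(servers: Servers, server: str, hostname: str) -> Optional[str]:
    """Ask one simulated server for a record; None means 'no such record'."""
    zone = servers[server]
    if zone is None:
        raise ServerDown(server)
    return zone.get(hostname)


def resolve(servers: Servers, order: Sequence[str], hostname: str) -> Optional[str]:
    """Try each server in turn and return the first answer received."""
    for server in order:
        try:
            return query(servers, server, hostname)
        except ServerDown:
            continue  # redundancy: fall through to the next server
    return None  # every server was unreachable


if __name__ == "__main__":
    name = "mail.example.com"  # illustrative hostname
    record = {name: "192.0.2.10"}  # illustrative address (documentation range)

    # Case 1: one server is down, the other still holds the record.
    one_down: Servers = {"ns1": None, "ns2": dict(record)}
    print("one server down:", resolve(one_down, ["ns1", "ns2"], name))

    # Case 2: both servers are up, but the record is gone from both.
    records_gone: Servers = {"ns1": {}, "ns2": {}}
    print("records deleted:", resolve(records_gone, ["ns1", "ns2"], name))
```

In this toy model, spreading records across servers protects against a failed server but does nothing if the records themselves are corrupted or deleted on all of them.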
Either way, a more detailed post-incident report would put the speculation to rest.
“Why hasn’t Microsoft come clean?” Small asked.
“Microsoft’s explanation is nowhere near as detailed as what Google provided [for an hour-long Google Docs outage last week]. I’m a little concerned about that.
“Microsoft hasn’t given customers a clear understanding of just what plans are in place to make sure this doesn’t happen again.”