When you're selling the big cloud dream, it pays to come clean to customers about how and why things sometimes go painfully wrong.
In mid-July 2018 IBM’s cloud suffered a three-hour incident that saw its consoles either perform painfully slowly or become unavailable. The company has now revealed why.
An “Incident RFO” (reason for outage) document sent to customers explained that “while conducting maintenance on internal systems, IBM Cloud Engineers inadvertently restarted some Virtual Server Instances (VSIs) which were hosting applications for customer accounts.”
Clouds are supposed to be rather more resilient than that, so the start of the incident signals a certain immaturity in the Big Blue cloud.
The explanation continues: “once the VSIs completed their restarts, several of the services on the VSIs did not automatically restart and were preventing customers from accessing their accounts, logging in to their consoles or receiving time out errors.”
Again, this is not the kind of thing one expects from a top-tier cloud.
And the explanation gets even worse for IBM, revealing that “IBM Cloud Engineers began investigating and determined several VSIs were attempting to connect to the same API server which had an improper configuration.”
Which sounds like IBM’s cloud all but DDoSed itself, and did so to a server that both wasn’t ready to handle the traffic coming its way and didn’t fail over to another resource.
IBM staffers eventually figured things out and restored service. Fortunately, the incident took place on a Sunday, US time, so many users would have been spared the problem.
“To prevent future instances,” the RFO concludes, “IBM Cloud Engineers have updated the build documentation and have instituted code to prevent all VSIs from restarting at the same time to prevent the API server from becoming overloaded.”
“We sincerely apologize for any inconvenience that this incident may have caused,” the sign-off on the incident report said.
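The RFO gives no detail on the code IBM put in place, so what follows is only a rough sketch of the general idea of staggering restarts so that instances do not all hit a shared API server at once. Every name and number here (staggered_restart_schedule, batch_size, batch_gap_s, jitter_s, the vsi-N labels) is hypothetical and chosen for illustration; none of it comes from IBM.

```python
import random


def staggered_restart_schedule(instances, batch_size=2, batch_gap_s=60.0,
                               jitter_s=10.0, seed=None):
    """Map each instance name to a restart delay in seconds.

    Instances are grouped into batches of `batch_size`. Each batch starts
    `batch_gap_s` seconds after the previous one, and every instance gets a
    small random jitter so that members of a batch do not restart in lockstep.
    All parameters are illustrative assumptions, not IBM's actual values.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 0 <= jitter_s < batch_gap_s:
        raise ValueError("jitter_s must be non-negative and smaller than batch_gap_s")

    rng = random.Random(seed)
    schedule = {}
    for index, name in enumerate(instances):
        batch = index // batch_size          # which restart window this instance falls in
        jitter = rng.uniform(0.0, jitter_s)  # spread restarts within the window
        schedule[name] = batch * batch_gap_s + jitter
    return schedule


if __name__ == "__main__":
    # Hypothetical instance names, for demonstration only.
    names = [f"vsi-{n}" for n in range(5)]
    plan = staggered_restart_schedule(names, seed=1)
    for name, delay in sorted(plan.items(), key=lambda item: item[1]):
        print(f"{name}: restart after {delay:.1f}s")
```

Keeping the jitter smaller than the gap between batches means restart windows never overlap, which caps how many instances can be reconnecting to the API server at any one moment.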
Big Blue’s cloud does have impressive global presence, but analysts rate it as lagging rivals in terms of both features and suitability for enterprise use.
A major rebuild of the cloud is in the works, but it is many months behind schedule. Mistakes such as those behind this incident must therefore be most unwelcome as IBM prioritises cloud products.