IBM reveals how it broke its own cloud

Shut down servers by mistake... and then things got worse.

When you're selling the big cloud dream, it pays to come clean to customers about how and why things sometimes go painfully wrong.

In mid-July 2018 IBM’s cloud suffered a three-hour incident that saw its consoles either perform painfully slowly or become unavailable. The company has now revealed why.

An “Incident RFO” (reason for outage) document sent to customers explained that “while conducting maintenance on internal systems, IBM Cloud Engineers inadvertently restarted some Virtual Server Instances (VSIs) which were hosting applications for customer accounts.”

Clouds are supposed to be rather more resilient than that, so the start of the incident signals a certain immaturity in the Big Blue cloud.

The explanation continues: “once the VSIs completed their restarts, several of the services on the VSIs did not automatically restart and were preventing customers from accessing their accounts, logging in to their consoles or receiving time out errors.”

Again, this is not the kind of thing one expects from a top-tier cloud.

And the explanation gets even worse for IBM, revealing that “IBM Cloud Engineers began investigating and determined several VSIs were attempting to connect to the same API server which had an improper configuration.”

Which sounds like IBM’s cloud all-but-DDoSed itself, and did so to a server that both wasn’t ready to handle the traffic coming its way and didn’t fail over to another resource.
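For readers wondering what failing over might look like from the client side, here is a minimal sketch. It is not IBM's code: the RFO does not describe how the VSIs call the API server, so the function name `call_with_failover`, the endpoint names and the retry settings are all hypothetical. It retries each endpoint a few times with randomised, growing pauses, then moves on to the next one.

```python
import random
import time
from typing import Callable, Optional, Sequence


def call_with_failover(
    endpoints: Sequence[str],
    request: Callable[[str], str],
    attempts_per_endpoint: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    """Try each endpoint in order; all names and defaults are illustrative."""
    rng = rng or random.Random()
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return request(endpoint)
            except ConnectionError as exc:
                last_error = exc
                # Random pause with a growing upper bound, so many clients
                # do not retry against the same server in lockstep.
                sleep(rng.uniform(0.0, base_delay_s * (2 ** attempt)))
    raise ConnectionError("all endpoints failed") from last_error
```

The `sleep` and `rng` parameters exist only so the pauses can be replaced or seeded in a test.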

IBM staffers eventually figured things out and restored service. Fortunately, the incident took place on a Sunday, US time, so many users would have been spared the problem.

“To prevent future instances,” the RFO concludes, “IBM Cloud Engineers have updated the build documentation and have instituted code to prevent all VSIs from restarting at the same time to prevent the API server from becoming overloaded.”
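IBM has not published the code it instituted, but the general idea of keeping instances from restarting all at once can be sketched in a few lines. The example below is an assumption-laden illustration, not IBM's implementation: the function name `staggered_restart_plan`, the batch size, the interval and the optional jitter are all hypothetical. It only computes when each instance should restart; actually triggering the restarts is left out.

```python
import random
from typing import List, Optional, Sequence, Tuple


def staggered_restart_plan(
    instances: Sequence[str],
    batch_size: int,
    batch_interval_s: float,
    max_jitter_s: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[Tuple[float, str]]:
    """Return (start_offset_seconds, instance) pairs sorted by offset."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rng = rng or random.Random()
    plan = []
    for index, name in enumerate(instances):
        # Instances in the same batch share a base offset; later batches
        # wait one more interval, so only batch_size restart together.
        batch = index // batch_size
        offset = batch * batch_interval_s + rng.uniform(0.0, max_jitter_s)
        plan.append((offset, name))
    return sorted(plan)


if __name__ == "__main__":
    for offset, name in staggered_restart_plan(
        [f"vsi-{i}" for i in range(6)], batch_size=2, batch_interval_s=30.0
    ):
        print(f"{offset:6.1f}s  restart {name}")
```

With six instances, a batch size of two and a 30-second interval, the restarts begin at 0, 30 and 60 seconds, two at a time, so a shared API server would see at most two instances reconnecting in any one wave.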

“We sincerely apologize for any inconvenience that this incident may have caused,” the sign-off on the incident report said.

Big Blue’s cloud does have an impressive global presence, but analysts rate it as lagging rivals in terms of both features and suitability for enterprise use.

A major rebuild of the cloud is in the works, but is many months behind schedule. Mistakes such as those behind this incident must therefore be very unwelcome as IBM prioritises cloud products.

Copyright © iTnews.com.au . All rights reserved.