IBM reveals how it broke its own cloud

By Simon Sharwood

Aug 24 2018 6:50AM

Shut down servers by mistake... and then things got worse.

When you're selling the big cloud dream, it pays to come clean to customers about how and why things sometimes go painfully wrong.

In mid-July 2018 IBM’s cloud suffered a three-hour incident that saw its consoles either perform painfully slowly or become unavailable. The company has now revealed why.

An “Incident RFO” (reason for outage) document sent to customers explained that “while conducting maintenance on internal systems, IBM Cloud Engineers inadvertently restarted some Virtual Server Instances (VSIs) which were hosting applications for customer accounts.”

Clouds are supposed to be rather more resilient than that, so the start of the incident signals a certain immaturity in the Big Blue cloud.

The explanation continues: “once the VSIs completed their restarts, several of the services on the VSIs did not automatically restart and were preventing customers from accessing their accounts, logging in to their consoles or receiving time out errors.”

Again, this is not the kind of thing one expects from a top-tier cloud.

And the explanation gets even worse for IBM, revealing that “IBM Cloud Engineers began investigating and determined several VSIs were attempting to connect to the same API server which had an improper configuration.”

Which sounds like IBM’s cloud all but DDoSed itself - and did so to a server that both wasn’t ready to handle the traffic coming its way and didn’t fail over to another resource.
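For readers wondering how clients usually avoid piling onto one struggling server, a common pattern is to retry with exponential backoff and random jitter, so that many clients do not all reconnect at the same instant. The sketch below is illustrative only: the RFO does not say how IBM's VSIs connect to the API server, and the names `backoff_delay` and `call_with_backoff` and all the default values are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a random wait in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt and is limited to `cap`;
    picking a random point below it spreads clients out in time.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(call, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Run `call`, retrying on ConnectionError with jittered backoff.

    `call` stands in for any request to a shared server (hypothetical).
    The last failure is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

Passing `sleep` and `rng` as parameters is a design choice that makes the waiting behaviour easy to test without real delays.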

IBM staffers eventually figured things out and restored service. Fortunately, the incident took place on a Sunday, US time, so many users would have been spared the problem.

“To prevent future instances,” the RFO concludes, “IBM Cloud Engineers have updated the build documentation and have instituted code to prevent all VSIs from restarting at the same time to prevent the API server from becoming overloaded.”
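IBM has not published the code it describes, but the general idea of keeping every instance from restarting at once can be sketched as a batched restart with a pause between batches. Everything here is an assumption made for illustration: the `restart` callback, the batch size, and the pause length are hypothetical, not IBM's values.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def staggered_restart(instances, restart, batch_size=2,
                      pause_seconds=30.0, sleep=time.sleep):
    """Restart instances a few at a time rather than all together.

    `restart` is a caller-supplied function (hypothetical) that restarts
    one instance. The pause after each batch gives shared dependencies,
    such as an API server, time to absorb the reconnecting instances.
    """
    groups = list(batches(list(instances), batch_size))
    for index, group in enumerate(groups):
        for instance in group:
            restart(instance)
        # No pause is needed after the final batch.
        if index < len(groups) - 1:
            sleep(pause_seconds)
```

With five instances and a batch size of two, this restarts them in three groups and pauses twice.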

“We sincerely apologize for any inconvenience that this incident may have caused,” the sign-off on the incident report said.

Big Blue’s cloud does have impressive global presence, but analysts rate it as lagging rivals in terms of both features and suitability for enterprise use.

A major rebuild of the cloud is in the works, but is many months behind schedule. Mistakes such as those that led to this incident must surely be very unwelcome as IBM prioritises cloud products.


