'Remote hands' flub takes out much of Cloudflare

Multi-hour outage after maintenance mistake.

Web infrastructure and content delivery provider Cloudflare is experiencing serious disruption to several of its sites and services this morning, after losing network connectivity to a core data centre.

Full details are yet to be revealed, but the problems started around 1.30 am AEST and are caused by "a disruption that occurred during a maintenance," Cloudflare said on its status page, which is still accessible.

Cloudflare founder and chief executive Matthew Prince confirmed the issues, blaming them on "remote hands".

At the time of initial publishing, the Cloudflare Dashboard was inaccessible to customers. (See update below.)

Apart from the Dashboard, Cloudflare's API, domain name registrar, Argo encrypted tunnel, billing, provisioning of secure sockets layer certificates (including for software-as-a-service), enterprise logs, domain name service updates, and content delivery network cache purges all went offline due to the problem.

Cloudflare Workers, Storage, Spectrum, Stream, Load Balancing and Argo smart routing services continue to operate, but with degraded performance.

Some customers are reporting that the fault has left sites hosted via Cloudflare inaccessible.

Update 

“This never should have happened,” said Cloudflare’s chief executive Matthew Prince a few hours after the outage.

Prince said the plan was to decommission a rack of supposedly redundant equipment in a data centre.

However, while the unnamed equipment was indeed redundant, the same cabinet contained a critical patch panel for network leads.

“Its [the patch panel] removal caused multiple independent network connections to fail,” Prince said.

Compounding the network connection failures, Cloudflare decided not to cut over to a backup facility because of technical concerns, and in the belief that the engineers could get the primary facility back online faster than it turned out they could, he added.

Cloudflare will conduct a full post-mortem of the incident and make it public on its blog, once its services are fully restored and the company understands the mistakes it made.

As of this update, Cloudflare’s status page says all systems are operational.

Copyright © iTnews.com.au . All rights reserved.