Web infrastructure and content delivery provider Cloudflare is experiencing serious disruption to several of its sites and services this morning, after losing network connectivity to a core data centre.
Full details are yet to be revealed, but the problems started around 1.30 am AEST and were caused by "a disruption that occurred during a maintenance," Cloudflare said on its status page, which is still accessible.
Cloudflare founder and chief executive Matthew Prince confirmed the issues, blaming them on "remote hands".
“During planned maintenance remote hands decommissioned some equipment that they shouldn’t have. We’re failing over to a backup facility and working to get the equipment back online. Not attack related. Doesn’t impact performance of the network.” (Matthew Prince, @eastdakota, April 15, 2020)
At the time of initial publishing, the Cloudflare Dashboard was inaccessible for customers. (See update below).
Apart from the Dashboard, Cloudflare's API, domain name registrar, Argo encrypted tunnel, billing, provisioning of secure sockets layer (SSL) certificates and of SSL for software-as-a-service, enterprise logs, domain name system updates, and content delivery network cache purges all went offline due to the problem.
Cloudflare Workers, Storage, Spectrum, Stream, Load Balancing and Argo smart routing services continue to operate, but with degraded performance.
“This never should have happened,” said Cloudflare’s chief executive Matthew Prince a few hours after the outage.
Prince said the plan was to decommission a rack of equipment, believed to be redundant, in a data centre.
However, while the unnamed equipment was indeed redundant, the same cabinet contained a network lead patch panel that was critical.
“Its [the patch panel] removal caused multiple independent network connections to fail,” Prince said.
Compounding the network connection failures, Cloudflare decided not to cut over to a backup facility because of technical concerns and a belief that engineers could bring the primary facility back online faster than they ultimately did, he added.
Cloudflare will conduct a full post-mortem of the incident and make it public on its blog, once its services are fully restored and the company understands the mistakes it made.
As of this update, Cloudflare’s status page says all systems are operational.