Cloudflare black-holed its own traffic for an hour

By Richard Chirgwin

Jun 22 2022 7:56AM

BGP slip took 19 data centres offline.

Cloudflare has attributed an hour-long outage yesterday to a BGP error that made 19 of its data centres invisible to the Internet.

The company has published a post-mortem of the outage, which was caused by a change to its BGP advertisement policies that accidentally withdrew route announcements for the affected data centres.

“Unfortunately, these 19 locations handle a significant proportion of our global traffic,” the company said.

“This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations.

“We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”

The company’s timeline shows that the outage began at 6.27am UTC (4.27pm AEST) on June 21, and the incident was closed at 8.00am UTC.

As the post explained, Cloudflare has undertaken an 18-month project to convert its busiest data centres to a “more flexible and resilient architecture” it has dubbed “Multi-Colo PoP” (MCP).

Locations using that architecture include Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, and Tokyo.

BGP the culprit

MCP locations use routing instructions that create a mesh of connections, and those routing instructions are carried in the venerable Internet standard called the Border Gateway Protocol (BGP).

Among other things, BGP lets operators define policies governing which IP address prefixes are advertised by routers to their peers, and which peers routers will accept advertisements from.

As the post explained: “These policies have individual components, which are evaluated sequentially. The end result is that any given prefixes will either be advertised or not advertised.

“A change in policy can mean a previously advertised prefix is no longer advertised, known as being ‘withdrawn’, and those IP addresses will no longer be reachable on the Internet.”

And that’s where Cloudflare’s MCP rollout went wrong: “While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.”
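To see why the order of terms matters, consider a minimal Python sketch of sequential policy evaluation. It is an illustration only, not Cloudflare's configuration: the term names, the documentation prefixes, and the first-match-wins rule are all assumptions made for the example.

```python
from ipaddress import ip_network

# Hypothetical policy terms as (name, predicate, action) tuples.
# Names and prefixes are illustrative, not taken from Cloudflare's config.
def is_site_local(prefix):
    return prefix.subnet_of(ip_network("192.0.2.0/24"))

def matches_anything(prefix):
    return True

ADVERTISE_SITE_LOCAL = ("advertise-site-local", is_site_local, "advertise")
REJECT_THE_REST = ("reject-the-rest", matches_anything, "reject")

def evaluate(policy, prefix):
    """Return the action of the first matching term (assumed first-match-wins)."""
    for _name, predicate, action in policy:
        if predicate(prefix):
            return action
    return "reject"  # assumed default when no term matches

def advertised(policy, prefixes):
    """List the prefixes that the policy would advertise."""
    return [str(p) for p in prefixes if evaluate(policy, p) == "advertise"]

prefixes = [ip_network("192.0.2.0/25"), ip_network("198.51.100.0/24")]

intended = [ADVERTISE_SITE_LOCAL, REJECT_THE_REST]
reordered = [REJECT_THE_REST, ADVERTISE_SITE_LOCAL]

print(advertised(intended, prefixes))   # ['192.0.2.0/25']
print(advertised(reordered, prefixes))  # []
```

Both policies contain the same two terms. Only the order differs, yet in the reordered version the catch-all reject term matches first and the site-local prefix is never advertised.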

That accidental change made spine routers unreachable over the Internet, making it initially difficult for Cloudflare’s engineers to access them and reverse the change.

The post highlighted how critical the affected locations are: “Even though these locations are only four percent of our total network, the outage impacted 50 percent of total [HTTP] requests.”

As well as making the affected locations invisible to the Internet, there was one more side-effect of the accidental configuration change: it disabled the company’s internal load balancing system.

“This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload,” it said.
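The effect can be sketched with a few lines of Python. The cluster names, capacities and traffic figure below are invented for illustration, and the capacity-weighted split is only an assumption about how such a load balancer might normally behave.

```python
# Hypothetical cluster capacities in requests per second; illustrative only.
capacities = {"large": 8000, "medium": 4000, "small": 1000}
total_traffic = 9000

def weighted_split(capacities, traffic):
    """Share traffic in proportion to each cluster's capacity."""
    total = sum(capacities.values())
    return {name: traffic * cap / total for name, cap in capacities.items()}

def equal_split(capacities, traffic):
    """Share traffic evenly, ignoring capacity (load balancing disabled)."""
    share = traffic / len(capacities)
    return {name: share for name in capacities}

def overloaded(capacities, load):
    """Return the clusters whose assigned load exceeds their capacity."""
    return sorted(name for name, cap in capacities.items() if load[name] > cap)

print(overloaded(capacities, weighted_split(capacities, total_traffic)))  # []
print(overloaded(capacities, equal_split(capacities, total_traffic)))     # ['small']
```

With the weighted split every cluster stays within its capacity. With an even split the same total traffic pushes the smallest cluster to three times what it can handle.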

The company said it will work on its processes, architecture, and automation to avoid a repeat of the incident.
