Reverse proxy and content delivery network Cloudflare has revealed that a coding fumble was behind yesterday's half-hour global outage, not an attack as some had speculated.
Cloudflare chief technology officer John Graham-Cumming posted an initial brief post-mortem about the outage, saying the company is "incredibly sorry that this incident occurred".
Graham-Cumming said the outage, which started just before midnight Australian time, was caused by a single misconfigured rule within the company's web application firewall (WAF).
The rule contained a regular expression used to match and act on certain conditions; the expression caused processor usage to spike to 100 percent on Cloudflare's machines worldwide, a problem the company's engineers had not experienced before.
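The brief post-mortem doesn't quote the offending pattern, but the failure mode it describes is a classic one: a regular expression with nested quantifiers can force a backtracking engine to explore exponentially many ways of splitting the input when a match fails, pinning a CPU core on a single request. A minimal Python sketch (the pattern and function names here are illustrative, not Cloudflare's actual rule):

```python
import re
import time

# Illustrative pathological pattern, not the real WAF rule.
# The nested quantifier (a+)+ lets the engine partition a run of 'a's
# in exponentially many ways; because the trailing 'b' never appears,
# every partition is tried and rejected before the match fails.
PATHOLOGICAL = re.compile(r"^(a+)+b$")

def match_time(n: int) -> float:
    """Time how long the engine spends failing to match n 'a's."""
    text = "a" * n  # no trailing 'b', so the overall match must fail
    start = time.perf_counter()
    assert PATHOLOGICAL.match(text) is None
    return time.perf_counter() - start

# Each extra character roughly doubles the work: O(2^n) backtracking.
print(f"16 chars: {match_time(16):.5f}s")
print(f"22 chars: {match_time(22):.5f}s")
```

A handful of extra characters in the input is enough to turn a microsecond check into a multi-second stall, which is how one rule can saturate CPUs fleet-wide.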
Because of the excessive CPU utilisation, customers received 502 Bad Gateway errors when they tried to access sites proxied by Cloudflare.
While the rules were deployed in simulated mode, in which issues are identified and logged against live traffic without blocking it so that false positives can be measured, the rogue regex still consumed CPU, and traffic dropped 82 percent worldwide until Cloudflare issued a global kill to remove the WAF rulesets.
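The two mechanisms in play here, a simulate flag on new rules and a global kill switch, can be sketched as a toy rule engine. All names and structure below are invented for illustration; they are not Cloudflare's actual system:

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    """A hypothetical WAF rule; 'simulate' means log matches, never block."""
    name: str
    pattern: str
    simulate: bool = True

class Waf:
    def __init__(self) -> None:
        self.rules: list[Rule] = []
        self.killed: set[str] = set()  # a global kill disables a rule everywhere

    def deploy(self, rule: Rule) -> None:
        self.rules.append(rule)

    def global_kill(self, name: str) -> None:
        self.killed.add(name)

    def handle(self, request: str) -> str:
        for rule in self.rules:
            if rule.name in self.killed:
                continue  # killed rules are skipped entirely
            # Note: even in simulate mode the pattern still executes against
            # live traffic, which is why a pathological regex burns CPU anyway.
            if re.search(rule.pattern, request):
                if rule.simulate:
                    print(f"[simulate] {rule.name} matched")  # count false positives
                    continue
                return "403 Forbidden"
        return "200 OK"

waf = Waf()
waf.deploy(Rule("xss-check", r"<script>", simulate=True))
print(waf.handle("GET /?q=<script>alert(1)</script>"))  # logged, not blocked
```

The sketch makes the key point visible: simulate mode protects customers from false-positive blocks, but it does not protect the servers from the cost of evaluating the rule itself.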
Graham-Cumming said incidents like this one are "very painful for our customers" and added that Cloudflare's testing procedures were insufficient in this case.
He promised a review and changes to Cloudflare's testing and deployment process to avoid future incidents of the same kind.