Cloudflare, one of the top web and internet services businesses, is having trouble again. This time, the Cloudflare Dashboard and its related application programming interfaces (APIs) have gone down. The silver lining is that these issues are not affecting the serving of cached files via the Cloudflare Content Delivery Network (CDN) or the Cloudflare Edge security features. No, those went down last week.
On Oct. 30, Cloudflare rolled out a failed update to its globally distributed key-value store, Workers KV. The result was that all of Cloudflare's services were down for 37 minutes. This new problem isn't nearly as serious, but it's now been going on for over 24 hours. As one person on Ycombinator put it, "I never experienced a longer than 12 hours outage with any service provider over my ~13 years career (maybe I was lucky). But thanks to Cloudflare, I have been able to enjoy not just one, but two ~24h outages in not even a month!"
As of 12:30 PM Eastern time, November 3, 2023, Cloudflare reported that it's still "working to restore impacted services."
Cloudflare disclosed that the snags have affected a slew of products at the data plane and edge level. These include Logpush, WARP / Zero Trust device posture, Cloudflare dashboard, Cloudflare API, Stream API, Workers API, and Alert Notification System.
Other programs are still running, but you can't modify their settings. These are Magic Transit, Argo Smart Routing, Workers KV, WAF, Rate Limiting, Rules, WARP / Zero Trust Registration, Waiting Room, Load Balancing and Healthchecks, Cloudflare Pages, Zero Trust Gateway, DNS Authoritative and Secondary, Cloudflare Tunnel, Workers KV namespace operations, and Magic WAN.
Cloudflare failures are a big deal. As John Engates, Cloudflare's field CTO, recently tweeted, "Cloudflare processes about 26 million DNS queries every SECOND! Or 68 trillion/month. Plus, we blocked an average of 140 billion cyber threats daily in Q2'23."
The root cause of these problems is a data center power failure combined with a failure of services to switch over from data centers having trouble to those still functioning.
Cloudflare gave a fuller explanation of what happened:
We operate in multiple redundant data centers in Oregon that power Cloudflare's control plane (dashboard, logging, etc). There was a regional power issue that impacted multiple facilities in the region. The facilities failed over to generator power overnight. Then, this morning, there were multiple generator failures that took the facilities entirely offline. We have failed over to our disaster recovery facility and most of our services are restored. This data center outage impacted Cloudflare's dashboards and APIs, but it did not impact traffic flowing through our global network. We are working with our data center vendors to investigate the root cause of the regional power outage and generator failures. We expect to publish multiple blogs based on what we learn and can share those with you when they're live.
Cloudflare is still working to resolve this problem. But since the problem stems from data center power outages rather than Cloudflare's own software, solving it may be outside the company's control. Hang in there, folks. Fixing this may take a while. That said, no one expected it to take this long.