Cloudflare
Outage
Internet
Infrastructure
Bot Management
ClickHouse
DNS
Inside Cloudflare's Worst Outage Since 2019: How One Feature File Took Down Half the Internet
November 19, 2025
16 min read
On a Tuesday morning in late November, the web buckled. Not in the dramatic, Hollywood cyber-meltdown way—with flashing red maps and ominous warnings—but in the quieter, more irritating way most people recognize: pages that won't load, apps that spin forever, and login screens that suddenly stop working. If you were online around 11:20 UTC on November 18, 2025, you might've assumed your Wi-Fi was having a tantrum. In reality, a chunk of the Internet's circulatory system was glitching all at once.
Cloudflare, the giant behind what it calls the "connectivity cloud"—a sprawling global network that accelerates websites, filters attacks, powers APIs, handles authentication, manages DNS, and acts as the connective tissue for a frightening portion of global traffic—hit its biggest outage in six years. And for a few hours, Cloudflare wasn't doing any of that. Instead, it was throwing 5xx errors like confetti.
The moment it started, people scrambled: site operators, dev teams, regular users trying to load their bank's login page. Cloudflare's own engineers were scrambling, too, trying to figure out why core proxy systems were spiraling, bots weren't being scored correctly, and traffic was yo-yoing between healthy and broken states every five minutes.
From the outside, it felt like a major attack. Cloudflare's own internal teams thought it was an attack. Their external status page, which is hosted outside Cloudflare's infrastructure, even went down. It was the kind of coincidence that doesn't feel like coincidence.
But the villain wasn't a DDoS gang, a state actor, or some new botnet flexing its muscles.
It was a file.
A feature file.
A text file that doubled in size and blew past a hard limit that had been quietly waiting to bite.
Inside that file was a cluster of duplicated features generated by a well-intentioned database permission change. That change—small, routine, and theoretically harmless—created a chain reaction that shut down a massive portion of global Internet traffic.
This is the story of how a single configuration artifact inside Cloudflare's Bot Management system brought down the biggest edge network on the planet.
## The morning the Internet coughed
The first sign of trouble showed up around 11:20 UTC: a global spike of HTTP 5xx errors across Cloudflare's network. These errors aren't rare in the wild, but across Cloudflare's backbone they should be low and steady. A sudden explosion meant something serious was misbehaving.
The strange part wasn't just the spike—it was the oscillation. The system would fail, then recover, then fail again. Five-minute intervals. A weird heartbeat.
Engineers immediately pulled metrics, logs, and traces, chasing the idea that maybe this was a massive distributed attack. They'd seen plenty recently, including Aisuru-style hyperscale events. But the errors weren't lining up neatly with any clear traffic signature. Something internal was thrashing.
Meanwhile, Cloudflare customers saw error pages making it clear the failure sat inside Cloudflare's infrastructure, not their own sites. Apps dependent on Cloudflare Workers KV were throwing errors. Turnstile wasn't loading on login pages. Authentication inside Cloudflare Access stopped working for anyone who wasn't already logged in. People refreshing pages weren't seeing normal retry behavior; they were just hitting a wall.
Inside the company, a flurry of incident messages flew by. The coincidence of Cloudflare's external status page going down only fueled the suspicion that this was an ongoing attack. It wasn't. But at that moment, imaginations were running hot.
What nobody knew yet was that every five minutes, Cloudflare's network was flipping a coin:
Good configuration file? Everything works.
Bad configuration file? The proxy core panics.
It all depended on which database node handled the next query.
## The trigger: a database permissions update nobody expected to matter
This entire episode starts with a ClickHouse cluster—the database engine powering part of Cloudflare's analytics and machine-learning feature pipeline.
ClickHouse clusters are sharded. Cloudflare's setup includes two key database layers:
**default**: where the distributed tables that incoming queries hit actually live
**r0**: where the underlying tables sit, per shard
Historically, users querying metadata via system tables (like `system.columns`) only saw the tables in the default schema. But Cloudflare engineers were rolling out a change: distributed queries should run under the initiating user's context, not a shared system account. It's a good goal: fewer broad permissions, better isolation, tighter control.
At 11:05 UTC, a permissions update went live that allowed database users to explicitly see the underlying tables in r0 as well.
This is where things went sideways.
A query used to generate Cloudflare's Bot Management feature file looked like this:
```sql
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
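-- note: nothing filters on the `database` column, so this matches every schema
-- the querying user can see; a filter here (e.g. AND database = 'default')
-- would have kept the results to a single schema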
ORDER BY name;
```
Notice the missing filter: it doesn't specify the database. Before the permission change, that didn't matter because the system only surfaced one table's metadata. But after the change? Suddenly the query returned two sets of identical-looking features—one from the default schema, one from r0.
That meant duplicate features. And duplicate features meant the resulting feature file effectively doubled in size.
This file wasn't some obscure log dump nobody reads. It feeds the machine learning model that every Cloudflare request passes through during bot scoring. The core proxy on every Cloudflare machine loads it into memory frequently. It's one of the most time-sensitive pieces of config Cloudflare produces, refreshed constantly, shipped everywhere.
And every proxy worker process had a strict limit:
It could not handle more than 200 features.
Normal models used ~60. Engineers had plenty of headroom. Until suddenly they didn't.
Once the file exceeded that ceiling, the proxy hit a Rust panic in FL2—the newer generation of Cloudflare's frontline proxy engine—after running head-first into an unhandled error condition in the feature loader.
Panic → crash → 5xx errors everywhere.
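The crash-instead-of-degrade behavior is easier to see in miniature. Here is a minimal Rust sketch of the pattern, not Cloudflare's actual FL2 code; the function names and counts are invented. A loader enforces the hard cap and returns an ordinary error, but the caller treats that error as impossible and unwraps it, so an oversized file becomes a process-killing panic instead of a rejected config.

```rust
// A minimal sketch of the failure pattern (names and counts invented;
// this is not Cloudflare's FL2 source).
const FEATURE_LIMIT: usize = 200; // hard cap on features, per the post-mortem

fn load_features(lines: &[String]) -> Result<Vec<String>, String> {
    if lines.len() > FEATURE_LIMIT {
        // The loader itself rejects an oversized file with a normal error...
        return Err(format!(
            "feature file has {} entries, limit is {}",
            lines.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(lines.to_vec())
}

fn main() {
    // A normal file (roughly 60 features) loads fine.
    let normal: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();
    load_features(&normal).unwrap();
    println!("normal file loaded");

    // An oversized file (count illustrative) trips the cap. Because the caller
    // unwraps the Result, the Err turns into a panic that kills the worker:
    // panic -> crash -> 5xx, the chain described above.
    let oversized: Vec<String> = (0..420).map(|i| format!("feature_{i}")).collect();
    load_features(&oversized).unwrap(); // panics here
}
```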
## Why the network kept flickering
Every five minutes, a scheduled job rebuilt the bot feature file. But only some nodes in the ClickHouse cluster had the new permissions. So depending on which node handled the query in that moment, Cloudflare produced either:
- a valid feature configuration file, or
- a bad one, too large to load
Good file? Global recovery.
Bad file? Global outage.
This roulette-wheel effect made the early investigation brutally confusing. Any time a team thought the problem might be stabilizing, the next propagation tick could instantly break the network again.
Eventually, the rollout of permissions reached every node. Once that happened, the network stopped flickering because the feature file was consistently bad—and consistently breaking everything.
Only then did the outage plateau into a steady, clear failure mode.
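To make the roulette-wheel effect concrete, here is a toy Rust model, assuming an invented node layout and rotation order; only the mechanism mirrors the write-up. The rebuild job takes whichever ClickHouse node it lands on, and only nodes that have already received the new grants produce the duplicated, oversized file.

```rust
// Toy model of the five-minute flip-flop. The node layout and rotation order
// are invented for illustration; only the mechanism mirrors the write-up.
struct ClickHouseNode {
    has_new_grants: bool, // updated nodes expose both `default` and `r0` metadata
}

// The rebuild job asks whichever node it lands on; updated nodes return the
// duplicated metadata that produces an oversized, unloadable feature file.
fn rebuild_feature_file(node: &ClickHouseNode) -> Result<&'static str, &'static str> {
    if node.has_new_grants {
        Err("oversized file: duplicate features from default and r0")
    } else {
        Ok("valid feature file")
    }
}

fn main() {
    // Mid-rollout: a mix of updated and not-yet-updated nodes.
    let cluster = [
        ClickHouseNode { has_new_grants: false },
        ClickHouseNode { has_new_grants: true },
        ClickHouseNode { has_new_grants: false },
        ClickHouseNode { has_new_grants: true },
        ClickHouseNode { has_new_grants: true },
    ];

    // One rebuild every five minutes; which node answers decides the outcome.
    for (tick, node) in cluster.iter().enumerate() {
        match rebuild_feature_file(node) {
            Ok(file) => println!("t+{}m: {file}, network recovers", tick * 5),
            Err(err) => println!("t+{}m: {err}, global 5xx errors", tick * 5),
        }
    }
    // Once every node has the new grants, every rebuild yields the bad file:
    // the flickering stops and the outage becomes steady.
}
```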
## The cascading impact: KV failures, login failures, missing bot scores
Once the core proxy choked, downstream subsystems fell over like dominoes:
**Workers KV**
Its front-end gateway depends on the proxy path. 5xx errors skyrocketed.
**Turnstile**
Cloudflare's human-verification widget simply stopped loading. Anyone trying to log in to a dashboard with Turnstile on the page was blocked.
**Access**
Authentication broke for new sessions. Already-logged-in users were fine; everyone else got error pages.
**Cloudflare dashboard**
Most users couldn't log in during large portions of the outage. When things were partially restored later, an influx of retry attempts created its own mini-traffic jam.
**Bot Management itself**
On the older FL proxy engine, bot scores didn't fail, but they went to zero. That meant customers blocking based on bot score suddenly blocked everything. False positives everywhere (see the sketch below).
This wasn't "a part of Cloudflare is glitching." It was the core of the core.
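A tiny hedged sketch (not Cloudflare's rules engine; the threshold value is invented) shows why zeroed scores are so destructive: any rule of the form "block if the bot score is below N" blocks every request the moment scores collapse to zero.

```rust
// Illustrative only: a customer-style rule that blocks requests scoring
// below a threshold. Names and the threshold value are invented.
fn should_block(bot_score: u8, threshold: u8) -> bool {
    bot_score < threshold
}

fn main() {
    let threshold = 30; // invented; real thresholds vary per customer

    // Normal operation: a likely-human request with a healthy score passes.
    assert!(!should_block(85, threshold));

    // During the outage, the old FL engine emitted a score of 0 for every
    // request, so the same rule blocked everyone: false positives everywhere.
    assert!(should_block(0, threshold));
    println!("a score of 0 is below any positive threshold, so everything gets blocked");
}
```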
## Diagnosing the issue: the long road from suspicion to root cause
Engineers noticed load spikes, retries, and odd behaviors. They tried rate limiting and traffic shifts to see if the system would stabilize. Nothing helped.
At 13:05 UTC—almost two hours after impact began—the team began bypassing the core proxy for Workers KV and Access. That reduced some symptoms, buying breathing room.
By 13:37 UTC, teams were zeroing in on the Bot Management configuration file as the trigger. Multiple parallel workstreams spun up: examine ClickHouse, roll back database changes, study the device logs, trace the feature file generation pipeline.
Finally, at 14:24 UTC, Cloudflare halted the generation and propagation of new feature files. They injected a known-good version and pushed it globally.
By 14:30 UTC, the Internet exhaled. Core traffic began flowing again.
The tail end of the outage—until 17:06 UTC—was cleanup: restarting services, draining queues, stabilizing caches, clearing bad states.
But the real root cause had already been found: a silent assumption, deep inside the feature-generation pipeline, that a metadata query would only ever return one database's worth of columns.
## The part Cloudflare didn't sugarcoat: this should never have happened
Cloudflare's CEO wrote the post-incident breakdown himself. That's unusual. It signals severity.
This was their worst outage since 2019. And the company didn't try to spin it. The mistakes were clear:
- A database permissions change altered query behavior in an unexpected way.
- A critical ML configuration file lacked guardrails for malformed content.
- The core proxy didn't fail gracefully when limits were exceeded; it panicked.
- Error-debugging systems themselves consumed enough CPU to make the outage worse.
- No kill switch prevented propagation of the broken config.
- And the feature file rollout pipeline implicitly trusted internal data.
For a company that prides itself on resilience, redundancy, and network-level fault isolation, these are painful truths.
## A surprisingly human moment: the internal fear of a coordinated attack
One detail stood out from the community's discussion: Cloudflare's engineers genuinely thought the outage might be an attack. The status page going down independently—and coincidentally—looked like someone was hitting Cloudflare and its external health checker at the same time.
It wasn't.
But it made the early minutes chaotic.
Customers online echoed the same suspicion. After a year full of massive cloud-scale outages, people aren't imagining threats; they've seen enough to assume the worst whenever the Internet groans.
## Community reactions: frustration, confusion, and some sharp questions
In discussions around the outage, people raised real concerns:
**Why did diagnosis take so long?**
Commenters wondered how change management didn't surface the ClickHouse update instantly.
**Should Cloudflare compensate customers?**
The SLA technically offers credits, but users argue it doesn't cover real business impact.
**Why do Cloudflare outages feel more frequent lately?**
Some pointed to recent issues at other large clouds, arguing reliability across the industry seems shakier.
**Why is Bot Management in the critical request path?**
It makes sense technically, but it creates a single-point-of-misconfiguration that few customers realize is there.
These weren't rage comments. They were practical questions from people who rely on Cloudflare to be boring, consistent, and invisible.
## The fix and the path forward
Cloudflare laid out the post-mortem actions clearly:
- Harden ingestion of internal config files just like untrusted user input (see the sketch after this list)
- Add global kill switches to prevent runaway config rollouts
- Improve failure handling across proxy modules
- Ensure debug systems don't overload CPUs during outages
- Revisit limits and memory-allocation assumptions
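As a rough illustration of what the first three items could look like in practice, here is a hedged Rust sketch with invented names, not Cloudflare's remediation code: validate a new config as if it were untrusted input, and fall back to the last known-good version instead of crashing when validation fails.

```rust
// Hedged sketch of "treat internal config like untrusted input" with a
// last-known-good fallback. All names and counts are invented for illustration.
const FEATURE_LIMIT: usize = 200;

#[derive(Clone)]
struct FeatureConfig {
    features: Vec<String>,
}

// Validate a candidate config the same way untrusted user input would be checked.
fn validate(candidate: &FeatureConfig) -> Result<(), String> {
    if candidate.features.is_empty() {
        return Err("empty feature list".into());
    }
    if candidate.features.len() > FEATURE_LIMIT {
        return Err(format!(
            "{} features exceeds the limit of {}",
            candidate.features.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(())
}

// Apply the candidate only if it validates; otherwise keep serving with the
// last known-good config and report the rejection instead of crashing.
fn apply_or_keep(current: FeatureConfig, candidate: FeatureConfig) -> FeatureConfig {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(reason) => {
            eprintln!("rejected new feature file: {reason}; keeping previous version");
            current
        }
    }
}

fn main() {
    let known_good = FeatureConfig {
        features: (0..60).map(|i| format!("feature_{i}")).collect(),
    };
    // A doubled, oversized file is rejected and traffic keeps flowing on the old one.
    let oversized = FeatureConfig {
        features: (0..420).map(|i| format!("feature_{i}")).collect(),
    };
    let active = apply_or_keep(known_good.clone(), oversized);
    assert_eq!(active.features.len(), 60);
    println!("still serving with the last known-good config");
}
```

A global kill switch would sit one layer above this sketch: an operator-controlled flag that stops new config versions from being applied at all while an incident is being investigated.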
It's an ambitious list, but Cloudflare has a history: every major outage leads to an architectural improvement. Their network today is more resilient than it was in 2019 because of past failures. After this, it will be even more battle-hardened.
## A single file, a global chain reaction
It's wild to think that a doubled feature file—a simple data artifact, not code—could topple workloads across continents. But that's the world we live in: interconnected systems where the smallest internal assumption can have a planetary blast radius.
This outage wasn't about a hacker or a runaway botnet. It was about a hidden dependency, exposed by a routine permission change, multiplied across tens of thousands of machines in minutes.
Cloudflare apologized, owned the failure, and started the post-mortem work. But the takeaway is bigger than Cloudflare:
The Internet is now a web of distributed systems that all trust each other's data, pipelines, and self-updating agents. When one assumption breaks, everything after it feels it.
That's not comforting. But it's the reality of the modern web.
On November 18, 2025, a feature file doubled in size. And for a few hours, the Internet shrank with it.