Inside AWS's October Outage and What Went Wrong
October 24, 2025
8 min read
For over 14 hours, the internet's most critical backbone quietly buckled under the weight of a DNS bug. On October 19th and 20th, Amazon Web Services (AWS) experienced a cascading failure in its Northern Virginia (us-east-1) region, one of its largest and most essential data hubs. The disruption started late at night, but its ripple effects hit everything from EC2 launches and Lambda invocations to container services and even Amazon Connect's call routing.
You can read [AWS's official post-mortem here](https://aws.amazon.com/message/101925/).
So what happened?
## It Started with a DNS Glitch
At 11:48 PM PDT on October 19, AWS's DynamoDB—the NoSQL database that powers countless apps and internal AWS services—began returning errors at a sharply elevated rate. That's tech-speak for "everything started breaking."
The culprit? A subtle race condition deep inside DynamoDB's automated DNS management system. Imagine two engineers trying to edit the same spreadsheet at once. One finishes and saves. The other, working on an old version, overwrites it. The result? An empty DNS record that took DynamoDB offline for anyone trying to connect via its public endpoint.
While that sounds like a manageable glitch, the impact was anything but. DynamoDB is a cornerstone service—if it coughs, everything around it starts wheezing.
## When Automation Backfires
To understand why a DNS bug brought down the house, you have to peek under the hood of how AWS manages service endpoints. DynamoDB operates thousands of load balancers. To juggle all that traffic, AWS relies on a highly automated DNS planner (to build plans) and DNS enactors (to apply them to Route 53).
These systems are supposed to work in harmony, but that night one enactor got delayed. While it was slowly applying an older DNS plan, another enactor zipped through with a newer one. The delayed enactor then overwrote the newer plan with its stale one, and a cleanup process deleted that stale plan as obsolete, leaving the record empty. Result? The public endpoint vanished, DNS routing broke, and thousands of apps hit dead ends.
And because this wasn't a one-off DNS misconfiguration but a broken state the automation itself had locked in, it took human operators to step in and untangle it.
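To make the failure mode concrete, here's a minimal sketch of the race, not AWS's actual code: two "enactors" race to apply DNS plans to a shared record store, and a monotonically increasing plan ID is the guard that keeps a stale plan from clobbering a newer one. All names and numbers are illustrative.

```python
import threading

# Toy in-memory "DNS record store"; the real system writes to Route 53.
record_store = {"dynamodb.example.com": {"plan_id": 0, "targets": ["lb-1", "lb-2"]}}
lock = threading.Lock()

def apply_plan(endpoint: str, plan_id: int, targets: list[str]) -> bool:
    """Apply a DNS plan only if it is newer than what is already live.

    Without the plan_id comparison, a delayed enactor applying an old plan
    would silently overwrite a newer one -- the race described above.
    """
    with lock:
        current = record_store[endpoint]
        if plan_id <= current["plan_id"]:
            # Stale plan: refuse to apply it instead of clobbering newer state.
            return False
        record_store[endpoint] = {"plan_id": plan_id, "targets": targets}
        return True

# A slow enactor holding plan 41 and a fast enactor with plan 42 race each other.
slow = threading.Thread(target=apply_plan, args=("dynamodb.example.com", 41, ["lb-old"]))
fast = threading.Thread(target=apply_plan, args=("dynamodb.example.com", 42, ["lb-3", "lb-4"]))
fast.start()
slow.start()
fast.join()
slow.join()

print(record_store)  # plan 42 survives regardless of which thread ran last
```

The real systems are far more distributed than a threading demo, but the core defense is the same: every write carries a version, and stale versions get rejected rather than applied.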
## EC2 Gets Squeezed
When DynamoDB dropped, so did parts of EC2, but not for the reasons you'd expect. Existing instances were fine. New ones? Not so much.
The launch process for new EC2 instances depends on a subsystem called DWFM (Droplet Workflow Manager). DWFM uses DynamoDB to manage leases for physical servers (droplets). Once the DNS chaos took out DynamoDB, DWFM couldn't maintain those leases. It slowly started losing grip on the infrastructure.
When AWS brought DynamoDB back online, DWFM rushed to re-establish leases—but the backlog was too large, and retries kept timing out. It spiraled into what AWS engineers called a "congestive collapse." The fix? Throttle traffic, restart components, and do some old-school manual intervention.
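The congestive-collapse pattern is worth internalizing: when every worker retries at once, nothing finishes and the backlog only grows. Here's a hedged sketch of the general remedy, bounded concurrency plus jittered backoff; the function names, numbers, and simulated failures are placeholders, not AWS's implementation.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def reestablish_lease(droplet_id: str) -> bool:
    """Placeholder for the real lease handshake; assume it can fail or time out."""
    time.sleep(0.01)
    return random.random() > 0.2  # roughly 20% simulated failures

def drain_backlog(droplet_ids: list[str], max_workers: int = 8, max_attempts: int = 5) -> None:
    """Re-establish leases with bounded concurrency and jittered backoff.

    Capping in-flight work and spreading retries over time is the standard
    defense against the retry storm that turns a recovery into a collapse.
    """
    def worker(droplet_id: str) -> None:
        for attempt in range(1, max_attempts + 1):
            if reestablish_lease(droplet_id):
                return
            # Exponential backoff with jitter so retries don't synchronize.
            time.sleep(min(2 ** attempt, 30) * random.uniform(0.5, 1.5) * 0.01)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pool.map(worker, droplet_ids)

drain_backlog([f"droplet-{i}" for i in range(100)])
```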
By early afternoon on October 20, EC2 launches were mostly back. But that wasn't the end of it.
## Load Balancers Went Wobbly
The Network Load Balancer (NLB) service was next to trip over the dominoes. NLB uses health checks to determine whether backend targets (typically EC2 instances) are healthy enough to serve traffic. The catch? Those health checks ran before EC2's network state was fully restored, so NLB wrongly marked perfectly fine targets as dead.
This led to a back-and-forth loop—health checks failed, instances were removed, then passed, and were put back. That ping-pong overloaded the health check system and caused DNS failovers, which ironically removed even more capacity. Some users watched their traffic get rerouted, only to end up on the same unstable path.
The fix came around 9:36 AM when AWS disabled auto-failover temporarily, stabilizing things.
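AWS's planned "velocity control" (described in the fixes below) amounts to rate-limiting how much capacity health checks may remove in a given window. A hypothetical sketch of that idea, with made-up names and thresholds:

```python
import time

class VelocityLimiter:
    """Cap how many targets can be marked unhealthy per time window.

    A hypothetical take on "velocity control": if health checks are flapping,
    refuse to pull more than `max_removals` targets in any `window_seconds`
    span, so a noisy signal can't drain the whole fleet.
    """
    def __init__(self, max_removals: int = 5, window_seconds: float = 60.0):
        self.max_removals = max_removals
        self.window_seconds = window_seconds
        self.removals: list[float] = []  # timestamps of recent removals

    def allow_removal(self) -> bool:
        now = time.monotonic()
        # Keep only removals that happened inside the current window.
        self.removals = [t for t in self.removals if now - t < self.window_seconds]
        if len(self.removals) >= self.max_removals:
            return False  # over budget: keep serving and alert a human instead
        self.removals.append(now)
        return True

limiter = VelocityLimiter(max_removals=2, window_seconds=10)
for target in ["i-aaa", "i-bbb", "i-ccc"]:
    action = "remove" if limiter.allow_removal() else "keep (rate-limited)"
    print(f"{target}: {action}")
```

The design choice is to fail safe: once the removal budget is spent, keep routing traffic through possibly unhealthy targets and escalate, rather than let a flapping signal empty the fleet.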
## Lambda, ECS, and Redshift: Collateral Damage
The services stacked on top of DynamoDB and EC2 also started to crack.
- Lambda functions failed to trigger. Internal systems couldn't poll SQS queues. Event sources got throttled.
- ECS, EKS, and Fargate couldn't spin up new containers—because EC2 couldn't launch the compute underneath.
- Amazon Connect, the cloud call center solution, lost the ability to handle chats, calls, or tasks. Some users got busy signals, others just dead air.
- Redshift struggled to execute queries and refresh cluster nodes. IAM integrations failed, and EC2 dependency meant some clusters got stuck in limbo.
Even the AWS Management Console couldn't escape the chaos. IAM users trying to log in hit authentication errors. In some cases, entire federated login workflows went dark.
## So What's AWS Doing About It?
To their credit, AWS didn't mince words. They acknowledged the outage, detailed the root causes, and laid out the roadmap for preventing it from happening again.
Here's what's changing:
- DynamoDB's DNS automation is disabled worldwide until the race condition is patched and new safeguards are in place.
- The DNS Enactor will get better protections to prevent overwriting valid plans with old ones.
- NLB will implement "velocity control"—basically a way to limit how quickly it pulls capacity during health check failures.
- EC2's DWFM will go through heavier stress testing and improved queue throttling logic.
That said, fixing automation problems is like patching the ship while you're sailing it. AWS is playing a delicate game of not just plugging holes, but reshaping how these foundational services talk to each other at scale.
## The Bigger Picture
If this outage taught us anything, it's that the cloud isn't some abstract invincible force. It's a complex orchestra of automated systems, and even the smallest discordant note—a race condition in a DNS enactor—can throw the whole thing into disarray.
For developers and businesses, this incident is a stark reminder: don't just design for uptime, design for failure. Redundancy across regions. Graceful fallbacks. And, when possible, human-in-the-loop mechanisms that can override automation when things go sideways.
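As one concrete, deliberately simplified example of a graceful fallback, here's a sketch that tries DynamoDB in a primary region and falls back to a second region when calls fail. The regions and table name are placeholders, it assumes a global table (or your own replication) already keeps both regions in sync, and a real failover strategy also has to reckon with writes, consistency, and cost.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Placeholder regions and table name; assumes a DynamoDB global table
# (or equivalent replication) already exists in both regions.
REGIONS = ["us-east-1", "us-west-2"]
TABLE_NAME = "orders"

def get_item_with_failover(key: dict) -> dict | None:
    """Try each region in order; short timeouts so a sick region fails fast."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
        )
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (ClientError, BotoCoreError) as exc:
            last_error = exc  # remember the failure, try the next region
    raise RuntimeError(f"All regions failed; last error: {last_error}")

# Example (requires credentials and the table to exist in both regions):
# item = get_item_with_failover({"order_id": {"S": "12345"}})
```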
AWS's N. Virginia outage wasn't the biggest cloud incident in history—but it was a revealing one. It showed how tightly coupled everything in modern cloud infrastructure has become. It exposed the trade-offs of automation. And it made clear that even the most resilient systems can have blind spots.
As for AWS, they've got their work cut out for them. The cloud's future still rests on their shoulders—but for a few hours in October, those shoulders got a little shaky.