The AWS outage post-mortem is more revealing in what it doesn’t say

Monday, November 3, 2025, 08:00 AM, from ComputerWorld
When AWS suffered a series of cascading failures that crashed its systems for hours in late October, the industry was once again reminded of its extreme dependence on major hyperscalers. (As if to prove the point, Microsoft suffered a similar collapse a few days later.)

The incident also shed an uncomfortable light on how fragile these massive environments have become. In its detailed post-mortem report, the cloud giant described a vast array of delicate systems that keep global operations functioning — at least, most of the time.

It is impressive that this combination of systems works as well as it does — and therein lies the problem. The foundation for this environment was created decades ago. And while Amazon deserves applause for how brilliant that design was at the time, the environment, scale, and complexity facing hyperscalers today are orders of magnitude beyond what those original designers envisioned.

The bolt-on patch approach is simply no longer viable. All of the hyperscalers — especially AWS — need re-architected systems, if not entirely new systems, that can support global users in 2026 and beyond.

Chris Ciabarra, the CTO of Athena Security, read the AWS post-mortem and came away uneasy.

“Amazon is admitting that one of its automation tools took down part of its own network,” Ciabarra said. “The outage exposed how deeply interdependent and fragile our systems have become. It doesn’t provide any confidence that it won’t happen again. ‘Improved safeguards’ and ‘better change management’ sound like procedural fixes, but they’re not proof of architectural resilience. If AWS wants to win back enterprise confidence, it needs to show hard evidence that one regional incident can’t cascade across its global network again. Right now, customers still carry most of that risk themselves.”

Catalin Voicu, cloud engineer at N2W Software, echoed some of the same concerns. 

“The underlying architecture and network dependencies still remain the same and will not go away unless there is an entire re-architect of AWS,” Voicu said. “AWS claims a 99.5% availability for this reason. They can put band aids on problems, but the nature of these hyperscalers is that core services call back to specific regions. This is not going to change anytime soon.”

Forrester principal analyst Brent Ellis’s interpretation of the post-mortem is that AWS — not unlike other hyperscalers — “has services that are single points of failure that are not well-documented.”

Although Ellis stressed that “AWS is doing an amazing amount of operations here,” he added that “no amount of well-architected [technology] would have shielded them from this problem.”

Ellis agreed with others that AWS didn’t detail why this cascading failure happened on that day, which makes it difficult for enterprise IT executives to have high confidence that something similar won’t happen in a month. “They talked about what things failed and not what caused the failure. Typically, failures like this are caused by a change in the environment. Someone wrote a script and it changed something or they hit a threshold. It could have been as simple as a disk failure in one of the nodes. I tend to think it’s a scaling problem.”

Ellis’s key takeaway: hyperscalers need to look seriously at major architectural changes. “They created a bunch of workarounds for the problems they encountered internally. This means that the first hyperscaler is suffering from a little bit of technical debt. Architectural decisions don’t last forever,” Ellis said. “We are hitting the point where more is required.”

Let’s dig into what AWS said. Although many reports attributed the cascading failures to DNS issues, it’s unclear how true that is. It does indeed appear that DNS systems were where the problems were first spotted, but AWS didn’t explicitly say what led to the DNS issue.

AWS said the problems started with “increased API error rates” in its US-East-1 region, which were immediately followed by a report that the AWS “Network Load Balancer (NLB) experienced increased connection errors for some load balancers.” It said the NLB problems were “caused by health check failures in the NLB fleet, which resulted in increased connection errors on some NLBs.” AWS then detected that “new EC2 instance launches failed,” followed by “some newly launched instances experienced connectivity issues.”

Bad things mushroomed from there. “Customers experienced increased Amazon DynamoDB API error rates in the N. Virginia (us-east-1) Region. During this period, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service. The incident was triggered by a latent defect within the service’s automated DNS management system that caused endpoint resolution failures for DynamoDB.”

AWS then offered this theory: “The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.”

And then this lovely thing happened: “While the Support Center successfully failed over to another region as designed, a subsystem responsible for account metadata began providing responses that prevented legitimate users from accessing the AWS Support Center. While we have designed the Support Center to bypass this system if responses were unsuccessful, in this event, this subsystem was returning invalid responses. These invalid responses resulted in the system unexpectedly blocking legitimate users from accessing support case functions.”
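The failure pattern here is worth pausing on: the bypass only triggered when the metadata subsystem failed outright, not when it answered with well-formed but wrong data. Below is a minimal, hypothetical sketch in Python of that pattern; the names and types are invented for illustration, since AWS has not published its actual logic.

```python
# Hypothetical sketch of the failure mode AWS describes: a caller that
# bypasses a dependency only when the call fails outright, not when the
# dependency returns a well-formed but invalid response.
# None of these names come from AWS; they are illustrative only.

from dataclasses import dataclass

@dataclass
class MetadataResponse:
    ok: bool          # transport-level success
    authorized: bool  # what the payload claims about the user

def fetch_account_metadata(user_id: str) -> MetadataResponse:
    # Stand-in for the degraded subsystem: the call "succeeds" but the
    # payload wrongly reports every user as unauthorized.
    return MetadataResponse(ok=True, authorized=False)

def can_access_support_center(user_id: str) -> bool:
    try:
        resp = fetch_account_metadata(user_id)
    except Exception:
        # Designed bypass: if the subsystem is unreachable, fail open.
        return True
    if not resp.ok:
        # Also bypass on an explicit transport-level failure.
        return True
    # Trap: a response that looks successful but carries bad data is trusted,
    # so legitimate users end up blocked.
    return resp.authorized

print(can_access_support_center("alice"))  # prints False: a legitimate user is blocked
```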

This section is rather long, but I want to let AWS explain this in its own words: 

“The race condition involves an unlikely interaction between two of the DNS Enactors. Under normal operations, a DNS Enactor picks up the latest plan and begins working through the service endpoints to apply this plan. This process typically completes rapidly and does an effective job of keeping DNS state freshly updated. Before it begins to apply a new plan, the DNS Enactor makes a one-time check that its plan is newer than the previously applied plan. As the DNS Enactor makes its way through the list of endpoints, it is possible to encounter delays as it attempts a transaction and is blocked by another DNS Enactor updating the same endpoint. In these cases, the DNS Enactor will retry each endpoint until the plan is successfully applied to all endpoints.

“Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing.

“Therefore, this did not prevent the older plan from overwriting the newer plan. The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied. As this plan was deleted, all IP addresses for the regional endpoint were immediately removed. Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors. This situation ultimately required manual operator intervention to correct.”
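That sequence is easier to follow as a toy simulation. The sketch below, written in Python with invented names and a made-up plan-numbering scheme (AWS has not published its implementation), replays the interleaving AWS describes: a one-time "is this plan newer?" check that goes stale, an older plan overwriting a newer one, and a clean-up pass that deletes the now-active plan, leaving the endpoint with no DNS records.

```python
# Minimal, hypothetical simulation of the race AWS describes between two
# "DNS Enactors" applying plans to a regional endpoint. Names, numbers,
# and data structures are invented for illustration; this is not AWS's code.

endpoint = {"applied_plan": 5, "records": ["10.0.0.5"]}             # plan 5 is currently live
plan_store = {5: ["10.0.0.5"], 6: ["10.0.0.6"], 12: ["10.0.0.12"]}  # generations of DNS plans

def check_is_newer(plan_id):
    # One-time check made before the (possibly slow) application begins.
    return plan_id > endpoint["applied_plan"]

def apply_plan(plan_id):
    # No re-check here: if the enactor was delayed, the earlier check is stale.
    endpoint["applied_plan"] = plan_id
    endpoint["records"] = plan_store[plan_id]

def clean_up(current_plan_id, keep_last=3):
    # Delete plans "significantly older" than the one just applied.
    for old_id in [p for p in plan_store if p <= current_plan_id - keep_last]:
        del plan_store[old_id]
        if endpoint["applied_plan"] == old_id:
            # The active plan was deleted: the endpoint now has no records.
            endpoint["records"] = []

# Enactor A passes its newness check for plan 6, then stalls on retries.
a_check_passed = check_is_newer(6)   # True at the time of the check

# Enactor B applies the much newer plan 12 and finishes quickly.
apply_plan(12)

# Enactor A finally resumes, trusting its stale check, and overwrites plan 12.
if a_check_passed:
    apply_plan(6)

# Enactor B's clean-up now deletes plan 6 as "many generations old",
# leaving the endpoint with no IP addresses and no applicable plan.
clean_up(12)
print(endpoint)   # {'applied_plan': 6, 'records': []}
```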

Near the end of the report, AWS talked about what it was doing to fix the situation: “We are making several changes as a result of this operational event. We have already disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide. In advance of re-enabling this automation, we will fix the race condition scenario and add additional protections to prevent the application of incorrect DNS plans. For NLB, we are adding a velocity control mechanism to limit the capacity a single NLB can remove when health check failures cause AZ failover. For EC2, we are building an additional test suite to augment our existing scale testing, which will exercise the DWFM recovery workflow to identify any future regressions. We will improve the throttling mechanism in our EC2 data propagation systems to rate limit incoming work based on the size of the waiting queue to protect the service during periods of high load.”
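AWS gives no implementation detail for these changes, but the last one, rate limiting incoming work based on the size of the waiting queue, is a familiar back-pressure pattern. Here is a minimal sketch of that idea in Python; all names and thresholds below are invented and assume nothing about AWS's internals.

```python
# Hypothetical sketch of queue-depth-based throttling of the general kind AWS
# says it will add to its EC2 data propagation systems: incoming work is rate
# limited according to how much work is already waiting.

from collections import deque

class BacklogAwareThrottle:
    """Admit incoming work based on how much work is already waiting."""

    def __init__(self, max_queue=1000, soft_limit=200):
        self.queue = deque()
        self.max_queue = max_queue    # hard cap: reject everything beyond this depth
        self.soft_limit = soft_limit  # above this depth, shed a growing fraction of work
        self._counter = 0

    def admit(self, item) -> bool:
        depth = len(self.queue)
        if depth >= self.max_queue:
            return False              # protect the service under extreme load
        if depth > self.soft_limit:
            # Accept roughly 1 in N requests, where N grows with the backlog.
            self._counter += 1
            accept_every = 1 + depth // self.soft_limit
            if self._counter % accept_every != 0:
                return False
        self.queue.append(item)
        return True

throttle = BacklogAwareThrottle()
accepted = sum(throttle.admit(i) for i in range(5000))
print(f"accepted {accepted} of 5000 requests while the backlog was never drained")
```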

That’s all well and good, but it feels like a series of urgent fires being put out, with no grand plan to prevent anything like the outage from happening again. Put simply, AWS appears to be fighting yesterday’s battle. 

True, these changes might prevent this exact set of problems from happening again, but there is an almost infinite number of other problems that could arise. And that situation isn’t going to get better. As volume continues to soar and complexity — hello, agentic AI — increases, trainwrecks like this one will happen with increasing frequency.

If AWS, Microsoft, Google and others find themselves so invested in their environments that they can’t do anything other than apply patches here and there, it’s time for a few clever startups to come in with a clean tech slate and build what is needed. 

The logical threat: Fix it yourselves, hyperscalers, or let some VC-funded startups do it for you.
https://www.computerworld.com/article/4082890/the-aws-outage-post-mortem-is-more-revealing-in-what-i...
