The day the cloud went dark
Friday, October 24, 2025, 11:00 AM, from InfoWorld
This week, the impossible happened—again. Amazon Web Services, the backbone of the digital economy and the world’s largest cloud provider, suffered a large-scale outage. If you work in IT or depend on cloud services, you didn’t need a news alert to know something was wrong. Productivity ground to a halt, websites failed to load, business systems stalled, and the hum of global commerce was silenced, if only for a few hours. The impact was immediate and severe, affecting everything from e-commerce giants to startups, including my own consulting business.
A quick scan of AWS’s status page confirmed that regions across the United States and Europe were reporting degraded service. Calls poured in from clients who were desperate for updates. They had invoices that couldn’t be processed, schedules that crumbled into digital dust, and so much more. I estimated over $3,000 in lost productivity for my small business alone, which is nothing compared to what some of the Fortune 500s probably faced. The real cost to businesses worldwide will likely run deep into the millions.

How and why this happened

When the screens froze and alerts began to flood in, my first thought was: Is this an accident or an attack? AWS engineering is still actively investigating the root cause. Early signs suggest a misconfiguration in network management systems during a routine infrastructure scaling process. As demand for cloud resources keeps rising—driven by everything from growing enterprise SaaS adoption to generative AI training workloads—cloud providers need to continually expand and improve their physical infrastructure. In this specific incident, a change that should have been routine caused critical routing hardware to fail, leading to a ripple effect across multiple AWS availability zones.

AWS responded quickly, rolling back changes and isolating affected components. Communications from AWS Support, while timely, were predictably technical and lacked specifics as the crisis developed. Issues with autoscaling, load balancing, and traffic routing caused downstream effects on seemingly unrelated services. It’s a reminder that, despite the focus on “resilience” and “availability zones,” cloud infrastructure is still subject to the same fundamental laws of physics and software vulnerabilities, just like anything in your own data center.

The final resolution came a few hours later, after network engineers manually rebalanced the distributed systems and verified the restoration of normal operations. Connectivity returned, but some customers reported data inconsistencies, delayed API recoveries, and slow catch-up times. The scramble to communicate with clients, reset processes, and work through the backlog served as a harsh reminder: Business continuity depends on more than hope and a robust marketing pitch from your provider.

The myth of the bulletproof SLA

Some businesses hoped for immediate remedies from AWS’s legendary service-level agreements. Here’s the reality: SLA credits are cold comfort when your revenue pipeline is in freefall. The truth that every CIO has faced at least once is that even industry-leading SLAs rarely compensate for the true cost of downtime. They don’t make up for lost opportunities, damaged reputations, or the stress on your teams. As regional outages increase due to the growth of hyperscale cloud data centers, each struggling to handle the surge in AI-driven demand, the safety net is becoming less dependable.

What’s causing this increasing fragility? The cloud isn’t a single, uniform entity. Each expansion, new data center, and technology update adds to the complexity of routing infrastructure, physical connections, and downstream dependencies. AI and machine learning workloads are known for their high compute and storage needs. Their growth only heightens pressure on these systems. Rising demand pushes operational limits, exposing cracks in an infrastructure meant to be invisible and seamless.

Be ready for the next outage

This outage is a wake-up call.
Headlines will fade, and AWS (and its competitors) will keep promising ever-improving reliability. Just don’t forget the lesson: No matter how many “nines” your provider promises, true business resilience starts inside your own walls. Enterprises must take matters into their own hands to avoid existential risk the next time lightning strikes.

First, invest in multicloud and hybrid architectures. Relying on a single provider—no matter how big—means putting all your eggs in one basket. By designing applications to be portable across clouds (AWS, Azure, Google Cloud, or even on-premises systems), businesses can switch to a secondary platform if disaster strikes. Yes, it’s complex. Yes, it costs extra. But compared to a multimillion-dollar outage, it’s a decision that pays off.

Second, automate both detection and response processes. The speed of detection and response determines who weathers the storm and who capsizes. Automated monitoring must go beyond simple system health checks to include application-level functionality and business KPIs. Systems should trigger alerts and execute runbooks that attempt recovery or at least gracefully degrade service. Human reaction time is measured in minutes; cloud failures occur in seconds.

Third, don’t just write disaster recovery plans, rehearse them. Business continuity is only possible if it is tested under realistic conditions. Enterprises that regularly simulate cloud outages—shutting off services, rerouting traffic, and even introducing chaos—are the best prepared when the real disaster hits. The muscle memory built during drills makes all the difference when the stakes are high. Staff shouldn’t be learning the playbook in real time. (Rough sketches of all three ideas appear below.)

We’ll spend weeks estimating the lost productivity from this AWS outage. For many, the cost will be great and the lessons learned too late. The only certainty is that this disruption won’t be the last. As the global digital economy expands and AI demands more bandwidth and computing power, outages are likely to become more frequent. As technologists and business leaders, we can hope for greater transparency and better tools from our cloud partners, but our best defense is a proactive resilience plan. The cloud is the future, but we must weather both stormy and sunny days.
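To make the first recommendation concrete, here is a minimal Python sketch of what portability can look like at the code level: the application talks to its own storage interface rather than a provider SDK, so a secondary backend can take over when the primary fails. The names here (BlobStore, FailoverStore, LocalDiskStore) are illustrative assumptions, not part of any vendor library; a real deployment would wrap S3, Azure Blob Storage, or Google Cloud Storage behind the same interface.

    # Minimal sketch: a provider-agnostic storage interface with failover.
    # The class names are illustrative, not from any SDK.
    from abc import ABC, abstractmethod
    from pathlib import Path

    class BlobStore(ABC):
        """What the application codes against, instead of a provider SDK."""

        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class LocalDiskStore(BlobStore):
        """Stand-in backend; a real deployment would wrap S3, Azure Blob, or GCS here."""

        def __init__(self, root: str) -> None:
            self.root = Path(root)
            self.root.mkdir(parents=True, exist_ok=True)

        def put(self, key: str, data: bytes) -> None:
            (self.root / key).write_bytes(data)

        def get(self, key: str) -> bytes:
            return (self.root / key).read_bytes()

    class FailoverStore(BlobStore):
        """Writes to both backends; reads from the primary, falls back to the secondary."""

        def __init__(self, primary: BlobStore, secondary: BlobStore) -> None:
            self.primary, self.secondary = primary, secondary

        def put(self, key: str, data: bytes) -> None:
            for backend in (self.primary, self.secondary):
                backend.put(key, data)

        def get(self, key: str) -> bytes:
            try:
                return self.primary.get(key)
            except Exception:  # primary region unreachable, for example
                return self.secondary.get(key)

    store = FailoverStore(LocalDiskStore("/tmp/primary"), LocalDiskStore("/tmp/secondary"))
    store.put("invoice-42.json", b'{"status": "pending"}')
    print(store.get("invoice-42.json"))

The design choice that matters is the interface, not the failover logic: once the application depends only on its own abstraction, swapping or doubling up providers becomes a deployment decision rather than a rewrite.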
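The second recommendation, monitoring that goes beyond host health, can be sketched just as simply. The following Python is an illustration under assumed names rather than a reference implementation: the checks and the runbook are placeholders for whatever synthetic transactions, business KPIs, and recovery actions matter to your own systems.

    # Minimal sketch: monitoring that watches application behavior and business KPIs,
    # not just host health, and triggers an automated runbook when they degrade.
    # Check and runbook names are illustrative, not tied to any monitoring product.
    import random
    import time
    from typing import Callable

    def checkout_latency_ms() -> float:
        # Placeholder: in practice, measure a synthetic transaction or read a metrics API.
        return random.uniform(100, 3000)

    def orders_per_minute() -> float:
        # Placeholder business KPI: a sudden drop often shows up before infrastructure alarms.
        return random.uniform(0, 50)

    def failover_runbook() -> None:
        # Automated response: shift traffic to a standby region or degrade gracefully.
        print("Runbook: switching traffic policy to standby region, enabling read-only mode")

    CHECKS: dict[str, Callable[[], bool]] = {
        "checkout latency": lambda: checkout_latency_ms() < 2000,
        "orders per minute": lambda: orders_per_minute() > 10,
    }

    def watch(cycles: int, interval_seconds: float = 1.0) -> None:
        for _ in range(cycles):
            failed = [name for name, ok in CHECKS.items() if not ok()]
            if failed:
                print(f"Degradation detected in {failed}; executing runbook")
                failover_runbook()
            time.sleep(interval_seconds)

    watch(cycles=3)

The point is that the loop alerts on what the business actually feels, and the first response is a machine executing a runbook, not a human reading a dashboard.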
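Finally, rehearsal. The sketch below shows the shape of a tiny "game day" drill: it injects a simulated dependency failure and verifies that the fallback path really works. The service and function names are hypothetical; the point is that the drill exercises the same code path your staff would otherwise be debugging live during an outage.

    # Minimal sketch: a game-day drill that injects a dependency failure and
    # verifies the fallback path actually works. All names are illustrative.
    from contextlib import contextmanager

    def fetch_prices_from_cloud() -> dict[str, float]:
        """Normally calls a cloud-hosted pricing service."""
        return {"widget": 9.99}

    def get_prices() -> dict[str, float]:
        """Production path with a deliberate fallback to cached data."""
        try:
            return fetch_prices_from_cloud()
        except ConnectionError:
            return {"widget": 9.49}  # last-known-good cache

    @contextmanager
    def simulated_outage():
        """Chaos injection: make the cloud dependency fail for the duration of the drill."""
        global fetch_prices_from_cloud
        original = fetch_prices_from_cloud

        def broken(*args, **kwargs):
            raise ConnectionError("simulated regional outage")

        fetch_prices_from_cloud = broken
        try:
            yield
        finally:
            fetch_prices_from_cloud = original

    # The drill itself: run the real code path while the dependency is down.
    with simulated_outage():
        prices = get_prices()
        assert prices, "fallback returned nothing; the playbook needs work"
        print("Drill passed: served cached prices during the simulated outage")

Run something like this on a schedule, against real environments where you safely can, and the playbook stops being a document and starts being muscle memory.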
https://www.infoworld.com/article/4077606/the-day-the-cloud-went-dark.html