Snowflake software update caused 13-hour outage across 10 regions
Friday, December 19, 2025, 03:36 PM, from InfoWorld
A software update knocked out Snowflake’s cloud data platform in 10 of its 23 global regions for 13 hours on December 16, leaving customers unable to execute queries or ingest data.
Customers saw “SQL execution internal error” messages when trying to query their data warehouses, according to Snowflake’s incident report. The outage also disrupted Snowpipe and Snowpipe Streaming file ingestion, and data clustering appeared unhealthy.

“Our initial investigation has identified that our most recent release introduced a backwards-incompatible database schema update,” Snowflake wrote in the report. “As a result, previous release packages errantly referenced the updated fields, resulting in version mismatch errors and causing operations to fail or take an extended amount of time to complete.”

The outage affected customers in Azure East US 2 in Virginia, AWS US West in Oregon, AWS Europe in Ireland, AWS Asia Pacific in Mumbai, Azure Switzerland North in Zürich, Google Cloud Platform Europe West 2 in London, Azure Southeast Asia in Singapore, Azure Mexico Central, and Azure Sweden Central, the report said.

Snowflake initially estimated that service would be restored by 15:00 UTC that day, but later revised the estimate to 16:30 UTC as the Virginia region took longer than expected to recover.

The company offered no workarounds during the outage beyond recommending failover to non-impacted regions for customers with replication enabled. It said it will share a root cause analysis (RCA) document within five working days. “We do not have anything to share beyond this for now,” the company said.

Why multi-region architecture failed to protect customers

The type of failure that hit Snowflake, a backwards-incompatible schema change causing multi-region outages, represents a consistently underestimated failure class in modern cloud data platforms, according to Sanchit Vir Gogia, chief analyst at Greyhound Research. Schema and metadata sit in the control-plane layer that governs how services interpret state and coordinate behavior across geographies, he said.

“Regional redundancy works when failure is physical or infrastructural. It does not work when failure is logical and shared,” Gogia said. “When metadata contracts change in a backwards-incompatible way, every region that depends on that shared contract becomes vulnerable, regardless of where the data physically resides.”

The outage exposed a misalignment between how platforms test and how production actually behaves, Gogia said. Production involves drifting client versions, cached execution plans, and long-running jobs that cross release boundaries. “Backwards compatibility failures typically surface only when these realities intersect, which is difficult to simulate exhaustively before release,” he said.

The incident also raises questions about Snowflake’s staged deployment process. Staged rollouts are widely misunderstood as containment guarantees when they are actually probabilistic risk-reduction mechanisms, Gogia said. Backwards-incompatible schema changes often degrade functionality gradually as mismatched components interact, allowing the change to propagate across regions before detection thresholds are crossed, he said.

Snowflake’s release documentation describes a three-stage deployment approach that “enables Snowflake to monitor activity as accounts are moved and respond to any issues that may occur.” The documentation states that “if issues are discovered while moving accounts to a full release or patch release, the release might be halted or rolled back,” with follow-up typically completed within 24 to 48 hours.
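The class of guard this points to can be made concrete. The sketch below is a minimal, hypothetical pre-release compatibility check, assuming an imaginary pipeline that knows which schema fields the still-deployed release reads; the names, versions, and fields are invented for illustration and this is not Snowflake’s actual tooling. The idea is simply that a proposed schema is compared against what the previous release package still references, which is the contract the incident report says was broken.

```python
# Hypothetical pre-release guard: flag schema changes that would break
# packages still running the previous release. All names and fields here
# are illustrative, not Snowflake's internal schema or tooling.

from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaVersion:
    version: str
    fields: frozenset  # field names the metadata schema exposes in this version


def breaking_changes(previous: SchemaVersion, proposed: SchemaVersion,
                     referenced_by_previous_release: set) -> set:
    """Fields the old release still reads that the new schema drops or renames."""
    removed = previous.fields - proposed.fields
    return set(removed & referenced_by_previous_release)


if __name__ == "__main__":
    prev = SchemaVersion("8.44", frozenset({"job_id", "warehouse", "cluster_state"}))
    # The proposed release renames cluster_state to cluster_status.
    new = SchemaVersion("8.45", frozenset({"job_id", "warehouse", "cluster_status"}))

    # Fields the still-deployed previous release is known to query.
    in_use = {"job_id", "cluster_state"}

    broken = breaking_changes(prev, new, in_use)
    if broken:
        raise SystemExit(f"Backwards-incompatible change, halt rollout: {sorted(broken)}")
    print("Schema change is additive for live consumers; safe to stage.")
```

Run as written, the check fails the rollout because the old release still reads the renamed field, which is the kind of mismatch that, per the incident report, instead surfaced at query time across regions.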
The December 16 outage affected 10 regions simultaneously and lasted well beyond that window. “When a platform relies on globally coordinated metadata services, regional isolation is conditional, not absolute,” Gogia said. “By the time symptoms become obvious, rollback is no longer a simple option.”

Rollback is hard because while code can be rolled back quickly, state cannot, Gogia said. Schema and metadata changes interact with live workloads, background services, and cached state, and reversing them requires time, careful sequencing, and validation to avoid secondary corruption.

Security breach and outage share common weakness

The December outage, combined with Snowflake’s security troubles in 2024, should fundamentally change how CIOs define operational resilience, according to Gogia. In mid-2024, approximately 165 Snowflake customers were targeted by criminals using stolen credentials harvested by infostealer infections.

“These are not separate incidents belonging to different risk silos. They are manifestations of the same underlying issue: control maturity under stress,” Gogia said. “In the security incidents, stolen credentials exploited weak identity governance. In the outage, a backwards-incompatible change exploited weak compatibility governance.”

CIOs need to move beyond compliance language and uptime averages and ask behavioral questions about how platforms behave when assumptions fail, Gogia said. “The right questions are behavioral: How does the platform behave when assumptions fail? How does it detect emerging risk? How quickly can blast radius be constrained?”
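Gogia’s distinction between rolling back code and rolling back state maps onto the familiar expand/contract migration pattern. The sketch below is a general illustration of that sequencing, with hypothetical region and version identifiers, and is not a description of Snowflake’s internal release process: the destructive “contract” step is held back until every region reports the new release, so a code rollback never has to reverse shared metadata mid-flight.

```python
# General expand/contract sequencing sketch (not Snowflake's internal process):
# destructive schema steps run only after every live consumer is confirmed on
# the new release, so rolling back code does not require reversing shared state.

from enum import Enum, auto


class Phase(Enum):
    EXPAND = auto()    # add new fields alongside the old ones; both releases work
    MIGRATE = auto()   # move consumers to the new fields, region by region
    CONTRACT = auto()  # drop old fields once no consumer references them


def next_phase(current: Phase, consumer_versions: dict, required_version: str) -> Phase:
    """Advance to CONTRACT only when every region reports the required release."""
    everyone_migrated = all(v == required_version for v in consumer_versions.values())
    if current is Phase.EXPAND:
        return Phase.MIGRATE
    if current is Phase.MIGRATE and everyone_migrated:
        return Phase.CONTRACT
    return current  # hold: contracting now would strand older release packages


if __name__ == "__main__":
    regions = {"aws-us-west": "8.45", "azure-eastus2": "8.44", "gcp-europe-west2": "8.45"}
    phase = next_phase(Phase.MIGRATE, regions, required_version="8.45")
    print(phase)  # Phase.MIGRATE: one region is still on 8.44, so the old fields stay
```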
https://www.infoworld.com/article/4109586/snowflake-software-update-caused-13-hour-outage-across-10-...