Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections

Tuesday June 17, 2025. 02:50 AM , from Slashdot

Google Cloud has attributed last week's widespread outage to a flawed code update in its Service Control system that triggered a global crash loop due to missing error handling and lack of feature flag protection. The Register reports: Google's explanation of the incident opens by informing readers that its APIs, and Google Cloud's, are served through our Google API management and control planes.' Those two planes are distributed regionally and 'are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints.' The core binary that is part of this policy check system is known as 'Service Control.'

On May 29, Google added a new feature to Service Control, to enable 'additional quota policy checks.' 'This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code,' Google's incident report explains. The search monopolist appears to have had concerns about this change as it 'came with a red-button to turn off that particular policy serving path.' But the change 'did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.'

Google uses feature flags to catch issues in its code. 'If this had been flag protected, the issue would have been caught in staging.' That unprotected code ran inside Google until June 12th, when the company changed a policy that contained 'unintended blank fields.' Here's what happened next: 'Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.'

Google's post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes. But in some larger Google Cloud regions, 'as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on... overloading the infrastructure.' Service Control wasn't built to handle this, which is why it took almost three hours to resolve the issue in its larger regions. The teams running Google products that went down due to this mess then had to perform their own recovery chores. Going forward, Google has promised a couple of operational changes to prevent this mistake from happening again: 'We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.'

Read more of this story at Slashdot.