Why is enterprise disaster recovery always such a…disaster?
Thursday, December 4, 2025, 05:39 PM, from ComputerWorld
One of the brutal truths about enterprise disaster recovery (DR) strategies is that there is virtually no reliable way to truly test them. Companies can certainly test the mechanics — but until disaster strikes, the recovery plan is activated, and 300,000 workers and millions of customers start interacting with it, all bets are off.
Frank Trovato, a principal advisory director at Info-Tech Research Group, said environmental changes are a big part of the reason most disaster recovery arrangements fail in the real world. “The exponential growth of SaaS has changed how organizations need to address DR and overall resilience,” Trovato said. “They have no control over SaaS outage recovery. They can’t failover to their own warm standby and that is a huge vulnerability. For example, if M365 has an outage, that impacts an organization’s primary means of internal and external communications, not to mention the project work maintained in MS Teams channels, SharePoint, and OneDrive. And the organization is often just waiting for the vendor to resolve the outage, with no control over the situation.”

Put simply: third-party risk is not addressed by hoping those third parties do what they are supposed to do. “Don’t assume your SaaS vendor is following backup and DR best practices,” Trovato said. “Often, a SaaS vendor will just rely on the resilience of the cloud platform where they host. No matter what uptime numbers Azure, AWS, or GCP might claim, they all have outages and they all carry the risk of data loss.”

From a corporate politics perspective, IT managers responsible for disaster recovery also have plenty of reasons to avoid an especially meaningful test. Look at it from a risk/reward perspective: they are gambling that any disaster requiring the recovery environment won’t happen for a few years, and by then, with any luck, they’ll be long gone.

If a truly meaningful test were performed — something that might not even be possible — there are two likely outcomes: the environment either works well or it doesn’t. If everything works well, the manager is unlikely to get a bonus for simply having done their job properly. But if the environment fails, all kinds of bad things could happen. Why force such a situation if you can avoid it?

“People are afraid to test those strategies, to test the workflows they have put in place,” said Forrester Principal Analyst Brent Ellis. “They are not confident that the DR strategies they have in place will be effective.”

“CIOs can’t simulate 300,000 employees during an outage, so the only test that matters is whether critical applications can survive weeks of isolation,” said Emma Technologies CEO Dmitry Panenkov. “Today, most can’t.”

Sanchit Vir Gogia, the chief analyst at Greyhound Research, agreed that the results of such tests are unlikely to be career-advancing. “Enterprises place too much trust in DR strategies that look complete on slides but fall apart when chaos hits,” he said. “The misunderstanding starts with how recovery is defined. It’s not enough for infrastructure to come back online. What matters is whether the business continues to function — and most enterprises haven’t closed that gap.

“Plans are built to satisfy auditors, not to handle real failure. Testing is typically shallow, skipped for months and almost always sanitized. IT confirms that the systems respond, but nobody’s watching what happens when thousands of users behave unpredictably, service layers interlock, and decision-makers scramble under pressure. The scenarios must be realistic: full failovers, real traffic, cross-team participation, and failback included.

“If it doesn’t mimic the messiness of real life, it isn’t preparing the organization for anything useful,” Gogia said.
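One way to make a test less sanitized, in the spirit Gogia describes, is to script checks that assert user-visible behavior in the recovery environment rather than merely confirming that systems respond. The sketch below illustrates the idea; the recovery-site URL, test account, and endpoints are hypothetical stand-ins, not the API of any product mentioned in this article.

```python
"""Minimal sketch: validate a recovery environment at the service level
(can a user sign in and complete a core transaction?) instead of only
checking that hosts respond. All URLs and credentials are hypothetical."""
import sys

import requests

DR_BASE = "https://dr.example.com"  # hypothetical recovery-site base URL


def can_authenticate() -> bool:
    """A green dashboard means little if logins loop; attempt a real sign-in."""
    try:
        resp = requests.post(
            f"{DR_BASE}/api/login",
            json={"user": "dr-test", "password": "dr-test-secret"},  # test account
            timeout=10,
        )
        return resp.status_code == 200 and "token" in resp.json()
    except (requests.RequestException, ValueError):
        return False


def can_transact() -> bool:
    """Exercise one business-critical path end to end, not just a ping."""
    try:
        resp = requests.get(f"{DR_BASE}/api/orders/smoke-test", timeout=10)
        return resp.status_code == 200 and resp.json().get("status") == "ok"
    except (requests.RequestException, ValueError):
        return False


def main() -> int:
    checks = {"user sign-in": can_authenticate, "core transaction": can_transact}
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        print("DR validation FAILED:", ", ".join(failures))
        return 1
    print("DR validation passed: users can sign in and transact")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

A real exercise goes much further, with live traffic and cross-team participation, but even a check this small shifts the question from “did the servers boot?” to “can anyone actually work?”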
Gogia noted that major disaster recovery testing tools play the same game. “Most DR tools, even DRaaS, only protect fragments of the IT estate,” he said. “They’re scoped narrowly to fit budget or ease of implementation, not to guarantee holistic recovery. Cloud-heavy environments make things worse when teams assume resilience is built in, but haven’t configured failover paths, replicated across regions, or validated workloads post-failover. Sovereign cloud initiatives might address geopolitical risk, but they rarely address operational realism.

“The biggest flaw in DR testing is the assumption that infrastructure equals service,” he said. “Most enterprises test [whether] systems can boot, data can be restored, and dashboards stay green. But that’s not recovery. It’s a static checklist. In a real disaster, systems may come online [but] users still cannot access them. Authentication loops, misrouted traffic, unclear communication, and panicked behavior all get in the way. What tests miss is the entropy created when 100,000 users act on partial information and internal teams scramble across fragmented processes.”

The problems with most enterprise disaster recovery strategies get even worse. A popular tactic is to leverage competing hyperscalers on the rationale that even if, for example, Google fails, what are the odds that Microsoft and AWS will also fail at the same time? But that theory is flawed, because it fails to realistically factor in shared third-party reliance. Let’s not forget the dependency lessons from Cloudflare and CrowdStrike.

To be fair, many enterprises do require that all vendors bidding for business disclose all of the vendors they’re using. That’s a good start. But too many departments leave it at that. They need to further demand that those bidding vendors describe their contingency plans if any of those suppliers fail, explain what would likely happen if they did, and offer reimbursement if they do. Those bidding vendors “definitely need to have some skin in the game,” said Forrester’s Ellis.

Panenkov agreed: “You need to understand all of the interdependencies, especially what will happen to your enterprise if anything changes, if the third parties change or introduce a new service.”

The disaster recovery problem was recently illustrated by a Microsoft deal with SAP, Capgemini, and Orange for a DR platform for European users, to be used if Microsoft is ever ordered to stop supporting that region. It turns out that Microsoft, minus all of the key Microsoft services, isn’t that useful.

Ellis pointed to yet another problem with enterprise plans: line-of-business (LOB) empowerment. This is the recent trend in which LOB chiefs are given extensive leeway to experiment with and deploy their own technology, especially the latest generative AI tools and agents. But because LOB execs often fail to inform IT of every tech move they make, empowering these business units to go their own way can leave gaps, according to Robert Kramer, vice president and principal analyst at Moor Insights & Strategy.

The LOB autonomy problem “is gigantic,” Kramer said. “You can’t build a bridge if you don’t know where it is going to go. It is a problem when these lines of business go in their own direction.”
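The shared-dependency trap that Ellis and Panenkov describe can at least be surfaced mechanically: collect each provider’s disclosed third parties and flag any supplier that appears behind more than one supposedly independent vendor. The sketch below is a minimal illustration of that idea; the vendor names and dependency lists are hypothetical examples, not a real inventory.

```python
"""Minimal sketch: flag third parties shared across vendors that are assumed
to be independent. The inventory is hypothetical; in practice it would come
from the vendor disclosures gathered during bidding."""
from collections import defaultdict

# Hypothetical inventory: vendor -> third parties it discloses relying on.
VENDOR_DEPENDENCIES = {
    "primary-cloud": {"dns-provider-x", "cdn-provider-y", "auth-provider-z"},
    "backup-cloud": {"dns-provider-x", "cdn-provider-q"},
    "saas-crm": {"primary-cloud", "auth-provider-z"},
}


def shared_dependencies(inventory: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return every third party used by two or more vendors, along with the
    vendors that depend on it: the hidden single points of failure."""
    usage = defaultdict(list)
    for vendor, deps in inventory.items():
        for dep in deps:
            usage[dep].append(vendor)
    return {dep: vendors for dep, vendors in usage.items() if len(vendors) > 1}


if __name__ == "__main__":
    for dep, vendors in shared_dependencies(VENDOR_DEPENDENCIES).items():
        print(f"{dep} is a shared dependency of: {', '.join(sorted(vendors))}")
```

Even a crude inventory like this makes visible the kind of shared reliance that the Cloudflare and CrowdStrike episodes exposed, and it only works if line-of-business deployments are in the inventory in the first place.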
And even when disaster recovery systems work correctly, Ellis said, IT leaders often plan poorly for downtime because the workforce is unfamiliar with the backup systems and software. He offered a scenario in which Microsoft goes down and the system launches Google Docs as a temporary fix. “The environment is different and there is a steep slowdown in productivity when you failover,” Ellis said. “When you get to SaaS platforms, that is where things get really difficult.”

The most obvious issue for even well-run disaster recovery programs is that it’s impossible to know the nature of the disaster that will trigger their use. Is it a routine internet outage? An earthquake, tornado, or tidal wave that disrupts virtually all power and wireless communications? A military or terrorist attack? How local is it? Is a proper precaution to have a mirrored environment 500 miles away — or 1,000? Overseas? In which countries? For that matter, with all of the talk about building massive data centers in space from the likes of Google and Amazon founder Jeff Bezos, should disaster recovery options include facilities in the cosmos?

In the meantime, we’re left here on Earth, where too many IT leaders are just hoping that any big trouble will simply happen on someone else’s watch. Sounds like a disaster waiting to happen.
https://www.computerworld.com/article/4101198/why-is-enterprise-disaster-recovery-always-such-adisas...