McDonald’s serves up a master class in how not to explain a system outage

Monday, April 1, 2024, 08:00 AM, from Computerworld

The global outage that last month prevented McDonald’s from accepting payments prompted the company to release a lengthy statement that should serve as a master class in how not to report an IT problem. It was vague and misleading, yet its wording still allowed many of the technical details to be figured out.

(You know you’ve moved far from home base when Burger King UK makes fun of you. In response to news of the McDonald’s outage, Burger King played off the McDonald’s slogan by posting on LinkedIn: “Not Loving I.T.”)

The McDonald’s statement was vague about what happened, but it did opt to throw the chain’s point-of-sale (POS) vendor under the bus — while not identifying which vendor it meant. Classy.

The statement, issued shortly after the outage began — but before it had ended — said: “Notably, this issue was not caused by a cybersecurity event; rather, it was caused by a third-party provider during a configuration change.” A few hours later, it quietly changed that sentence by adding the word “directly,” as in “was not directly caused by a cybersecurity event.”

That insert raised all kinds of issues. Technically, it meant that there absolutely was a “cybersecurity event” somewhere — presumably not affecting McDonald’s or its POS provider — that somehow played a role in the outage. The most likely scenario is that either McDonald’s or the POS provider learned of an attack elsewhere (quite possibly multiple attacks) that leveraged a POS hole that also existed in the McDonald’s environment.

One of the two then decided to implement an emergency fix. And due to insufficient or non-existent testing of the patch, the company’s systems crashed. That would explain how the outage could have been indirectly caused by a cybersecurity event.

Let’s go back to the statement, where we find more breadcrumbs about what likely happened. In it, McDonald’s Global CIO Brian Rice said: “At approximately midnight CDT on Friday, McDonald’s experienced a global technology system outage, which was quickly identified and corrected. Many markets are back online, and the rest are in the process of coming back online. We are closely working with those markets that are still experiencing issues.”

At first glance, those sentences appear to contradict each other. One says the outage was “quickly identified and corrected” and the next says that some markets are still coming back online. If it had actually been quickly corrected, why were so many systems still offline at the time of the statement?

The answer that seems to explain the contradiction is DNS. That would explain how the problem could have been “corrected” at the source while the correction had not yet reached everyone. DNS changes need time to propagate because resolvers keep serving cached records until they expire, and given the far-flung geographies affected (including the United States, Germany, Australia, Canada, China, Taiwan, South Korea and Japan), the one- to two-day delay that hit some areas is just about what would be expected with a DNS issue.

As for throwing a vendor under the bus, consider the chain’s second update, which said: “In the coming days, we will be analyzing the issue and pushing for accountability across our teams and third-party vendors.” That’s fine. But the day before, the statement said that the outage “was caused by a third-party provider during a configuration change.”

The incident was only hours old and the company wanted to be clear that it was the vendor’s fault. Methinks, Ronald, thou dost protest too much. Who hired the vendor? Whose IT team was managing that vendor? Did the McDonald’s IT team tell the vendor to fix it immediately? Was there an implication that if they cut a few procedural corners to make it happen, no one would ask questions?

This line might be warranted if the third party went rogue and made changes itself without asking McDonald’s. But that seems highly unlikely. And if it were true, wouldn’t McDonald’s have said so directly? Also, there’s a certain oddness to throwing someone under the bus while keeping the company’s identity secret. You don’t get points for blaming someone and then not saying who is being blamed.

Then there is the franchisee factor at play here. McDonald’s doesn’t own many of its restaurants, but it does impose strict requirements on franchisees, including that they must use the POS system McDonald’s has chosen. (♩ ♪ ♫ ♬You deserve a break today, so we broke our POS, you can’t pay!♩ ♪ ♫ ♬)

Note: Computerworld reached out to McDonald’s for comment hours after the initial statement was issued. No one replied.

Mike Wilkes, director of cyber operations at The Security Agency, was one of several security people who saw DNS as the most likely culprit.

“This looks like it was a DNS failure that turned into a global outage, a configuration error,” he said. “It was probably an insufficiently tested patch or a fat-fingered patch.” Wilkes noted that the outage did not impact the McDonald’s mobile app, which — if true — is another clue to what happened.

The delay likely stemmed not merely from the time DNS needs to propagate, but also from McDonald’s needing to send the change via different DNS resolvers. “This was likely a DNSSEC (Domain Name System Security Extensions) change intended to improve their security,” Wilkes said.

Wilkes also suspected that a TTL (time to live) setting played a role. “No one likely had time to lower the TTL to have a recovery time of five minutes,” he said, which would further explain the lengthy delays.
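To make the TTL point concrete, here is a minimal Python sketch of why a DNS fix can be “corrected” at the source and still take hours to reach everyone. It is a simplified model, not a reconstruction of the McDonald’s incident: the resolver names, fetch times and TTL values are invented for illustration, and the model assumes each caching resolver refetches a record the moment its cached copy expires. It also ignores complications such as negative caching and DNSSEC validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolverCache:
    """Hypothetical caching resolver that refetches a record when its TTL expires."""
    name: str
    first_fetch: int  # seconds after the bad change when it first cached the bad record
    ttl: int          # seconds the cached answer may be reused


def sees_fix_at(cache: ResolverCache, fixed_at: int) -> int:
    """Return the time of the first refetch at or after the fix.

    Until that refetch, the resolver keeps serving the stale (bad) answer.
    """
    if fixed_at <= cache.first_fetch:
        return cache.first_fetch
    elapsed = fixed_at - cache.first_fetch
    refetches = -(-elapsed // cache.ttl)  # ceiling division on integers
    return cache.first_fetch + refetches * cache.ttl


def recovery_delays(caches, fixed_at):
    """Map each resolver name to the seconds it stays stale after the fix."""
    return {c.name: sees_fix_at(c, fixed_at) - fixed_at for c in caches}


# Illustrative values only: the fix lands one hour after the bad change.
FIXED_AT = 3600
CACHES = [
    ResolverCache("short-ttl", first_fetch=100, ttl=300),     # five-minute TTL
    ResolverCache("hour-ttl", first_fetch=1800, ttl=3600),    # one-hour TTL
    ResolverCache("day-ttl", first_fetch=50, ttl=86400),      # one-day TTL
]

if __name__ == "__main__":
    for name, delay in recovery_delays(CACHES, FIXED_AT).items():
        print(f"{name}: stale for {delay} s after the fix")
```

In this toy model the five-minute resolver recovers 100 seconds after the fix, while the one-day resolver keeps serving the bad answer for roughly 23 hours. The stale period is always shorter than the TTL attached to the bad record, which is why a short TTL set ahead of a risky change caps the recovery time.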

Terry Dunlap, co-founder and managing partner of Gray Hat Academy, also believed the McDonald’s outage appeared to be an attempt to quickly block a potentially imminent attack. “They were saying ‘Give me a life vest. I don’t want to be drowned by the wave that is coming.’”

More strategically, Dunlap was not a fan of the statements McDonald’s issued.

“It’s much better to be proactive and as detailed as possible upfront,” he said. “I don’t think that the statements conveyed the level of warm and fuzzies needed. I would recommend going into more details. How did you respond to it? Why did it happen? What impacts have occurred that you are not telling me? (The McDonald’s statements) create more questions than answers.”

This appropriately raises yet again the enterprise risk coming from third parties, especially those that, as might be the case with McDonald’s, act on their own and cause problems for the enterprise IT team.

“Every company is being flyspecked for their third-party risk management right now,” said Brian Levine, a managing director with Ernst & Young (EY). “Third-party risk management is increasingly being put under the microscope today by courts, regulators and companies.”

McDonald’s did not initially file an SEC report on the incident. Given that Wall Street did not react in any serious way to the McDonald’s outage, it’s unlikely McDonald’s would consider the outage material. As for the third-party POS provider, it’s unclear whether it filed a report as its identity has yet to be confirmed.

Among the important lessons here for all enterprise IT is to give careful thought to outage statements. Anything beyond “Something happened. We are investigating and will report more once facts are known and verified” is going to leave clues.

Vague implications are not your friend. If you are ready to say something, say it. If you are not, say nothing. Splitting the middle as McDonald’s did won’t likely serve your long-term interests (not unlike eating McDonald’s food). But at least a quarter-pounder tastes good and is filling.

The McDonald’s outage statement was neither.
