Apple Network Failure Destroys an Afternoon of Worldwide Mac Productivity

Somewhere around 3:25 PM Eastern Standard Time on 12 November 2020, my 27-inch iMac running macOS 10.15.7 Catalina started to behave oddly, displaying the dreaded “spinning pizza of death” wait cursor when trying to perform operations that are typically lightning fast. I decided to reboot, as one does. Interestingly, Josh Centers had just told me that he was rebooting his iMac as well because, as he said in the TidBITS Slack, “Mojave has gotten a little wonky.”

Rebooting didn’t fix anything and, in fact, made things worse because then we couldn’t launch any non-Apple apps on our Macs. Mail and Safari launched fine, but other apps did not. Clicking an app icon in the Dock did nothing other than cause our Macs to make an unfriendly “ding” noise.

We hadn’t yet discussed our mutual Mac headaches, and Josh had become convinced that something had gone wrong with his iMac’s SSD, so he booted into Recovery Mode and began running diagnostics. Then he switched to his MacBook Pro and found that it wouldn’t launch apps either.

Josh’s next message in Slack (from his iPhone) was:

This is extremely weird. I can’t launch Slack or Firefox on either of my Macs. Is anyone else seeing something like this?

To which I replied (from my iPhone):

I just rebooted due to my iMac being a little weird, and none of my login items launched. I was able to launch the App Store app, and some updates are downloading. Preview launches, but neither Firefox nor Slack do.

Josh checked Twitter and found a post from developer Jeff Johnson, the guy behind StopTheMadness (which improves the Web browser experience), Link Unshortener (which reveals the destination of shortened links), and Underpass (for peer-to-peer file transfer and chat with end-to-end encryption). Johnson’s tweet, which went viral, explained what was happening: the macOS trustd process was trying and failing to connect to a server called ocsp.apple.com.

Jeff Johnson's tweet about the ocsp.apple.com problem

Non-Apple apps actually were launching, but only after their attempts to connect to ocsp.apple.com timed out. A successful connection to ocsp.apple.com is not required for apps to launch, which is why you can launch apps while entirely offline. That’s why Johnson suggested blocking ocsp.apple.com using Little Snitch or another firewall, or just disconnecting from the Internet whenever you wanted to launch an app.

Shortly after that, others offered the more straightforward solution of adding a line to /etc/hosts, the file that maps hostnames to IP addresses and takes precedence over DNS. If you pointed ocsp.apple.com to 127.0.0.1 or 0.0.0.0 in /etc/hosts, connections to ocsp.apple.com failed instantly, returning the Mac to normal operating status. I’m not providing those instructions here because they’re no longer necessary and, in general, messing with /etc/hosts isn’t something you should do unless you already understand how it works. If you did edit /etc/hosts in this way, you should remove that line; Brian Matthews provided a command-line recipe for that in TidBITS Talk.
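For readers comfortable with scripting, here is a minimal Python sketch of the cleanup idea. It is not Brian Matthews's recipe, and the function name strip_host_overrides is purely illustrative. It works only on text you hand it and never touches /etc/hosts itself, so you can inspect the result before deciding what to do with it.

```python
def strip_host_overrides(hosts_text: str, hostname: str = "ocsp.apple.com") -> str:
    """Return hosts-file text with any mapping for `hostname` removed.

    Hypothetical helper for illustration; it never reads or writes /etc/hosts.
    """
    target = hostname.lower()
    kept = []
    for line in hosts_text.splitlines():
        # In hosts-file syntax, everything after '#' is a comment.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the remaining fields are hostnames.
        names = [n for n in fields[1:] if n.lower() != target]
        if len(names) == len(fields[1:]):
            kept.append(line)  # line does not mention the hostname: keep as-is
        elif names:
            # Other hostnames share the line: keep them, drop only the target.
            kept.append(" ".join([fields[0]] + names))
        # Otherwise the line mapped only the target hostname, so it is dropped.
    return "".join(entry + "\n" for entry in kept)


sample = (
    "127.0.0.1 localhost\n"
    "0.0.0.0 ocsp.apple.com\n"
    "# 127.0.0.1 ocsp.apple.com (already commented out)\n"
)
print(strip_host_overrides(sample), end="")
```

In the sample, the localhost entry and the commented-out line survive, and only the active ocsp.apple.com override is removed.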

After an hour or so, Apple fixed the problem, and everything returned to normal.

How Did This Happen?

So what was going on? As I understand it, at app launch, Apple’s Gatekeeper technology checks the certificates that Apple assigns to developers to sign their code. The name of the Apple server in question, ocsp.apple.com, points to Apple using OCSP (Online Certificate Status Protocol) to determine if an app’s certificate has been revoked. If that’s the case, macOS prevents the app from launching; it’s Apple’s way of ensuring that it can prevent an app discovered to be malicious from causing more damage. (You may remember that HP just suffered from self-inflicted problems after it unintentionally revoked a certificate; see “Code-Signing Snafu Breaks Many HP Printers,” 26 October 2020.)
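To make that behavior concrete, here is a small Python model of a "soft-fail" revocation check as I've described it. It is a sketch under stated assumptions, not Apple's implementation: the function may_launch and the four responder stand-ins are hypothetical, and nothing here contacts a real server. The status strings "good" and "revoked" follow OCSP's own terminology.

```python
from typing import Callable


def may_launch(query_status: Callable[[], str]) -> bool:
    """Soft-fail policy: block only on an explicit 'revoked' answer.

    Hypothetical model; `query_status` stands in for the network request.
    """
    try:
        status = query_status()
    except OSError:
        # Offline, connection refused, or timed out: there is no answer,
        # so the launch proceeds. TimeoutError and ConnectionRefusedError
        # are both subclasses of OSError.
        return True
    return status != "revoked"


def responder_good() -> str:
    return "good"


def responder_revoked() -> str:
    return "revoked"


def responder_unreachable() -> str:
    # Models being offline or blocking the host: the failure is immediate.
    raise ConnectionRefusedError("blocked or offline")


def responder_hung() -> str:
    # Models the outage: a real request would raise only after a long wait.
    raise TimeoutError("no reply before the deadline")


for name, responder in [
    ("good", responder_good),
    ("revoked", responder_revoked),
    ("unreachable", responder_unreachable),
    ("hung", responder_hung),
]:
    print(f"{name}: launch allowed = {may_launch(responder)}")
```

The unreachable and hung cases end the same way, with the app allowed to launch. The difference is how long the answer takes to arrive: a refused connection fails at once, while a hung server makes the caller wait out the full timeout, which is the delay everyone experienced.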

What prevented ocsp.apple.com from responding? I doubt Apple will ever share details, and heads may already have rolled, but my understanding is that the massive load from releasing macOS 11 Big Sur resulted in the failure of a CDN—a content delivery network—that Apple uses to handle such situations (this particular one appears to be run by Akamai Technologies, which is not unusual). Since Big Sur weighs in at 12 GB, compared to 8 GB for Catalina, it’s not entirely surprising that the load would be much higher. Plus, of course, Apple has sold millions more Macs in the last year.

Support for this theory comes from the fact that other Apple services were down that day as well. Apple’s System Status page showed problems with Apple Card, Apple Pay, iMessage, macOS Software Update (those Big Sur downloads), and Maps.

Apple System Status page during the debacle

Apple Caused a Massive Waste of Time

It’s hard to overstate the effect this problem had on the Mac world. Although Josh and I were able to get our iMacs working properly again reasonably quickly, the rest of our afternoon disappeared into trying to figure out what was happening. In the MacAdmins Slack, IT admins and consultants were doing the same, not just because of their personal Macs but also because they were being deluged with calls, email messages, and trouble tickets from their users and clients. Developers received bug reports demanding fixes, and the problem disrupted many online presentations, meetings, and conferences taking place during that time. A Hacker News thread about the problem garnered over 1150 comments, including some from Mac users who, like Josh, wasted significant time with troubleshooting, worried that their Macs had suffered a hardware failure.

Apple may not have actually taken every Mac in the world offline, but this network failure wasted several hours of time for what must have been millions of Mac users. (I suspect that people who weren’t attempting to launch apps during this time might not have noticed.) Nothing will give us that time back, but an acknowledgment and apology would be welcome.

This debacle also threw a spotlight on what seems like a weak point in macOS. It’s clear that Apple designed trustd to fail silently and gracefully when a Mac is offline, but why is there such a long timeout in the event of a network failure? Are there other components of macOS that make similar checks in everyday usage that could hurt the user experience in error conditions?

As always, the question of security comes up as well. We’ve just learned that ocsp.apple.com is a weak link in the normal functioning of macOS. It’s obviously not a single overworked server under someone’s desk—the entire point of using the Akamai CDN is to make it possible to handle massive amounts of traffic—but I assume that malicious actors are investigating how to launch a denial-of-service attack against ocsp.apple.com.

There may also be some privacy implications, since the checks to ocsp.apple.com whenever you launch a non-Apple app could reveal information about you to someone who could access your network. That seems a little overblown to me; someone who can access your network has a lot more than OCSP traffic to work with. Still, Apple’s OCSP checks don’t appear to use OCSP stapling, which would address those privacy concerns.

Some people have suggested using something like a Pi-hole to block ocsp.apple.com entirely. (You could use Little Snitch for that in versions of macOS prior to Big Sur, but as security researcher Patrick Wardle pointed out, trustd is one of the Apple apps whose traffic Little Snitch can no longer block—see “Apple Hides Traffic of Some of Its Own Apps in Big Sur,” 22 October 2020.) Blocking ocsp.apple.com seems like a bad idea because you would be vulnerable to any malware that Apple discovered and addressed by revoking its developer certificate. Apple runs many hosts that modern Macs must be able to contact at particular times for certain operations.

In the end, it’s hard to avoid feeling a little less confident in the Mac. I honestly believe this was a rare error on the part of Apple’s network operations staff, such that we’re extremely unlikely to ever suffer from it again. I also anticipate that Apple will be taking steps within macOS to prevent similar situations from occurring in the future and to address the concerns that this situation raised.

In fact, since I initially published this article, Apple updated its “Safely open apps on your Mac” support page with this text:

Privacy protections

macOS has been designed to keep users and their data safe while respecting their privacy.

Gatekeeper performs online checks to verify if an app contains known malware and whether the developer’s signing certificate is revoked. We have never combined data from these checks with information about Apple users or their devices. We do not use data from these checks to learn what individual users are launching or running on their devices.

Notarization checks if the app contains known malware using an encrypted connection that is resilient to server failures.

These security checks have never included the user’s Apple ID or the identity of their device. To further protect privacy, we have stopped logging IP addresses associated with Developer ID certificate checks, and we will ensure that any collected IP addresses are removed from logs.

In addition, over the next year we will introduce several changes to our security checks:

  • A new encrypted protocol for Developer ID certificate revocation checks
  • Strong protections against server failure
  • A new preference for users to opt out of these security protections

Those changes are all positive, and while it’s too bad that Apple failed to institute them proactively before this situation, I think this is mostly an indication of how hard security is. There’s certainly no conspiracy on Apple’s part—the company is only hurt when its actions detract from its pro-privacy stance.

Regardless, the fact that an Apple mistake could render Macs in general nearly useless shows just how interwoven our modern lives are with corporations like Apple. Not that it’s going to happen, or that there’s any realistic alternative, but if Apple were to disappear, our devices almost certainly wouldn’t continue to operate at their full capability.