Yesterday’s two-and-a-half-hour Facebook outage was the site’s worst in four years, and its effects extended far beyond Facebook: since many sites (including this one) use Facebook social plugins like the “Like” button, there were a lot of ugly unfilled boxes all over the Internet while the site was down. It’s scary how reliant so much of the Internet has become on Facebook, for better or worse.
Following the outage, Facebook software engineering director Robert Johnson posted a note explaining the technical details: Basically, a change to the site caused a preexisting system for correcting errors to interpret the configuration values for every single client as invalid. All of the system’s attempted fixes caused Facebook’s clusters to be “quickly overwhelmed by hundreds of thousands of queries a second.”
Guardian tech writer Charles Arthur has handily parsed Johnson’s note in a recent posting, where he boils Facebook’s solution down to the oldest of strategies for frustrated tech users: Turning everything off and then on again. Arthur: “Ever been on the phone to IT support and they told you to turn it off and then on again, and that sorts it out? Facebook last night had that sort of problem. So they turned the site off and on again. And it fixed their problem. Literally.” No small feat for a site with tentacles as far-reaching as Facebook’s.
From Johnson’s note:
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
“The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
“This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.”
Arthur translates that last bit: “there may be some times over the next few days when you won’t be able to reach Facebook in particular places, or that unusual things will happen.” So some Facebook users may be forced to roam the streets in tears, shoving photos of themselves in people’s faces and screaming ‘DO YOU LIKE THIS? DO YOU??’ as they were yesterday, but hopefully not for too long or in too great a volume.
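For the technically curious, the feedback loop Johnson describes can be sketched as a toy simulation. This is only an illustration under loud assumptions, not Facebook’s actual code: the names `ToyDatabase`, `OverloadedError`, and `simulate` are made up, the "database" can answer a fixed number of queries per tick, and all clients share a single cached value. The one switch, `delete_on_error`, controls whether a failed query is treated as an invalid value (and the cache key deleted) or simply ignored.

```python
class OverloadedError(Exception):
    """Raised by the toy database when it receives more queries than it can serve."""


class ToyDatabase:
    """Hypothetical database that can answer only a fixed number of queries per tick."""

    def __init__(self, capacity_per_tick):
        self.capacity_per_tick = capacity_per_tick
        self.seen_this_tick = 0

    def start_tick(self):
        self.seen_this_tick = 0

    def query(self, key):
        self.seen_this_tick += 1
        if self.seen_this_tick > self.capacity_per_tick:
            raise OverloadedError(key)
        return "valid-config"


def simulate(num_clients, capacity_per_tick, ticks, delete_on_error):
    """Return the number of database queries issued in each tick."""
    db = ToyDatabase(capacity_per_tick)
    cache = {}  # shared cache, empty at the start (as if the value was just purged)
    queries_per_tick = []
    for _ in range(ticks):
        db.start_tick()
        queries = 0
        # All clients look at the cache at the same moment; on a miss,
        # every one of them falls through to the database.
        if "config" not in cache:
            for _client in range(num_clients):
                queries += 1
                try:
                    cache["config"] = db.query("config")
                except OverloadedError:
                    if delete_on_error:
                        # The behavior from the note: an error is treated as an
                        # invalid value, so the cache key is deleted.
                        cache.pop("config", None)
                    # Otherwise the error is treated as transient and the
                    # cached value (if any) is left alone.
        queries_per_tick.append(queries)
    return queries_per_tick


if __name__ == "__main__":
    print(simulate(num_clients=5, capacity_per_tick=2, ticks=4, delete_on_error=True))
    print(simulate(num_clients=5, capacity_per_tick=2, ticks=4, delete_on_error=False))
```

With deletion on error, the clients that fail keep wiping out the value the successful clients just cached, so the full query load comes back every tick and the database never gets a quiet moment. With errors treated as transient, the first good answer stays in the cache and the load drops to zero after one tick. In this toy model, the only way out of the first case is to stop the clients entirely, which is the off-and-on-again fix in miniature.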