Facebook went dark for more than two and a half hours yesterday, its biggest service outage in more than four years. "Like" buttons disappeared from all over the web. And Gene Weingarten rejoiced.
Before engineers in Palo Alto could finally fix what ailed the world's biggest social network, they actually had to shut the whole thing down and reboot it. So what went wrong?
Robert Johnson, Facebook's director of software engineering, explained it all here.
Charles Arthur at the Guardian did a nice job decoding it. And for the truly slothful, I've attempted to create this plain-English crib sheet.
With more than 500 million members uploading pictures, status updates and other blather every day, Facebook has enormous stores of data it needs to back up and cache. And like any big data-storage operation, it can't simply rely on one set of servers to do the job.
So Facebook built a network of servers around what Johnson calls a "persistent store." Think of it like the hub of a bicycle wheel, with spokes connecting it to Facebook's other servers. When something goes wrong on a satellite server out there on the rim of the wheel, it is programmed to check in with the hub for a fix.
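In caching terms, that check-in is a fallback to the authoritative store. Here is a minimal sketch of the hub-and-spoke idea (all class and key names are hypothetical, not Facebook's actual code): an edge server answers from its local cache and, on a miss, pulls the value from the hub.

```python
# Minimal cache-aside sketch (names hypothetical): a spoke server serves
# from its local cache and falls back to the hub's persistent store on a miss.

class Hub:
    """The hub: authoritative persistent store."""
    def __init__(self, data):
        self.data = dict(data)

    def get(self, key):
        return self.data[key]

class EdgeServer:
    """A spoke server on the rim: local cache backed by the hub."""
    def __init__(self, hub):
        self.hub = hub
        self.cache = {}

    def get(self, key):
        if key not in self.cache:        # cache miss: check in with the hub
            self.cache[key] = self.hub.get(key)
        return self.cache[key]

hub = Hub({"config": "v1"})
edge = EdgeServer(hub)
print(edge.get("config"))  # fetched from the hub, then cached locally
print(edge.get("config"))  # served from the local cache, no hub traffic
```

This design keeps the hub quiet in normal operation, since most reads never leave the rim; the trouble starts only when every spoke decides it needs the hub at once.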
But yesterday, Facebook's engineers inserted a bad piece of code into the heart of the machine. It was picked up by all the servers around the rim of the wheel, and when the code didn't work, the servers followed their instructions and queried the hub asking for a fix... again and again and again.
This whole mess accelerated until the wheels came off: Facebook crashed, and the engineers in Palo Alto had to put the axle back together again and reboot.
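The loop Johnson describes is a classic retry storm: the "fix" every server fetches is the same broken value, so nobody ever stops asking, and the hub drowns in queries. A toy simulation (all names hypothetical, not Facebook's system) shows why shutting everything down was the only way out:

```python
# Toy sketch (names hypothetical) of the feedback loop: every client that
# sees a broken value asks the hub for a fix, receives the same broken
# value, and is back asking again next round. Load never drops to zero.

class PersistentStore:
    """The hub, stuck serving a bad value."""
    def __init__(self, value):
        self.value = value
        self.queries_served = 0

    def fetch(self):
        self.queries_served += 1
        return self.value

def simulate(num_clients, rounds, store):
    """Each round, every client holding an invalid value re-queries the hub."""
    for _ in range(rounds):
        for _ in range(num_clients):
            value = store.fetch()
            if value != "valid":
                # The "fix" is the same broken value, so this client
                # will query the hub again next round.
                pass
    return store.queries_served

store = PersistentStore("broken")
load = simulate(num_clients=1000, rounds=5, store=store)
print(load)  # 5000 queries in 5 rounds: every client, every round
```

With a healthy value, clients would query once and go quiet; with a broken one, the query rate stays pegged at the full client count every round, which is why the only cure was to stop all the clients at once and bring them back against a corrected store.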