Facebook’s engineers have posted a detailed explanation of yesterday’s 2.5-hour outage, the site’s worst in over four years, on their Facebook page. Facebook says that, ultimately, the severity of the outage was caused by “an unfortunate handling of an error condition.”
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
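The pattern Facebook describes is a cache-and-database feedback loop, and it is easier to see in code. The sketch below is a hypothetical Python reconstruction of that client behavior, not Facebook’s actual implementation; the names (get_config, fetch_from_db, client_caches) and the simplistic load model are assumptions made purely for illustration.

```python
import random

# Hypothetical sketch (not Facebook's code) of the feedback loop described
# above: a client that sees an invalid value -- or any error from the
# database -- deletes its cache entry and queries the cluster again.

NUM_CLIENTS = 1_000
client_caches = [dict() for _ in range(NUM_CLIENTS)]  # one cache per client
config_fixed = False    # has the bad persistent value been repaired yet?
load = 0                # queries hitting the database cluster this round


def fetch_from_db():
    """Stand-in for one query to the configuration database cluster; the
    heavier the load this round, the more likely the query fails."""
    global load
    load += 1
    if random.random() < min(0.95, load / NUM_CLIENTS):
        raise RuntimeError("cluster overloaded")
    return "good-value" if config_fixed else "bad-value"


def get_config(cache):
    """The buggy client pattern: an invalid value and a query error are
    handled identically -- drop the cache key and go back to the database."""
    if cache.get("config") != "good-value":
        cache.pop("config", None)          # delete the "bad" cache key
        try:
            cache["config"] = fetch_from_db()
        except RuntimeError:
            pass                           # key stays gone; re-query next time
    return cache.get("config")


for round_no in range(12):
    if round_no == 5:
        config_fixed = True                # the original bad value is corrected...
    load = 0
    for cache in client_caches:
        get_config(cache)
    # ...yet queries keep arriving, because errors from the overloaded
    # cluster keep evicting cache entries and forcing re-queries.
    print(f"round {round_no:2d}: {load} database queries")
```

Because a query error and an invalid value are handled identically, an overloaded database generates more queries against itself; even after the bad value is repaired partway through the simulation, clients keep hammering the cluster until the errors subside.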
Ultimately, the outage was Facebook’s own fault, the result of a faulty configuration-value fix multiplied across millions of clients. The result? Overworked servers and databases brought to their knees. This sort of transparency is not the kind Facebook wants to have to employ often, but as usual, transparency is the answer to any company’s woes.