As the whole Internet knows, Facebook and other stuff they own were all down for several hours a few days ago. They were off the network entirely: DNS couldn't resolve their host names. A post from Cloudflare describes what happened from the outside, including explaining how some of the key parts work (like BGP and Autonomous Systems, terms I learned this week), and a post from Facebook explains what happened inside.
Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.
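To see why the caching part matters, here's a toy sketch (not any real resolver's code; the class, TTL, and addresses are all invented for illustration) of a stub resolver that remembers SERVFAIL answers for a short time instead of re-asking the dead nameservers on every query:

```python
import time

class TinyResolver:
    """Toy resolver illustrating negative caching of SERVFAIL answers.
    All names, TTLs, and addresses here are made up for illustration."""

    def __init__(self, servfail_ttl=30):
        self.servfail_ttl = servfail_ttl  # seconds to remember a failure
        self.cache = {}                   # name -> (answer, expires_at)

    def upstream_lookup(self, name):
        # Stand-in for the real recursive lookup. During the outage the
        # authoritative nameservers were unreachable, so pretend every
        # facebook.com query times out.
        if name.endswith("facebook.com"):
            raise TimeoutError("no route to nameserver")
        return "192.0.2.10"  # placeholder address (TEST-NET-1 range)

    def resolve(self, name, now=None):
        now = time.time() if now is None else now
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]  # served from cache: no upstream traffic at all
        try:
            answer = self.upstream_lookup(name)
        except TimeoutError:
            answer = "SERVFAIL"
        # Caching the failure shields the resolver (and the dead
        # nameservers) from a fresh upstream lookup on every retry.
        self.cache[name] = (answer, now + self.servfail_ttl)
        return answer
```

The point of caching a failure is purely defensive: for the next `servfail_ttl` seconds, the flood of retries described below hits the resolver's cache instead of the unreachable nameservers.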
But that's not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.
This happened in part because apps won't accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won't take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.
[...] So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms.
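The polite alternative to the aggressive retrying described above is exponential backoff with jitter: each failed attempt waits roughly twice as long as the last, with some randomness so that millions of clients don't all retry in lockstep. A minimal sketch (the parameters are invented, not from any particular app):

```python
import random

def retry_schedule(max_attempts=5, base=0.5, cap=30.0, jitter=True):
    """Sketch of exponential backoff with jitter. Returns the list of
    wait times (seconds) before each retry. All defaults are arbitrary."""
    delay, schedule = base, []
    for _ in range(max_attempts):
        # Full jitter: wait a random amount up to the current ceiling,
        # so clients spread out instead of retrying in synchronized waves.
        wait = random.uniform(0, delay) if jitter else delay
        schedule.append(round(wait, 3))
        delay = min(cap, delay * 2)  # double the ceiling, up to a cap
    return schedule
```

With `jitter=False` this yields 0.5, 1, 2, 4, 8 seconds; with jitter on, each client picks a random point under that curve, which is what keeps a mass failure from turning into the 30x synchronized query tsunami Cloudflare saw.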
Also, today I learned that Cloudflare owns 1.1.1.1. They don't seem old enough to have been issued that; did they buy it from someone?
When you open one of our apps and load up your feed or messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly over our backbone network to a larger data center. [...] The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
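Facebook hasn't published that logic, but the pattern is simple enough to sketch: each DNS edge node keeps advertising its prefixes only while it can reach at least one data center, on the assumption that *it* is the broken one. (Function, data shapes, and the prefix below are all invented; 192.0.2.0/24 is a documentation prefix, not Facebook's.)

```python
def decide_advertisements(reachable_datacenters, prefixes):
    """Sketch of the fail-safe described above: keep BGP advertisements
    only while some data center is reachable. Invented for illustration."""
    if not reachable_datacenters:
        # Can't reach any backend: assume this node is the unhealthy one
        # and withdraw, so traffic shifts to healthier locations.
        return []             # withdraw everything
    return list(prefixes)     # healthy: keep advertising
```

The fail-safe assumes failures are local. When the backbone itself went down, *every* node lost its data centers at once, every node "politely" withdrew, and the nameserver prefixes vanished from the whole Internet simultaneously.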
And then the measures that protect their data centers from tampering kicked in when engineers tried to fix it.
They don't say, and I don't know, what the command was that was meant to query the network but actually shut it down. Yes, they had (faulty) auditing, but I have more fundamental questions, like: was there no "this will take down the network; are you sure? (Y/N)" check in that command?
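The kind of guard I mean is trivial to write. A sketch (everything here is hypothetical; in a real tool the impact estimate would come from a dry run against a model of the network, which is presumably the hard part):

```python
def run_backbone_command(command, impact, confirm=input):
    """Sketch of a destructive-impact guard: demand explicit confirmation
    before running a command whose estimated blast radius is the whole
    network. The impact string stands in for a real dry-run analysis."""
    if impact == "takes down the network":
        answer = confirm("This will take down the network; are you sure? (Y/N) ")
        if answer.strip().upper() != "Y":
            return "aborted"
    return f"ran: {command}"

# Simulated operator saying "N" instead of typing at a real prompt:
print(run_backbone_command("drain all backbone links",
                           "takes down the network",
                           confirm=lambda prompt: "N"))  # aborted
```

Of course, a guard like this is only as good as the impact estimate feeding it, which is exactly where Facebook's audit tool had its bug.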
Edited to add: I just came across a good explanation by mdlbear.