Well, the first version of the root cause analysis is in.
Confirming what we suspected.
A combination of mass data corruption in the metadata servers and the self-inflicted DDoS on the discovery service that the corruption triggered. They essentially had to rebuild an entire fleet of thousands of metadata servers, both the infrastructure and the data, while also checking for and understanding all the security issues.
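The RCA doesn't spell out how the corruption became a DDoS on the discovery service, but the usual mechanism is thousands of clients re-querying discovery in lockstep once the metadata layer stops answering. A minimal sketch, assuming that kind of cascade (the function name and the `lookup` callable are hypothetical, not from the report): exponential backoff with jitter is the thing that keeps a mass failure from turning into a self-inflicted traffic spike.

```python
import random
import time

def resolve_with_backoff(lookup, max_attempts=6, base=0.5, cap=30.0):
    """Call `lookup()` (e.g. a discovery-service query) with exponential
    backoff and full jitter, so failing clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return lookup()
        except ConnectionError:
            # A naive client retries immediately here; thousands of clients
            # doing that at once is what hammers the discovery service.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise TimeoutError("discovery lookup failed after retries")
```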
On an execution level, very impressive and a textbook case given the scale and complexity. And all the backups worked.
The root cause analysis is still soft on, well, the ‘root cause’. What they know is that the disks of their metadata servers got overwritten with random data, across multiple regions and timezones. They assume that someone gained access to an internal server with elevated access to the production environment and used that to launch the attack.
What doesn’t totally compute is that this attack took down thousands of servers around the world all at once. That requires a script deployed through the same mechanisms that push software and config changes out to the server fleet. Which leaves the door open that this may have been a very unfortunate own goal.
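To illustrate why the push pipeline is the obvious vector: a hypothetical staged-rollout wrapper (the names and parameters are mine, not the vendor's tooling). Without a canary stage like this, whatever rides the push mechanism, malicious payload or botched change, lands on the whole fleet at the same time.

```python
import random

def staged_push(hosts, apply_change, canary_fraction=0.01, healthy=lambda h: True):
    """Push a change to a small canary batch first and abort if it breaks,
    instead of fanning out to the entire fleet in one step."""
    hosts = list(hosts)
    random.shuffle(hosts)
    canary_count = max(1, int(len(hosts) * canary_fraction))
    canaries, rest = hosts[:canary_count], hosts[canary_count:]

    for host in canaries:
        apply_change(host)
    if not all(healthy(h) for h in canaries):
        # This is the check that limits the blast radius; skip it and a
        # single bad push reaches thousands of servers simultaneously.
        raise RuntimeError("canary batch unhealthy; aborting fleet-wide push")

    for host in rest:
        apply_change(host)
```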
Not sure we’ll ever find out for certain. Or rather: if we do find out, it was a bad guy; if we don’t, well, silence speaks too. That being said, these things can happen to the best.
For me the biggest takeaway is that the engineering effort behind the recovery, even though it took 48 hours, is very impressive, with a lot of skill and resources behind it. Ultimately that is what you should pick your vendors for, not who can sell you the cheapest cloud storage with a fancy slide deck, website, and story. There are a lot of companies out there that would struggle to match this on an operational level. So don’t be cheap.
Also, this type of effort only works in cloud environments. On-prem, this outage would have lasted a lot longer. The old days…