No tool or process is ever 100% reliable, regardless of how many people you fire or how many layers of testing you insert. That said, the fact that it impacted so many systems speaks against this being a simple test failure or corner case.
Either way, the reason this one had so much more impact than other software failures is that it crashed the system hard on boot, and for many systems recovery required physical presence. That's very complicated in large enterprise environments.
For a case like this, it's more about how recovery can be improved than about how the problem could have been avoided. If it could have been fixed via a simple follow-up deployment, this would have been a much smaller news story.
Firmware updates are a similar problem. But most good hardware uses two firmware banks and alternates between them. So if an install goes south, you can switch to the standby bank with the previous version and life goes on.
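Roughly how the two-bank scheme works, as a toy sketch (the names and structure here are mine for illustration, not any particular vendor's firmware API):

```python
# Toy sketch of an A/B firmware bank scheme (hypothetical names, not a real API).
# The update is written to the standby bank, and the "active" pointer only flips
# after the new image verifies, so a bad install leaves the old version bootable.

class FirmwareBanks:
    def __init__(self):
        self.versions = {"A": "1.0", "B": "0.9"}  # what each bank currently holds
        self.active = "A"                         # bank the bootloader will use

    def standby(self):
        return "B" if self.active == "A" else "A"

    def install(self, new_version, verify):
        target = self.standby()
        self.versions[target] = new_version       # flash the standby bank only
        if verify(target):                        # e.g. checksum / signature check
            self.active = target                  # commit: next boot uses the new image
            return True
        return False                              # active bank untouched, old version still boots
```

If verification fails, the device simply keeps booting the old image; the commit-after-verify step is what makes the rollback automatic.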
Of course there is always another layer underneath. What if you have to update the boot loader code itself, which is responsible for managing the firmware banks? We had one such case at HP way back when. You can always find a layer where there is no redundancy, but hopefully you never have to dig that deep.
Chromebooks actually use the same principle as the two firmware banks: on every OS update the system switches banks and keeps the pre-update copy. There is an interesting case study where a big hotel chain was hit by ransomware during the pandemic. All Windows PCs were down, so no check-in. The IT team solved it by deploying a few pallets of Chromebooks to replace the PCs. Back then they still needed a USB stick with config data; nowadays they can be drop-shipped directly and all the config is loaded from the enterprise login. It's a good solution for lightweight endpoints like point-of-sale terminals or a hotel front desk, even though otherwise I'm not a big fan of Chromebooks. (I'm familiar with this case study because I edited the video promoting it for Google.)
And @philm, I do take issue with the notion of having to fire someone for this snafu, at least as a knee-jerk reaction. Once the causes are known, it may turn out to be justified.
But I have first-hand experience with a culture that had to do high-risk deployments on large infrastructure. If you live under threat of the firing squad, nobody is going to take that job. Amazon (at least back then) had a culture that combined accountability with teamwork. Nobody looked for a fall guy to blame; everyone worked together to fix issues, and each team owned its own operations. If you caused an issue, you had to write up a COE (Correction of Errors) and own the action items to prevent recurrence, and that was it. As a result, our average MTTR for severity 1 incidents was 15 minutes. You get the best of the best working together with just one goal: get it running again. In my years there, only one person got fired, and that was under extreme circumstances and certainly warranted.