A day where you're happy Flame doesn't run on Windows

Massive worldwide IT outages are halting flights and banks after a bad content update from CrowdStrike broke a lot of Windows systems. MSFT has a workaround now, but recovering impacted systems may be non-trivial.

Still a developing story, but it has been deemed the biggest IT fail since 2017.

Apparently the fix is to boot in safe mode, delete a specific file and then reboot. Manageable on your own local PC, but maybe not as trivial in large centrally managed IT environments.

Wow. New Zealand banks are offline. Airlines in India are handwriting boarding passes and then taking attendance. United, American Airlines, Frontier, and more are grounded in the US. 911 emergency call centers are disrupted.

This is computing carnage at a scale we’ve never seen.

Probably one line of code that got botched.

A different variation of ‘Too Big to…’

BTW - these are the fix instructions:

  1. Boot Windows into Safe Mode or the Windows Recovery Environment
  2. Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
  3. Locate the file matching “C-00000291*.sys” and delete it.
  4. Boot the host normally.

But in the era of Secure Boot, etc., this is not easy to do on all the systems agents use at the airport to check people in.
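
In case it helps picture step 3, here is roughly what it looks like scripted. Purely illustrative, and it assumes the system volume is mounted as C: inside Safe Mode / WinRE and that you have some scripting environment on hand; in reality most people will just delete the file by hand.

```python
# Minimal sketch of step 3 above: delete the bad channel file(s).
# Assumes the system volume is mounted as C: (it may get a different
# drive letter inside WinRE) and that you can delete files there.
from pathlib import Path

driver_dir = Path(r"C:\Windows\System32\drivers\CrowdStrike")

for channel_file in driver_dir.glob("C-00000291*.sys"):
    print(f"deleting {channel_file}")
    channel_file.unlink()
```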

Oh. My. Gawd.

Hahahahaha… when your title sponsor causes you to fuck up Free Practice 1 on race weekend…

Every day is a day that I’m happy Flame doesn’t run on Windows…

Colossal quantities of unfathomable badness happened in that window of opportunity during yesterday’s war-gaming exercise…

That is such a year-2000 screw-up.

Keep in mind that the bug is in CrowdStrike’s Falcon product update, not in the Windows OS per se. CrowdStrike Falcon is also available for the Mac, although they haven’t pushed out an update with the same bug (yet?).

This could have happened to any machine that receives pushed automatic updates without sandboxing first. Although we use security apps other than CrowdStrike, we never allow auto updates to be pushed out by the vendor until we’ve thoroughly tested them for weeks, for exactly this reason. Even a new Flame or Avid update gets tested in situ for some time before we update multiple machines.
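
As a toy illustration of that “soak it before it touches the fleet” workflow, a staged-rollout gate looks something like the sketch below. All names, the soak period, and the version string are hypothetical; real security products typically expose this as vendor-specific update policy rather than code you write yourself.

```python
# Toy sketch of a staged rollout: updates go to a small canary ring first
# and are only promoted to the full fleet after a soak period with no
# reported failures. All names and values here are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class UpdateStatus:
    version: str
    deployed_to_canaries: datetime
    canary_failures: int

SOAK_PERIOD = timedelta(days=14)  # how long canaries run it before the fleet does

def ready_for_fleet(update: UpdateStatus, now: datetime) -> bool:
    """Promote only after the soak period has passed with zero canary failures."""
    soaked = now - update.deployed_to_canaries >= SOAK_PERIOD
    return soaked and update.canary_failures == 0

status = UpdateStatus("2.3.1", datetime(2024, 1, 1), canary_failures=0)
print(ready_for_fleet(status, datetime(2024, 1, 20)))  # True -> roll out wider
```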

This debacle is a case of far too many people (both within CrowdStrike and their customers) being sloppy in ‘trusting’ an automatic new update before it was properly cooked.

Well, it just shows that there are a lot of single points of failure in our overall global IT infrastructure. This one was probably one unlucky engineer, but each of these failure points can also be exploited by threat actors. We just showed Putin another option for causing carnage across the West.

The other day I watched a long video about the lengths airline pilots now have to go to in order to deal not just with GPS interference but with GPS spoofing. Absolutely bonkers. Interestingly enough, the NG and MAX versions of the 737 no longer have a standalone clock in the avionics bay; instead they rely on GPS for the date and time. Except if you’re in a spoofed zone around Ukraine or the Middle East. You’d better have an analog wrist watch or you’re screwed.

Thanks for these clarifications. All of this became clear as the day unfolded. It’s indeed a good reminder to turn off automatic updates, especially while you’re in the middle of projects, and then set a few hours aside after the project wraps to run all the updates (after making a snapshot in case something goes wrong). It’s a trade-off, though, considering high-risk zero-day vulnerabilities on long-running projects. You could air-gap, but that comes with yet another set of challenges. No easy answers.

What I don’t understand is why MS allows a 3rd party to push out updates to their OS. Didn’t anyone at MS check the update first?

The security toolkit by design has access to the OS.

As @sam.marrocco correctly pointed out, it is not mandatory to enable that access, and it is best practice to quarantine software before an enterprise-wide rollout.

More bewildering to me is how few people will get fired.

Every participant in this failure is complicit, whether they are willingly complicit or dumb as rocks.

It is, after all, the responsibility of the CIO, the CSO and every engineer working in security or IT to mitigate or prevent such failures.

Evidently security certification, corporate training and best practices are meaningless.

We are living in Mr. Robot.

There were rumours that Windows 11 was going to be a Windows UI over a Linux backend, which of course it didn’t end up being.

I wonder whether, by the nature of the Linux kernel, something like this Falcon issue would be easier to patch and recover from if it were running on Linux. Not saying it definitely would be, but I’m thinking you’d still be able to mass-access Linux infrastructure to do it rather than having to manually go to each machine to fix it (due to the whole Windows OS hanging). Not even trying to pretend to be an expert here, though, so feel free to correct me, but I suspect enough of the OS would continue to run in the background to patch and reboot remotely.

One major bank in Australia was advertising how secure their customers were because they were protected by “The Falcon”. Don’t expect to see any of those ads for quite a while.

All enterprise-grade Windows software permits automatic updates in a tightly controlled manner.

The software downloads all patches or upgrades and provides a quarantine stage by running them on virtual machines in Hyper-V before deployment at scale, not to mention providing rollback states if necessary.

This failure is unforgivable, but of notable interest because CrowdStrike were instrumental in uncovering state-level activity in other global events.

Their credibility just took what could be a fatal hit.

Absolutely correct, philm.

The most shocking part of this, as you said, is that any company would enable auto-installs of anything on a system used by hospitals, airlines, etc. Yet they all did it, so as you stated, it’s either complacency or they were dumb as rocks.

Linux, Windows or Mac, once the damage is done by giving it permission to auto-install, recovery involves replacing a low-level driver that is already in use and preventing the machine from booting. You’d need to boot into some safe-mode/pre-driver environment to be able to fix the driver(s), or boot from another device and fix the issue on that drive before the OS runs.

We use Trend Micro, which is much simpler than Falcon, for some of our machines, and these security apps have their hooks sunk pretty deeply into the three major OSes. They’re also nearly impossible to uninstall (by design) if you forget a password.

No tool or process is ever 100%, regardless of how many people you fire and how many layers of testing you insert. Although the fact that it impacted so many systems speaks against this being just a test escape or corner case.

Either way, the reason this one had so much impact compared to other software failures is that it crashed the system hard on boot, and for many systems recovery requires physical presence. That’s very complicated in large enterprise environments.

For a case like this, it’s more about how recovery can be improved than about how the problem can be avoided. If it could have been fixed via a simple follow-up deployment, this would have been a much smaller news story.

Firmware updates are a similar problem. But most good hardware uses two firmware banks and alternates between them. So if an install goes south, you can switch to the standby bank with the previous version and life goes on.
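
For anyone curious, the two-bank scheme boils down to something like this toy model (all names and the verification step are hypothetical): the new image only ever lands in the standby bank, and the active pointer flips only after verification, so a botched flash leaves the old version bootable.

```python
# Toy model of A/B firmware banks: the update is written to the standby
# bank, and the active-bank pointer only flips after verification, so a
# bad image leaves the previous version ready to boot.
from dataclasses import dataclass, field

@dataclass
class Device:
    banks: list = field(default_factory=lambda: ["fw-1.0", "fw-1.0"])
    active: int = 0  # index of the bank we boot from

    def update(self, image: str, image_ok) -> bool:
        standby = 1 - self.active
        self.banks[standby] = image     # flash the standby bank only
        if image_ok(image):             # verify (checksum, test boot, ...)
            self.active = standby       # flip the pointer: new version is live
            return True
        return False                    # bad image: old bank still boots

dev = Device()
dev.update("fw-2.0", image_ok=lambda img: False)  # botched update
print(dev.banks[dev.active])                      # still "fw-1.0", life goes on
```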

Of course there is always another layer underneath. What if you have to update the bootloader code, which is responsible for managing the firmware banks? We had one such case at HP way back when. You can always find a layer where there is no redundancy, but hopefully you never have to dig that deep.

Chromebooks actually use the same principle as the two firmware banks. On every OS update the device switches banks and keeps the pre-update copy. There is an interesting case study where a big hotel chain was attacked by ransomware during the pandemic. All Windows PCs, no check-in. The IT team solved it by deploying a few pallets of Chromebooks to replace the PCs. Back then they still needed a USB stick with config data; nowadays they can drop-ship directly and all the config is loaded from the enterprise login. It’s a good solution for lightweight endpoints like point-of-sale or hotel front desks. Even though otherwise I’m not a big fan of Chromebooks (I’m familiar with this case study because I edited the video promoting it for Google).

And @philm, I do take issue with the notion of having to fire someone for this snafu, at least as a knee-jerk reaction. Once the causes are known, it may be justified.

But I have first-hand experience of a culture that had to do high-risk deployments on large infrastructure. If you live by the threat of the firing squad, nobody is going to take that job. Amazon (at least back then) had a culture that combined accountability with teamwork. Nobody looked for a fall guy to blame; everyone worked together to fix issues, and each team owned their own operations. If you caused an issue you had to write up a COE and own the action items to avoid recurrence, and that was it. As a result, our average MTTR for severity-1 incidents was 15 minutes. You get the best of the best working together with just one goal: get it running again. In my years there, only one person got fired, and that was under extreme circumstances and certainly warranted.

@allklier - I respectfully disagree with you, brother - every manager of every administrator in this situation is culpable - no excuses - they all failed - every single one.
It’s not a firing squad - this isn’t mid-20th-century Europe - people won’t get shot - they will just lose their grossly over-compensated positions, and rightly so - these clowns were all asleep at the wheel - there isn’t an alternate reality where this didn’t happen.

Unfortunately then we will all be worse off, because good engineers who know how to do this without bringing the world to a standstill will refuse to take these jobs, and the cycle will repeat.

There does need to be accountability for this incident, and for making sure it isn’t repeated. But better leadership techniques are needed, not cancel culture.

You need to eliminate the structural deficiencies. We have so much toxic work culture that contributes to this kind of situation: stupid budgets, stupid schedules, egos, bro culture and more.

People who have pride in their work and are respected don’t let stuff like this happen.

Respectfully brother, my opinion is not about cancel culture at all, and I’m a little surprised that you would think that, let alone say that.

The people responsible for these outages should take responsibility for their deficiencies, including leadership.

Your suggestion that good people won’t take those jobs is specious: good people evidently did not take those jobs; hopefully those bad employment choices will be replaced by good ones.

It’s also remarkably suspicious that CrowdStrike performed a global rollout of a single point of failure.

This has nation-state-level paw prints all over it.