LucidLink down

It's been down all day. Apparently they got hacked and are having issues restoring the service. Join their Slack to see what's going on in more or less real time.

Ugh. I have backups from last night of course, so I'm staying calm. The most we've lost is about half a day of work across the whole studio, and there's nothing much I can do.

This isn't great, and it shows once again that trusting a core asset to a third party for convenience isn't always the best idea.

(It could totally have happened to my own infra as well…)

Wait, is it a hack or are they getting DDoSed?

They said DDoS, but now they're talking about possible data loss and a malicious actor… it's a developing story on their Slack server.

It’s been an interesting day. Definitely sounds like more than just a DDoS.

They’re rolling the metadata service back to an earlier sync point from their backups.

They were just flying high at NAB, and this is a pretty rough landing. Though they could use some crisis comms folks. I would not have come out and said ‘you may lose data’ so early and so publicly.

We’ll see how they untangle this. Their EULA is pretty clear that they’re not liable for any work or time lost by clients. But the PR might be another matter.

Yeah it’s been some kind of fucking day

mayhem in /r/editors

They reached out to me personally via email as well (the German support team). They went full force even on a public holiday…

Fingers crossed this doesn't take them down.

Close to 48hrs. And widespread impact.

Will have to see what the fallout is. Some of it depends on the eventual explanation of the root cause.

There definitely could have been better response plans, and possibly some missed opportunity to harden critical failure points against attack. But hindsight is 20-20.

I hope cool heads prevail and we all learn from this: infrastructure providers on how to minimize this kind of failure, and users on how to handle infrastructure being down, which can always happen for many reasons. No need to line up the firing squad. Though I’m doubtful, given the blame-first, think-later time we live in.

We’re up in NY, but the timestamps in one of our folders are all August 22, 2019.

I asked my personal sales guy, and he says they don't even have a way to know exactly what point in time their backup of my filespace is from.

I can only imagine the mayhem this means for bigger studios. Rough.

I think we were luckier here in the States; it went down during our overnight hours.

While unfortunate, as long as the timestamps are older rather than newer, any automated dependency checks (tools that decide what needs to be rebuilt based on timestamps) will still do the right thing.
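
For anyone wondering what I mean, here's a rough make-style sketch in Python (the file names are hypothetical) of how those timestamp checks typically work:

```python
# Rough sketch of a make-style timestamp check; file names are hypothetical.
import os

def needs_rebuild(target: str, sources: list[str]) -> bool:
    """Rebuild if the target is missing or any source is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Sources whose timestamps were rolled back to *older* dates still compare
    # as "not newer", so an already up-to-date target is left alone.
    return any(os.path.getmtime(src) > target_mtime for src in sources)

# e.g. needs_rebuild("comp_v003.exr", ["plate_v002.exr", "comp_setup.py"])
```

Older-than-expected timestamps just mean nothing gets flagged for a rebuild; newer-than-expected ones would have triggered a wave of unnecessary rework.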

Well, the first version of the root cause analysis is in.

Confirming what we suspected.

A combination of mass data corruption in the metadata servers that triggered a self-inflicted DDoS on the discovery service. They essentially had to rebuild an entire fleet of thousands of metadata servers, both infrastructure and data, while also checking and understanding all the security issues.
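
For those wondering what a 'self-inflicted DDoS' looks like in practice: when a backend dies, naive clients all retry at once and hammer whatever is still standing. Here's a hypothetical sketch (not their actual client code) of that failure mode next to the usual mitigation, exponential backoff with jitter:

```python
# Hypothetical sketch of a retry storm vs. backoff with jitter.
# lookup() stands in for a call to a discovery/metadata service.
import random
import time

def naive_retry(lookup):
    # Every client retries immediately and in lockstep, so the moment one
    # backend fails, its own clients flood the next service in line.
    while True:
        try:
            return lookup()
        except ConnectionError:
            continue

def backoff_retry(lookup, base=0.5, cap=60.0):
    # Exponential backoff with jitter spreads retries out over time,
    # which keeps a mass failure from turning into a self-inflicted DDoS.
    delay = base
    while True:
        try:
            return lookup()
        except ConnectionError:
            time.sleep(random.uniform(0, delay))
            delay = min(cap, delay * 2)
```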

On an execution level, a very impressive and textbook case given the scale and complexity. And all the backups worked. :)

The root cause analysis is still soft on, well, the ‘root cause’. What they do know is that all the disks of their metadata servers got overwritten with random data, across multiple regions and time zones. They assume that someone gained access to an internal server that had elevated access to the production environment and used that to launch the attack.

What doesn’t totally compute is that this attack took down thousands of servers around the world all at once. That would require a script deployed through the same mechanisms that push out software and config changes to the server fleet. Which leaves the door open that this may have been a very unfortunate own goal.

Not sure we’ll ever find out for certain. If we do find out, it was a bad guy; if not, well, silence speaks too. That being said, these things can happen to the best.

For me the biggest takeaway is that the engineering effort of the recovery, even though it took 48 hours, was very impressive, backed by a lot of skill and resources. Ultimately that is what you should pick your vendors for, not who can sell you the cheapest cloud storage with a fancy slide deck, website, and story. There are a lot of companies out there that would struggle to match this on an operational level. So don’t be cheap.

Also, this type of effort only works in cloud environments. On-prem, this outage would have lasted a lot longer. The old days…

It’s always DNS.

I still think Discreet’s fix for the “end-of-time” bug in the ’90s was the most impressive.

Not familiar with that one, will have to look it up if there’s a write-up.

In my book the most impressive (although not a fix, but a migration) was when Amazon moved its EU websites from a Virginia data center to an Ireland data center. The websites were never offline; there was just a 10-minute window at midnight when checkout was disabled, between the last Virginia-based order and the first Ireland-based order (to replay the last log file of the order data Oracle DB). That was all pre-AWS. Literally moving the three major websites along with all their data across an ocean without downtime. All those people that email you ‘our website will be down for maintenance for 6 hours on the weekend’ are just slackers.

PS: I was tangentially involved in that Amazon migration. Back then my team was still in charge of the home pages, and we had to manage the messaging around the checkout conditions. The infrastructure teams took care of moving all of our databases. I was part of the coordinating team: large conference room, lots of laptops and pizza.

I was always impressed by Magnus Glantz @ IKEA, and Mattias Haern at Red Hat.

IKEA vs. SHELLSHOCK

I think it was 2015 when they did it.

Two people vs. the world.

Some dev decided a “random” number would be enough to mark the end of operational time for Discreet’s Stone arrays. They sent out patches to fix the software, but our reseller was taking too long to service all the clients, so I had to do their work on all of our SGI systems. @ytf would probably add some corrections.

I thought it wasn’t random. It was an SGI limitation that they were aware of, but they knew they had a few years before it would be a problem, then forgot about it.