Two SSDs in Our Framestore RAID Fail at the Same Time

Using an OWC ThunderBlade.
Four 4TB SSDs configured in a software RAID 5.

It's pretty rare we see one of these fail, but two at the same time seems like there could be something more going on… perhaps the enclosure itself.

Anyone have any experience with this type of issue?


This is what I see inside SoftRAID.

Anything else I should be looking at besides replacing the drives?
I have a feeling the volume is not recoverable because 2 of the 4 drives have errors.

As they are SSDs, it is not uncommon for them to fail at the same time in a RAID. An SSD has a finite number of writes, and drives in the same RAID are likely writing the same amount of data at the same time, which greatly increases the likelihood of multiple SSDs in a RAID failing together.

Certain NAS/SAN/RAID systems monitor and manage an SSD RAID to ensure this does not happen by the way they spread the write load. Unfortunately, it seems that your setup doesn't have anything like that.
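The wear-synchronization point can be sketched with some back-of-the-envelope math. The TBW rating and daily write load below are illustrative assumptions, not specs for any particular drive:

```python
# Rough sketch of why RAID members wear out together: if every SSD in the
# array receives the same write load, they all approach their rated TBW
# (terabytes written) at roughly the same time.

TBW_RATING_TB = 2400   # assumed endurance rating for a hypothetical 4 TB SSD
DAILY_WRITES_TB = 1.5  # assumed write load per drive per day

def days_until_wear_out(tbw_tb: float, daily_tb: float) -> float:
    """Days until the rated endurance is consumed at a constant write rate."""
    return tbw_tb / daily_tb

# In a RAID 5, parity rotation spreads writes almost evenly, so every member
# drive tracks the same countdown:
days = days_until_wear_out(TBW_RATING_TB, DAILY_WRITES_TB)
print(f"All four drives reach rated TBW after ~{days:.0f} days (~{days/365:.1f} years)")
```

With identical drives from the same batch, the real-world spread around that date can be small, which is exactly the synchronized-failure scenario described above.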

It could be something entirely different to the above but it is the most logical at first glance.

Oh boy… same RAID we have. I had SO many problems with their SoftRAID and RAID 5 - it would dismount twice a day, and then in the first four weeks the entire thing crapped out and I lost everything. Their support couldn't help.

Restriped as RAID 0 with the Apple RAID utility and it's been fine for over 15 months. (We have daily backups, so not too worried about RAID 0.)


shall I say it ??? … Centralized S+W Server for the win!!!


@ALan - yes! say it!


There are a lot of folks that have had issues with this particular solution. It seems marginal for critical data.

Software RAID in theory should be fine, but there are more things that can go wrong than with a well-contained hardware controller.

Thanks everyone for the replies.

I have been using SoftRAID on the Macs for many years and have rarely had any issues.
I opened a ticket with OWC support and ordered two replacement SSDs as well.

Going to swap out the SSDs and see if I can get things working again.

Would love to implement a central S+W Server, however it's not always so simple to get the cost approved or to deploy such a solution with minimal downtime.

I am the only engineer we have on staff. My confidence in being able to spec out and deploy such a solution is not exactly the highest. I know I could get it done, I just lack the experience to spec the hardware, purchase, implement, and support this type of solution without making some mistakes along the way and causing downtime.

Bringing in an outside consultant to help set things up is not out of the question, but at the end of the day I need to be able to understand and support the solution without having to engage a third party.


Out of interest @ALan, would you even take that approach if you only had a single Flame in the building? Not being condescending at all, genuinely interested.

In the case of a SSD based solution, do you have a solution to avoid synchronised failure (if that was the cause)?

What exactly is a synchronized failure?
Looking at the SoftRAID interface and the log, this is my assessment of what likely happened.
(I am waiting to hear back from OWC support this morning on the exact cause.)

I am just wondering if the drives are actually bad or if it's a software thing… Would it be possible the hardware is fine and I can just make a new volume, restore from my backup, and keep going?

  1. Absolutely.

  2. We’ve never had an SSD fail. But the easiest solution is to just put a few different-model or different-manufacturer SSDs in there.

What's the logic behind populating your array with different manufacturers?
Less likely to all fail at the same time?

Which one, @ALan? StorNext SAN? Something else?

correct.

any fast reliable storage.

Question of the day… Do I skip RAID 5 altogether and just go full stripe + nightly backups?

Seems like the synchronization is what screwed me in the end. With SSDs I never had such a problem with RAID 0.

Two considerations:

Yes to trying to get some separation, but don't pick widely different ones. In the old days you would try to get the same drive model but with a wider range of serial numbers into a RAID. Ideally you want similar or identical specs so that slight variations in latency don't cause new issues, but drives that haven't been made on the same production line on the same day, in case there was some contamination or calibration was off.

Secondly, while it’s not a bad habit to stack the cards in your favor by this diversity, ultimately any storage system should follow the 3-2-1 rules: 3 copies, 2 systems, 1 offsite.

Now for working storage - ignore the 1 offsite for a second. But that still means two copies on two separate systems.

You could have built a RAID 6 with two hot spares. Or mixed SSDs. At the end of the day, there is always a chance that the whole system crashes. You need to be able to recover the data in an appropriate timeline from an alternate source. Hedging your bets against double drive failure only changes the probability that you have to do it.

Alan's central server is the equivalent of the 1 offsite, except it's (most likely) on-prem.

There's a different tradeoff. While the central server makes the nodes disposable and you can trash them at will without data loss, you've created a new single point of failure. That means you can focus all your effort on providing redundancy for that server rather than having to do it at each node. But you're still creating a different problem you have to solve.

There are no easy wins.

That’s my solution. RAID0 and it’s all backed up daily.

In our case the OWC NVMe drives were fine, but SoftRAID+Apple somehow scrambled the RAID 5. Couldn't even do a rebuild, which was the whole point of RAID 5.

It depends on what your goal is. Different RAID levels have different purposes.

Capacity & cost, speed, redundancy.

RAID 5 was a means of getting more speed out of spinning drives at reasonable cost, and reasonable redundancy at reasonable cost. A lot of compromise there. But that is for huge data sets and expensive drives.

For smaller working storage, drives are affordable enough that you can do RAID 0. There's a downside though - RAID 0 is statistically the highest risk. RAID 0 goes down if any drive fails, so with two drives you have twice the probability. That's fantastic for fast 'throw-away' storage like cache.
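The "twice the probability" point generalizes: a stripe with no redundancy fails when any member fails, so the risk compounds with drive count. A quick sketch, assuming an illustrative per-drive annual failure probability:

```python
# Probability that a RAID 0 stripe loses data within a year, assuming
# independent drive failures. The 2% per-drive rate is an assumption for
# illustration, not a measured figure.

def stripe_failure_prob(per_drive_p: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails."""
    return 1 - (1 - per_drive_p) ** n_drives

p = 0.02  # assumed 2% annual failure probability per drive
for n in (1, 2, 4):
    print(f"{n} drive(s): {stripe_failure_prob(p, n):.4f}")
```

For small per-drive rates the array risk is close to n times the single-drive risk, which is why a 4-drive stripe really does need the daily backup mentioned above.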

In fact a RAID 10 would be a better choice. That's two mirrored stripes. It's less space-efficient than RAID 5, but it also rebuilds faster, since a rebuild is a straight copy, not a reconstruction from parity.

You could also just use RAID 1, given the size. NVMe drives are already much faster than spinning drives anyway.

Check out specs, test, and figure out what is most important for you.

PS: RAID5 can rebuild from one failed drive, RAID6 can rebuild from 2.
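For comparison, here is roughly what four 4 TB drives yield in usable capacity under the common levels discussed above (standard RAID capacity formulas; the helper function is just for illustration):

```python
# Usable capacity for n drives of a given size under common RAID levels.

def usable_tb(level: str, n: int, size_tb: float) -> float:
    if level == "RAID0":
        return n * size_tb           # stripe: no redundancy, full capacity
    if level == "RAID1":
        return size_tb               # mirror: all drives hold the same data
    if level == "RAID5":
        return (n - 1) * size_tb     # one drive's worth of parity
    if level == "RAID6":
        return (n - 2) * size_tb     # two drives' worth of parity
    if level == "RAID10":
        return n * size_tb / 2       # mirrored stripes: half the capacity
    raise ValueError(f"unknown level: {level}")

for lvl in ("RAID0", "RAID1", "RAID5", "RAID6", "RAID10"):
    print(lvl, usable_tb(lvl, 4, 4.0), "TB usable")
```

Note that RAID 6 and RAID 10 land on the same usable capacity for four drives; the difference is in rebuild behavior and which failure combinations each can survive.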

Also, all this only works if you have alerts that tell you when a drive fails. We have seen folks lose data because they had no alarm on drive failure, so the RAID continued to work without redundancy for months, until the next drive failed.


Seems like this is what happened to us exactly.

I am debating just trying to build a new volume and keep using the same hardware (just raid0)

I have a feeling the SSDs inside are fine, but I kind of want to wait to see how OWC responds to my support case before giving up on the volume completely.

I looked back at my conversation with OWC support. They ultimately decided the APFS was corrupt and there was no known solution to restore the RAID. This was in the early Ventura days, on SoftRAID 7.5.