Anything else I should be looking at besides replacing the drives?
I have a feeling the volume is not recoverable because 2 of the 4 drives have errors.
As they are SSDs, it is not uncommon for them to fail at the same time in a RAID. An SSD has a finite number of writes, and drives in a RAID are likely writing the same amount of data at the same time, which greatly increases the likelihood of multiple SSDs in a RAID failing together.
Certain NAS/SAN/RAID systems monitor and manage an SSD RAID to ensure this doesn’t happen, by the way they spread the write load. Unfortunately, it seems that your setup doesn’t have anything like that.
It could be something entirely different from the above, but it’s the most logical explanation at first glance.
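To put rough numbers on the wear argument, here’s a back-of-envelope sketch. The endurance rating, write load, and variation figure are made-up illustrative numbers, not your drives’ actual specs:

```python
# Back-of-envelope: identically loaded SSDs in a RAID hit their rated
# write endurance at nearly the same time.
TBW_RATING = 600        # example endurance rating, terabytes written
DAILY_WRITES_TB = 0.5   # example sustained write load per drive, TB/day
UNIT_VARIATION = 0.03   # assume ~3% unit-to-unit spread in real endurance

days_to_wearout = TBW_RATING / DAILY_WRITES_TB
spread_days = days_to_wearout * UNIT_VARIATION

print(f"Nominal wear-out after ~{days_to_wearout:.0f} days "
      f"({days_to_wearout / 365:.1f} years)")
print(f"All members likely wear out within ~{spread_days:.0f} days "
      f"of each other")
```

With identical write loads, the only thing separating the drives’ failure dates is unit-to-unit variation, which is why staggering writes (or buying from different production batches) helps.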
Oh boy…same RAID we have. I had SO many problems with their SoftRAID and RAID 5 - it would dismount twice a day, and then in the first four weeks the entire thing crapped out and I lost everything. Their support couldn’t help.
Restriped as RAID 0 with the Apple RAID utility and it’s been fine for over 15 months. (We have daily backups, so not too worried about RAID 0.)
I have been using SoftRAID on Macs for many years and have rarely had any issues.
I opened up a ticket with OWC support and ordered two replacement SSDs as well.
Going to swap out the SSDs and see if I can get things working again.
Would love to implement a central S+W server, however it’s not always so simple to get the cost approved, or to deploy such a solution with minimal downtime.
I am the only engineer we have on staff, and my confidence in being able to spec out and deploy such a solution is not exactly the highest. I know I could get it done; I just lack the experience to spec the hardware, purchase, implement, and support this type of solution without making some mistakes along the way and causing downtime.
Bringing in an outside consultant to help set things up is not out of the question, but at the end of the day I need to be able to understand and support the solution without having to engage a third party.
Out of interest @Alan, would you even take that approach if you only had a single Flame in the building? Not being condescending at all, genuinely interested.
In the case of an SSD-based solution, do you have a way to avoid synchronized failure (if that was the cause)?
What exactly is a synchronized failure?
Looking at the SoftRAID interface and the log, this is my assessment of what likely happened.
(I am waiting to hear back from OWC support this morning on the exact cause)
I am just wondering if the drives are actually bad or if it’s a software thing… Would it be possible the hardware is fine and I can just make a new volume, restore from my backup, and keep going?
Yes to trying to get some separation, but don’t pick wildly different ones. In the old days you would try to get the same drive model, but with a wider range of serial numbers, into a RAID. Ideally you want similar or identical specs, so that slight variations in latency don’t cause new issues, but drives that weren’t made on the same production line on the same day, in case there was some contamination or the calibration was off.
Secondly, while it’s not a bad habit to stack the cards in your favor with this kind of diversity, ultimately any storage system should follow the 3-2-1 rule: 3 copies, 2 systems, 1 offsite.
Now for working storage - ignore the 1 offsite for a second. But that still means two copies on two separate systems.
You could have built a RAID 6 with two hot spares. Or mixed SSDs. At the end of the day, there is always a chance that the whole system crashes. You need to be able to recover the data in an appropriate timeline from an alternate source. Hedging your bets against double drive failure only changes the probability that you have to do it.
Alan’s central server is the equivalent of the 1 offsite, except it’s (most likely) on-prem.
There’s a different tradeoff. While the central server makes the nodes disposable and you can trash them at will without data loss, you’ve created a new single point of failure. That means you can focus all your effort on providing redundancy for that one server rather than having to do it at each node, but you’re still trading one problem for another you have to solve.
It depends on what your goal is. Different RAID levels have different purposes.
Capacity & cost, speed, redundancy.
RAID 5 was a means of getting more speed and reasonable redundancy out of spinning drives at a reasonable cost. A lot of compromise there, but it made sense for huge data sets on expensive drives.
For smaller working storage, drives are affordable enough that you can do RAID 0. There’s a downside though: RAID 0 is statistically the highest risk. The volume goes down if any one drive fails, and since you have two drives, you have roughly twice the probability. Still, that’s fantastic for fast ‘throw-away’ storage like cache.
In fact, a RAID 10 would be a better choice. That’s two redundant stripes. It’s less space efficient than RAID 5, but it also rebuilds faster, since a rebuild is a straight copy, not a reconstruction from parity.
You could also just use RAID 1 at that size. NVMe drives are already much faster than spinning drives anyway.
Check out specs, test, and figure out what is most important for you.
PS: RAID 5 can rebuild from one failed drive, RAID 6 from two.
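Here’s a quick sketch of the independent-failure math behind these tradeoffs. The 2% annual failure rate is just an illustrative number, and this ignores rebuild windows and the synchronized-wear issue from earlier in the thread, so treat the results as optimistic lower bounds:

```python
# Rough odds of losing the array within a year, assuming independent
# drive failures with a given per-drive annual failure rate (AFR).
from math import comb

def array_loss_prob(n, tolerated, p):
    """P(more than `tolerated` of n drives fail in the year)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(tolerated + 1, n + 1))

p = 0.02  # illustrative 2% AFR per drive
for name, n, tol in [("RAID 0 (2 drives)", 2, 0),
                     ("RAID 1 (2 drives)", 2, 1),
                     ("RAID 5 (4 drives)", 4, 1),
                     ("RAID 6 (4 drives)", 4, 2)]:
    print(f"{name}: {array_loss_prob(n, tol, p):.4%}")

# RAID 10 (two mirrored pairs) only dies if both drives of the
# same pair fail:
print(f"RAID 10 (4 drives): {1 - (1 - p**2)**2:.4%}")
```

The RAID 0 number comes out to roughly 2p, which is the ‘twice the probability’ point above.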
Also, all this only works if you have alerts that tell you when a drive has failed. We have seen folks lose data because they had no alarm on drive failure, so the RAID continued to work without redundancy for months, until the next drive failed.
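For what it’s worth, even a dumb script on a cron/launchd schedule beats no alert at all. A minimal sketch, assuming smartmontools is installed; the device paths and the notification hook are hypothetical placeholders, and SoftRAID’s own email notifications would be the more integrated option:

```python
# Minimal SMART health poller - schedule it and wire the alert into
# email/Slack. Assumes the smartmontools `smartctl` binary is on PATH.
import subprocess

MEMBER_DISKS = ["/dev/disk2", "/dev/disk3"]  # hypothetical device names

def smart_healthy(dev):
    # `smartctl -H` prints an overall health verdict and returns a
    # nonzero exit status when the drive reports problems.
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    return result.returncode == 0 and "PASSED" in result.stdout

for dev in MEMBER_DISKS:
    if not smart_healthy(dev):
        print(f"ALERT: {dev} failed its SMART health check!")
        # hypothetical hook: send_email(...) or post_to_slack(...)
```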
I am debating just trying to build a new volume and keep using the same hardware (just RAID 0).
I have a feeling the SSDs inside are fine, but I kind of want to wait to see how OWC responds to my support case before giving up on the volume completely.
I looked back at my convo with OWC support. They ultimately decided the APFS volume was corrupt and there was no known way to restore the RAID. This was in the early Ventura days, on SoftRAID 7.5.