NVMe RAID - suddenly stopped working!

Speaking of heat and specifically NVMe - that’s definitely something to keep an eye on. They do get quite hot when you transfer data at high speeds, especially if you write to them at > 2GB/s over longer periods. In one system I have two FireCudas, and it took the folks building that system a lot of time to get them cooled properly.

Now they’re supposed to monitor their temps and then throttle back as needed rather than go MIA, just like CPUs and GPUs. Which is why it’s good to test them every so often. Because if you don’t have enough cooling for them, they might work, but just run a lot slower than you think they do, because they’re throttled on temp.
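If you log write speeds during a long transfer, a throttling drop is easy to spot in code. Here is a minimal Python sketch of that idea. The function name `find_throttle_point`, the 3-sample window and the 70% threshold are all my own assumptions, and the numbers are made up for illustration, not measurements from any real drive.

```python
from statistics import mean

def find_throttle_point(samples_mb_s, window=3, drop_ratio=0.7):
    """Return the index of the first window whose average throughput falls
    below drop_ratio * baseline, or None if no such drop is seen.

    The baseline is the average of the first `window` samples.
    """
    if len(samples_mb_s) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    baseline = mean(samples_mb_s[:window])
    for start in range(window, len(samples_mb_s) - window + 1):
        if mean(samples_mb_s[start:start + window]) < drop_ratio * baseline:
            return start
    return None

# Hypothetical per-second write speeds in MB/s (illustrative only).
steady = [2400, 2380, 2410, 2390, 2400, 2405, 2395, 2400]
throttled = [2400, 2380, 2410, 2390, 1900, 1200, 1150, 1100]

print(find_throttle_point(steady))     # no sustained drop
print(find_throttle_point(throttled))  # index where the drop begins
```

Averaging over a small window keeps a single slow sample from being flagged as throttling. How you collect the samples (a disk benchmark, a timed copy) is up to you.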

The other aspect is that constant huge temp swings on the circuit boards make them more susceptible to errors as they expand and contract, especially around the connectors. Way back (like 10 years) Apple had a whole batch of MBPs where heavy GPU loads would break solder points and then randomly panic the system.

I had one of those MBPs…didn’t fail but you risked third degree burns if you used it while wearing shorts!

I would run a Time Machine backup on the system disk, then, in Recovery, you could try to copy the framestore disk to another volume. If you give the new volume the same name as the RAID, Flame should find it.
Good luck!

Hi Sam…not sure why you would run Time Machine on the system disk when that’s not the disk with the problem…unless you are saying to run Time Machine whilst already in Recovery mode? Can you run Time Machine in Recovery?
I’ll look into it on Monday…thanks for the suggestion.

Sam’s advice is solid. Time Machine is a good idea, not for recovering the data on the RAID, but in case your attempts to recover the RAID change the drive config in a way you’d like to undo. You would have an OS snapshot to restore to (albeit painfully). That only covers OS-level config data, though. If the settings are stored in the controller’s config, Time Machine won’t help.

Anytime you make critical changes to macOS itself (upgrade, config), it’s always worthwhile to have a Time Machine snapshot just in case. Many years back an upgrade went sideways and my Mac Pro got stuck in a boot loop. Time Machine was my ticket to reverting that.

Having said that, given that we determined that one of your physical disks appears to be missing, you may have a different problem. Since it’s a striped RAID 0 volume, unless you get that drive to resurface, you are out of options for recovery. There’s no redundancy and the data is striped across the four drives, so you’re missing critical parts. Your only hope is that the drive itself hasn’t failed but that it’s simply an unreliable connection that is resolved by reseating it.
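To illustrate why one missing drive is fatal for RAID 0, here is a toy Python sketch of striping. It is not how any particular controller implements it: the 4-byte chunk size is an assumption chosen for readability (real stripe sizes are far larger), and `stripe` and `reassemble` are hypothetical helper names.

```python
def stripe(data: bytes, num_drives: int = 4, chunk_size: int = 4):
    """Split data into chunks and deal them round-robin across the drives."""
    drives = [[] for _ in range(num_drives)]
    for i in range(0, len(data), chunk_size):
        drives[(i // chunk_size) % num_drives].append(data[i:i + chunk_size])
    return drives

def reassemble(drives, missing=None):
    """Rebuild the data row by row; chunks on the missing drive become '?'."""
    out = []
    rows = max(len(d) for d in drives)
    for row in range(rows):
        for idx, drive in enumerate(drives):
            if row >= len(drive):
                continue  # this drive holds no chunk in the last, partial row
            chunk = drive[row]
            out.append(b"?" * len(chunk) if idx == missing else chunk)
    return b"".join(out)

data = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"
drives = stripe(data)

print(reassemble(drives))             # all four drives present
print(reassemble(drives, missing=2))  # third drive gone: holes throughout
```

With one drive gone, every fourth chunk of every file is missing and there is no parity to rebuild it from, which is why getting the drive to reappear is the only route back.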

What do you have in terms of backups in case the RAID is lost? Hopefully, as a last resort, there’s a Flame Archive on a different volume which you could restore, and then re-cache and re-render everything as needed. Maybe it’s only the few hours of work just prior to the reboot that are gone for good?

This kind of stuff is everyone’s worst nightmare. You’re on a client job with a tight deadline, and suddenly you lose a day (hopefully not more) of work and time to get back to where you were.

We just had a conversation about that on Thursday. The reason freelancers and facilities don’t charge the same rate (and shouldn’t) is, among other things, the resiliency to such failure scenarios, with clients taking calculated risks on that in order to save cost on their end. Though we always do our best to minimize it within reason.

PS: If you have a different M.2 dock available, I’d try to see if that NVMe stick comes up outside of your HighPoint controller to troubleshoot if it’s the drive, the socket, or the controller. With that info, if simply reseating doesn’t work but the drive works in a separate dock, you could move it to one of the other slots on the controller and hopefully it sees it there, since you only use half the available slots, if I saw the info right.
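The elimination steps above can be written down as a small decision function. This is just a Python sketch of the reasoning, with a hypothetical function name and labels of my own choosing. It is not a diagnostic tool and it doesn’t talk to any hardware.

```python
from typing import Optional

def likely_culprit(works_after_reseat: bool,
                   works_in_dock: Optional[bool] = None,
                   works_in_other_slot: Optional[bool] = None) -> str:
    """Narrow down the fault from the results of each manual test.

    Pass None for a test that has not been run yet.
    """
    if works_after_reseat:
        return "loose connection"
    if works_in_dock is None:
        return "unknown: test the drive in a separate dock"
    if not works_in_dock:
        return "drive"
    if works_in_other_slot is None:
        return "unknown: try another slot on the controller"
    # The drive is fine on its own, so the fault is on the card side.
    return "socket" if works_in_other_slot else "controller or its driver"

print(likely_culprit(False, works_in_dock=True, works_in_other_slot=True))
```

The point is to change only one variable per test, so each result rules something out.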

Hi Jan,
I have all the Autodesk project files on the system drive, and I had saved the project prior to closing Flame…the crash happened when I was using Pro Tools. So I should have all the relevant data.
Only the cache/proxies are on the framestore, which is the NVMe, but all the source files are on a separate RAID.
The system drive is backed up to a Synology should I need it.
Hopefully I can get the SSD to show up, but if not I can direct Flame to a different mediastore and reconnect the files…like you say, I might get lucky and just lose a few hours.

I hate technology…

I really like the idea of moving the NVMe drives to an external enclosure. With any luck they’ll mount right up, and you’ll be working again, albeit a little slower.

First thing this morning I removed the RAID card, reseated each chip and re-installed it.

There is plenty of air around the card, and dust was at a minimum. The only odd thing I did spot was what looked like some kind of liquid staining on the chips…possibly it’s just discoloration from the chips heating up, as there is no other evidence of liquid in the Mac.

Anyway, I rebooted in Recovery, and Disk Utility showed three chips again, until I stretched out the window and realised that all four chips are showing and all say they are OK.

Restarted the Mac and yet again it says the disk is unreadable and cannot mount it…so now I am fairly certain that the RAID is OK and it’s an OS problem, and as such I am now re-installing Ventura in the hope it will resolve the disk mounting problem.

Best of luck Adam…

Good updates. Bad UI design to not have a scrollbar there…

Best of luck. Hopefully the OS re-install clears it up.

If the OS turns out to be the culprit, it may be worth keeping an SSD on hand for Time Machine. That would give you OS snapshots to restore from rather than having to start with a virgin OS.

And so it continues…

I’m just adding these steps in case anyone else has a similar issue…might just help.

Re-installed OS using Recovery Mode (took a couple of attempts as something was causing problems - not sure what.)

Restarted the Mac and eventually the screens all appear - and the missing drive is now mounted and can be opened…but the Wacom driver refuses to start or connect to the tablet.

So I restart again. This time I get the error about not being able to read the drive, and the Wacom driver does start up. As all of this started after Wacom froze, I’m now thinking it’s probably Wacom related.

So I have downloaded the latest driver and, before I reinstall that, I’m running a Time Machine backup of the system drive.

Well, that’s an interesting twist…

An I/O conflict of drivers. Even though one is PCIe and one is USB. Maybe instead of the latest driver go back a few months for the Wacom?

There used to be a time where Wacom drivers always had issues and you had to resist updating to the newest. But thought they had worked through that…

Also - do you use the Wacom via USB cable or Bluetooth? Not that either is wrong, but toggling that may be another useful variable, or way to get past the drama and back to work for now.

What’s odd is it’s been working without issue for six months since I installed the NVMe. I have not updated anything recently, so unless something else has corrupted the Wacom software and that’s causing a problem, I’m baffled.

It’s encouraging that the NVMe did appear for a while and I was able to mount it, open a folder or two and see content. If the RAID had failed I wouldn’t be able to do that.

always via cable - tried the wireless method years back and the lag was too much to cope with, made worse by the fact the mac is in a rack room and probably 7-8 feet away through a wall. Not great for the signals

Right, the BT is interesting but not as useful for most. Plus the charging hassle.

Since you mentioned that there’s a long distance between desk and server, maybe the issue could be with one of those cables? Maybe one got pinched or otherwise worn, or there’s electrical interference from some new piece of gear. Maybe just for testing move the Wacom closer to the Mac on a different (normal length) USB cable?

Years ago, after working perfectly well, my battery backup UPS caused my Trashcan to slow to a crawl and need a reboot every 30 min. Took a while to diagnose. Turns out I had unplugged the RS232 port from the UPS (not something you think much about). Their poorly developed driver then spawned itself by the hundreds, cluttering the USB bus to the point of all I/O becoming unresponsive. Took several days to finally connect the dots.

I’m mentioning this as sometimes totally unrelated small things can have unexpected consequences.

Grrrr…still not working. Have reinstalled the HighPoint driver but it’s still not mounting the drive. I think the next option is to reformat the system drive and install everything fresh!

At least I now have two separate clones of the drive so I should be able to recover all the Autodesk files.

Finally managed to get everything working again.
It involved erasing and reformatting the system drive, re-installing everything, removing several 3rd party extensions (LaCie RAID & Wacom), re-installing the HighPoint driver, then still getting nowhere…

Was getting really annoyed that both Recovery and Safe modes could see the NVMe, but not the standard boot mode. Eventually found a newer driver on HighPoint’s site that installed and allowed the NVMe to mount!

10 mins later an email from Highpoint tells me about the newer driver, but also some interesting info highlighted in red…

“We have had feedback from customers that the driver does not load when using the controller in Mac Pro 2019 in Slot4, if your SSD7140A is inserted to slot4 we recommend you replace it to slot3 or 5.”

Mine is currently in slot 4 so as soon as it’s finished the backup I’ll be swapping the card to a different slot.

edit: Thanks Jan, Sam, Waldi and BenV for all the help and advice on working through this problem.

First of all, glad to hear that this ordeal is nearing its end.

Also fascinating. Must be some PCIe lane assignment aspect that’s different about that slot. PCIe lanes create a lot of complexity for folks and are not very transparent.

It leaves the question of what changed to trigger it suddenly failing when it was fine for quite a while.

Seeing how it was visible immediately after the OS re-install, but then vanished after the two device drivers were added, I am pretty certain that it was the LaCie RAID Manager app that was at the root of the issue.

It’s also possible the problem was exacerbated when I thought I had loaded the driver for HighPoint, yet found out afterwards that it was not the latest version.

And Flame fired up first time and was seeing all my projects again…with no loss of project data!

Happy chappy, with four days of tech nightmares I won’t get back!
