OpenZFS crash course


@randy
Excellent news!
The system specs for each enclosure state 1150MB/s, so you're doing well.
Also of note: if you were previously running RAID 5 and then copying to another drive, you would have been running at half that speed.
So you've cheated the system: you've got stronger redundancy and your read speeds have doubled.
(RAID 5 uses one parity disk, so you'd have had 5x8TB of data plus one parity drive per enclosure, then all backed up again to the second enclosure - now you have 6x8TB in each enclosure working together!)
Great job!
As for the write speed - yes, it's shit - that's spinning disks for you. The way to mitigate it is to get the M.2 NVMe sticks for each enclosure.
That way writes get buffered (and as a neat byproduct the data gets laid out on the spinning disks much better because of the queue control).
So two big improvements today.
Your writes are probably doubled (or more) and your reads are doubled (or more).
So now comes the litmus test:
Do you add the NVMe as four individual cache devices?
(Which is the recommended way.)
Or do you RAID 0 them on the HighPoint and use that as one cache?
TBH it doesn’t matter either way.
If the hardware RAID 0 fails it doesn't corrupt your files in the zpool; remember, it's just a detachable cache - like a pressure washer on a garden hose.
If ZFS caching doesn't seem fast enough, then use the hardware RAID.
If ZFS caching is fast enough, then use 4x individual sticks and let ZFS decide.
Or you can try what we talked about on the phone: partition the hardware RAID and use part for cache and part for ZIL.
Make a hardware RAID of all four NVMe sticks, use say 7.2TB for cache, and use the remaining 800GB for ZIL.
And remember, you can add and detach cache and log devices in seconds with a single command line so you can experiment quickly.
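For example, something along these lines (the pool name matches the creation script further down; the disk numbers are placeholders, so check diskutil list for yours):

# Add four NVMe sticks as four individual cache (L2ARC) devices
sudo zpool add my-new-zpool cache /dev/disk13 /dev/disk14 /dev/disk15 /dev/disk16

# Or add an NVMe partition as a dedicated log (ZIL/SLOG) device
sudo zpool add my-new-zpool log /dev/disk13s2

# Detach a cache or log device again just as quickly
sudo zpool remove my-new-zpool /dev/disk13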
I’m so happy that you’re trying it.
You have many options (and all the other ones we talked about by phone)
Don't forget to link-aggregate your network connections too.
And look out for cheap 40Gb Ethernet NICs on eBay to run point-to-point between the two Macs.
Endless weekend fun.
How’s the gin?

This conversation has been utterly fascinating.


@randy
I sent you a PDF by email of some simple notes that I have used in the past to make OpenZFS volumes for Mac.
I’m unable to upload it from my phone or from the computer where I made it.
Would you kindly pop it in this chat for me?
Thanks.
It’s been fun talking with you about this stuff.
I hope Alan can chip in with his experience of OpenZFS on CentOS.
(Otherwise you'll be stuck with me gibbering on about half-terabytes of RAM and Oracle InfiniBand.)


Ah
Actually your write speed has dropped from over 500MB/s.
That’s disappointing.
Your read has only increased by 20%
Gin
Makes your brain don’t remember properly no more!!!
:slight_smile:

So $110 will fix the read speed - or maybe the NVMe partition will.

Today was amazing. Last night was the first time I've had the pleasure of connecting with @philm, and it's as if I was lost in a big city, stumbled into a stranger who turned out to be the mayor, and got an incredible tour of OpenZFS.

I'll take care of the PDF, @philm… I got you.

Many, many thanks.


@randy Seriously, this community is incredible… it harkens back to the vibe of flame-news.


@randy
The pleasure was all yours
No wait!
The pleasure was all mine!
Gin!
:joy::rofl:


In closing our 90-minute phone chat with @philm, I shared something along the lines of:

“Every week there’s something that happens that proves what we are doing is working.”

Week 32. Check.


@randy
Also I’m easily flattered by the notion of mayor.
I thought you were going to say overly complicated bus driver or worse.
:joy::rofl:

@randy
I think I know why the write speed is low:
The script I shared with you via email had compression set to lz4.
This means that the files get compressed before being written, which takes up more CPU time and more RAM, and slows down the write.
If you have little or nothing on the test volume then try building it again without the -O compression=lz4.
First of all, type zpool history.
This will present you with the command that you used to construct the pool.
Copy it but omit the compression parameter.
Also, you may want to switch the case sensitivity to sensitive.
In the PDF that I sent you are all the links you need to construct the zpool destroy command.
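To make that concrete, here's a rough sketch of the rebuild (same pool name as the creation script below; double-check the zpool history output, and that the pool really is empty, before destroying anything):

# Show the exact command used to build the pool
zpool history my-new-zpool

# Destroy the empty test pool
sudo zpool destroy my-new-zpool

# Recreate it without the compression option and with case sensitivity on
# (first two mirror pairs shown here; continue with the rest of your disks)
sudo zpool create -f -o ashift=12 -O casesensitivity=sensitive -O atime=off -O mountpoint=/mnt/my-new-zpool my-new-zpool mirror /dev/disk1 /dev/disk2 mirror /dev/disk3 /dev/disk4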
I will also email you post-zpool-creation commands to enable NFS sharing, Apple sharing, Samba sharing, etc., and examples of how to build ZFS filesystems that can have quotas and such.
That way you can enable permissions per ZFS filesystem and manage storage in a better way.
Also, there are a couple of utilities to install that permit snapshotting - think Time Machine but much better.
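To give a flavour of those post-creation commands (the filesystem name here is just an example; sharenfs and sharesmb are standard OpenZFS properties, though the exact sharing plumbing on macOS can differ):

# Create a child filesystem with its own quota and permissions
sudo zfs create -o quota=2T my-new-zpool/projects

# Turn sharing on per filesystem
sudo zfs set sharenfs=on my-new-zpool/projects
sudo zfs set sharesmb=on my-new-zpool/projects

# Snapshots are instant and can be listed, rolled back or browsed later
sudo zfs snapshot my-new-zpool/projects@before-the-experiment
zfs list -t snapshot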

@randy
OK, new info.
Forgotten tidbits.
New PDF.
All in your email inbox.


Basic steps…

  1. start the Mac.
  2. plug in ThunderBay 1.
  3. add a drive.
  4. open a terminal.
  5. run the command: diskutil list
  6. run the command: diskutil info diskXyZ (where diskXyZ is the identifier that shows up in the list, e.g. disk2s1)
  7. make a note of the physical disk position in the bay and add the volume UUID that showed up in the shell to a new TextEdit file or a spreadsheet.
  8. repeat steps 3 through 7 until you’ve mounted all the disks.
  9. plug in ThunderBay 2.
  10. repeat steps 3 through 7 until you’ve mounted all the disks.
  11. construct the zpool creation script in a text editor:

zpool create -f -o ashift=12 -O casesensitivity=sensitive -O atime=off -O aclmode=passthrough -O aclinherit=passthrough -o failmode=continue -o autoexpand=on -O mountpoint=/mnt/my-new-zpool my-new-zpool mirror /dev/disk1 /dev/disk2 mirror /dev/disk3 /dev/disk4 mirror /dev/disk5 /dev/disk6 mirror /dev/disk7 /dev/disk8 mirror /dev/disk9 /dev/disk10 mirror /dev/disk11 /dev/disk12

!!!Before you execute, you MUST replace each /dev/diskXX with either disk numbers that you have or with the UUID!!!

  12. paste the zpool creation script into your terminal.
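Once it finishes, a quick sanity check along these lines (same pool name as the script above) confirms the pool came up as six mirrored pairs:

zpool status my-new-zpool
zpool list my-new-zpool
zfs list my-new-zpool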

And here's @philm's document… here's a bit of the opening few lines for future searchers.

Here are some links, the majority of which are related to OpenZFSonOSX:

OpenZFS on OSX
Downloads
Install
FAQ
Creating a Zpool
Introducing ZFS Properties
Solaris ZFS Administration Guide
zpool ( 8 ) Linux man page
ZetaWatch
ZetaWatchZFSSnapshotting
ZFS Snapshot Tutorial
ZnapZend
Install ZnapZend on Mac with Brew
Four Ways to use ZFS Snapshots

This document contains links to other documentation, opinion editorials and other sources of information that I used to build OpenZFS RAID systems for use on OSX/macOS for my own personal use. I am not offering any warranty, expressed or implied.

Use this information at your own risk & make archives and backups before you experiment!

OpenZFS-on-OSX-notes-v2.pdf (83.4 KB)

I skimmed this thread. Here are some notes.

ZFS compression usually is not the bottleneck. In fact it can actually be faster than writing uncompressed. Take for instance a RAW frame store file: if that is a matte or some other type of image, it can compress well. Let's say it's a 30MB raw file but it compresses to 10MB - you have just reduced your data load by 2/3. That is 2/3 less to write to disk, and 2/3 less to read. If you are doing graphics, this will save you tons of space. LZ4 and the new Zstd (available in ZFS 2.0) can compress several thousand MB a second on modern hardware, plus they are also designed to give up trying to compress data if the first few percent don't compress well.

You can turn compression on or off and even change algorithms at any time, but only new data will take on the new setting. Already-written data stays as it was.
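For example (the dataset name here is just a placeholder):

# Check the current algorithm and the ratio actually achieved
zfs get compression,compressratio tank/frames

# Change it whenever you like; only newly written data uses the new setting
sudo zfs set compression=zstd tank/frames   # needs ZFS 2.0+
sudo zfs set compression=off tank/frames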

At least in my experience with CentOS & ZFS, it was consistently about 3-4x slower than hardware RAID and XFS.

ZFS snapshots are amazing, and totally saved my ass once, when I accidentally erased the whole NAS overnight. In the context of a frame store, I don’t think snapshots matter much though.

If you only need ~1GB/sec then ZFS can do it, hardware dependent. Anything more than that, you need to have tons of RAM to cache the totality of the content you want in RAM, and you need to tweak the config files appropriately, as it normally does not cache sequential reads.
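On CentOS those tweaks live in the ZFS kernel module options; roughly this kind of thing (the values are only illustrative, not a recommendation):

# /etc/modprobe.d/zfs.conf
# Let the ARC grow to ~200GB of RAM (value is in bytes)
options zfs zfs_arc_max=214748364800
# Allow prefetched (sequential) streams to be cached in the L2ARC too
options zfs l2arc_noprefetch=0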

I've set up ZFS NASes with the ideal hardware - 120 spindles, SSD L2ARC, half a terabyte of RAM, NVMe ZIL - and it was always just a massive disappointment in performance. Set up the same hardware as hardware RAID with XFS and we get 5GB/sec. So now all our NASes are built that way.

@ALan
Nice
Hardware RAID can always be faster.
This thread started because of software RAID.

@ALan
I set up a 120-spindle system back in 2016 that served 20 workstations with no downtime.
You came round to view it.
It wasn't possible at the time to get much more out of it than 1GB/s because it only had one 10Gb Ethernet port.
Eventually we got it to 4GB/s over 4x10Gb bonded.

Prior to that, I set up a 120-spindle Oracle system in 2015 that churned out 8GB/s over bonded InfiniBand to a Z820 with no framestore.
The Oracle engineer guaranteed 32GB/s from that system.
But the price tag was nearly half a million US.
The 3TB of RAM per head end was close to $250k at that time.

Yep, as long as it is coming from RAM, everything is speedy. It's the coming-off-disk part where ZFS is slow.