Major features added in this release:
- RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.
- Fast Dedup: A major performance upgrade to the original OpenZFS deduplication functionality.
- Direct IO: Allows bypassing the ARC for reads/writes, improving performance in scenarios like NVMe devices where caching may hinder efficiency.
- JSON: Optional JSON output for the most used commands.
- Long names: Support for file and directory names up to 1023 characters.
More specifically:
> A new device (disk) can be attached to an existing RAIDZ vdev
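With 2.3 that attach is a single command; a minimal sketch with hypothetical pool, vdev, and device names:

    # expand an existing raidz vdev by one disk; progress is reported by zpool status
    zpool attach tank raidz1-0 /dev/sdf
    zpool status tank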
2025-Jan-14-1258.93743_Experiment-2345_Gas-Flow-375.3_etc_etc.dat
But to me this stuff should be in metadata. It's just that we don't have great tools for grepping the metadata.
Heck, the original Macintosh FS had no subdirectories - they were faked by burying subdirectory names in the (flat filesystem) filename. The original Macintosh File System (MFS) did not support true hierarchical subdirectories; instead, the illusion of subdirectories was created by embedding folder-like names into the filenames themselves.
This was done by using colons (:) as separators in filenames. A file named Folder:Subfolder:File would appear to belong to a subfolder within a folder. This was entirely a user interface convention managed by the Finder. Internally, MFS stored all files in a flat namespace, with no actual directory hierarchy in the filesystem structure.
So, there is 'utility' in "overloading the filename space". But...
1023-byte names can mean fewer than 256 characters due to the use of Unicode and UTF-8. Add Unicode normalization, which might "expand" some characters into two or more combining characters, deliberate use of combining characters, emoji, and rare characters, and you might end up with many "characters" taking more than 4 bytes. A single "country flag" character will usually be 8 bytes, most emoji will be at least 4 bytes, skin tone modifiers will add another 4 bytes, etc.
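A quick way to check byte counts yourself (the sample strings below are my own, not the ones from this thread):

    # count the UTF-8 bytes a name would occupy on disk
    printf '%s' '🇯🇵' | wc -c                      # a single flag: two 4-byte regional indicators = 8 bytes
    printf '%s' 'あ' | wc -c                       # one kana: 3 bytes
    printf '%s' 'ロビンソン・クルーソー' | wc -c   # a short katakana title: 33 bytes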
this ' ' takes 27 bytes in my terminal, '' takes 28, another combo I found is 35 bytes.
And that's on top of just getting a long title in, say, CJK or another less common script - an early manuscript of a somewhat successful Japanese novel has a non-normalized filename of 119 bytes, and that's nowhere close to actually long titles, something that someone might reasonably have on disk. A random find on the internet easily points to a book title that takes over 300 bytes in non-normalized UTF-8.
P.S. proper title of "Robinson Crusoe" if used as filename takes at least 395 bytes...
The first was "man+woman kissing" with a skin tone modifier, then there were a few flags.
https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
It works by making a read-only copy of the vdev being removed inside the remaining space. The existing vdev is then removed. Data can still be accessed from the copy, but new writes go to an actual vdev, and space on the copy is gradually reclaimed as the old data is no longer needed.
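In command form, device evacuation looks something like this (pool and vdev names are illustrative):

    # evacuate and remove a top-level vdev (works for mirror/single-disk vdevs, not raidz)
    zpool remove tank mirror-1
    zpool status tank    # shows the evacuation, and afterwards an "indirect" vdev holding the remapped blocks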
Not really sure if that's true. They seem like two different/distinct use cases, though there's probably some small overlap.
If you can convince btrfs balance not to use the device you want to remove, it will simply rebalance data to the other devices, and then you can btrfs device remove.
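In practice btrfs device remove handles the relocation itself; a sketch with illustrative device and mount-point names:

    btrfs device remove /dev/sdX /mnt/pool
    # or, by device id
    btrfs device remove 3 /mnt/pool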
Perhaps I am misunderstanding you, but you can offline and remove drives from a ZFS pool.
Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?
In windows, you can set a disk for removal, and as long as the other disks have enough space and are compatible with the virtual disks (eg you need at least 5 disks if you have parity with number of columns=5), it will rebalance the blocks onto the other disks until you can safely remove the disk. If you use thin provisioning, you can also change your mind about the settings of a virtual disk, create a new one on the same pool, and move the data from one to the other.
Mdadm/lvm will do the same, albeit with more of a pain in the arse, as RAID requires resilvering not just the occupied space but also the free space, so it takes a lot more time and IO than it should.
It's one of my beefs with ZFS: there are lots of no-return decisions. That, and I ran into some race conditions when loading a ZFS array at boot with NVMe drives on Ubuntu. They seem to not be ready yet, resulting in randomly degraded arrays. Fixed by loading the pool with a delay.
This kinda implies that you can't actually remove data vdevs, because in practice you can't rewrite all references. You also can't do offline deduplication without rewriting references (i.e. actually touching the files in the filesystem). And that's why ZFS can't deduplicate snapshots after the fact.
On the other hand, reshaping a vdev is possible, because that "just" requires shuffling the vblock -> physical block associations inside the vdev.
https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
mdadm can convert RAID-5 to a larger or smaller RAID-5, RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or the other way around, RAID-0 to a degraded RAID-5, and many other fairly reasonable operations, while the array is online, resistant to power loss and the likes.
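For reference, a couple of those reshapes as plain mdadm invocations (array name and backup-file paths are illustrative):

    mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.bak         # grow a RAID-5 by one member
    mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/root/md0-56.bak  # convert RAID-5 to RAID-6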
I wrote the first version of this md code in 2005 (against kernel 2.6.13), and Neil Brown rewrote and mainlined it at some point in 2006. ZFS is… a bit late to the party.
By the way, what does MD do when there is corrupt data on disk that makes it impossible to know what the correct reconstruction is during a reshape operation? ZFS will know what file was damaged and proceed with the undamaged parts. ZFS might even be able to repair the damaged data from ditto blocks. I don’t know what the MD behavior is, but its options for handling this are likely far more limited.
I don't know what md does if the parity doesn't match up, no. (I've never ever had that happen, in more than 25 years of pretty heavy md use on various disks.)
Reshaping traditional storage stacks does not consider all of the ways things can go wrong. Handling all of them well is hard, if not impossible to do in traditional RAID. There is a long history of hardware analogs to MD RAID killing parity arrays when they encounter silent corruption that makes it impossible to know what is supposed to be stored there. There is also the case where things are corrupted such that there is a valid reconstruction, but the reconstruction produces something wrong silently.
Reshaping certainly is easier to do with MD RAID, but the feature has the trade off that edge cases are not handled well. For most people, I imagine that risk is fine until it bites them. Then it is not fine anymore. ZFS made an effort to handle all of the edge cases so that they do not bite people and doing that took time.
Yet people are celebrating when ZFS adds it. Was it all for nothing?
Those corruption issues I mentioned, where the RAID controller has no idea what to do, affect far more than just reshaping. They affect traditional RAID arrays when disks die and when patrol scrubs are done. I have not tested MD RAID on edge cases lately, but the last time I did, I found MD RAID ignored corruption whenever possible. It would not detect corruption in normal operation because it assumed all data blocks are good unless SMART said otherwise. Thus, it would randomly serve bad data from corrupted mirror members and always serve bad data from RAID 5/6 members whenever the data blocks were corrupted. This was particularly tragic on RAID 6, where MD RAID is hypothetically able to detect and correct the corruption if it tried. Doing that would come with such a huge performance overhead that it is clear why it was not done.
Getting back to reshaping, while I did not explicitly test it, I would expect that unless a disk is missing or disappears during a reshape, MD RAID would ignore any corruption that can be detected using parity and assume all data blocks are good just like it does in normal operation. It does not make sense for MD RAID to look for corruption during a reshape operation, since not only would it be slower, but even if it finds corruption, it has no clue how to correct the corruption unless RAID 6 is used, there are no missing/failed members and the affected stripe does not have any read errors from SMART detecting a bad sector that would effectively make it as if there was a missing disk.
You could do your own tests. You should find that ZFS handles edge cases where the wrong thing is in a spot where something important should be gracefully while MD RAID does not. MD RAID is a reimplementation of a technology from the 1960s. If 1960s storage technology handled these edge cases well, Sun Microsystems would not have made ZFS to get away from older technologies.
It's the first release with the code, so "safely" might not be the right description until a few point releases happen. ;)
Is the ZFS team handling encryption as a first class priority at all?
ZFS on Linux inherited a lot of fame from ZFS on Solaris, but everyone using it in production should study the issue tracker very well for a realistic impression of the situation.
The way ZFS encryption is layered, the features should be pretty much orthogonal to each other, but I'll admit that there's a bit lacking with ZFS native encryption (though mainly in upper-layer tooling, in my experience, rather than the actual on-disk encryption parts).
There have been many man hours put into investigating bug reports involving encryption and some fixes were made. Unfortunately, something appears to be going wrong when non-raw sends of encrypted datasets are received by another system:
https://github.com/openzfs/zfs/issues/12014
I do not believe anyone has figured out what is going wrong there. It has not been for lack of trying. Raw sends from encrypted datasets appear to be fine.
How come Windows still uses a 32-year-old file system?
ZFS has license issues with Linux, preventing full integration, and Btrfs is 15 years in the making and still doesn't match ZFS in features and stability.
Most Linux distros still use ext4 by default, which is 19 years old, but ext4 is little more than a series of extensions on top of ext2, which is the same age as NTFS.
In all fairness, there are few OS components that are as critical as the filesystem, and many wouldn't touch filesystems that have less than a decade of proven track record in production.
But you must admit that the situation on Linux is quite a bit better than on Windows. Linux has so many filesystems in the main branch. There is a lot of development. BTRFS had a rocky start, but it got better.
* Step 1: Make friends with a ZFS developer.
* Step 2: Guilt him into writing patches to add support as soon as a new kernel is released.
* Step 3: Enjoy
Adding support for a new kernel release to ZFS is usually only a few hours of work. I have done it in the past more than a dozen times.
https://wiki-dev.cachyos.org/sk/cachyos_repositories/how_to_...
Happened to me more than once. I ended up manually changing the kernel version limitations the second time just to get me back online, but I don’t recall if that ended up hurting me in the long run or not.
It’s still not supported by Proxmox. Yes, you can do it yourself somehow, but then you are on your own and miss features, and people report problems with double or triple file system layers.
I do not understand how they do not have encryption out of the box; this seems like a problem.
> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6.
> There are some implementation and design deficiencies that make it unreliable for some corner cases and *the feature should not be used in production, only for evaluation or testing*.
> The power failure safety for metadata with RAID56 is not 100%.
I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive. I've used either mdadm + ext4 (for /) or ZFS (for large /data mounts) ever since. ZFS is fantastic and I genuinely don't understand why it's not used more widely.
ZFS goes out of its way to avoid the buffer cache, although Linux does not give it the option to fully opt out of it since the block layer will buffer reads done by userland to disks underneath ZFS. That is why ZFS began to purge the buffer cache on every flush 11 years ago:
https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c3b6...
That is how it still works today:
https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4cef...
If I recall correctly, the page cache is also still above ZFS when mmap() is used. There was talk about fixing it by having mmap() work out of ARC instead, but I don’t believe it was ever done, so there is technically double caching done there.
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
At the time, I had it this way because I had fear of OOM events causing [at least] unexpected weirdness.
A few months ago I discovered weird issues with a fairly big, persistent L2ARC being ignored at boot due to insufficient ARC. So I stopped arbitrarily limiting zfs_arc_max and just let it do its default self-managed thing.
So far, no issues. For me. With my workload.
Are you having issues with this, or is it a theoretical problem?
For a NAS setup I would still prefer ZFS with truenas scale (or proxmox if virtualization is needed), just because all these scenarios are supported as well. And as far as ZFS goes, encryption is still something I am not sure about especially since I want to use snapshots sending those as a backup to remote machine.
RAID1 is just making literal copies, so each additional drive in a RAID1 is a self-sufficient copy. You want two drives of fault tolerance? Use three drives, so if you lose two copies you still have one left.
This is of course hideously inefficient as you scale larger, but that is not the question posed.
It's not just inefficient, you literally can't scale larger. Mirroring is all that RAID 1 allows for. To scale, you'd have to switch to RAID 10, which doesn't allow two-disk fault tolerance (you can get lucky if they are in different mirror pairs, but that isn't fault tolerance).
But you're right - RAID 1 also scales terribly compared to RAID 6, even before introducing striping. Imagine you have 6 x 16 TB disks:
With RAID 6, usable space of 64 TB, two-drive fault tolerance.
With RAID 1, usable space of 16 TB, five-drive fault tolerance.
With RAID 10, usable space of 48 TB, one-drive fault tolerance.
Me, too. The drive was unrecoverable. I had to reinstall from scratch.
https://www.postgresql.org/message-id/CA%2BhUKGL-sZrfwcdme8j...
I know Synology Hybrid RAID is a clever use of LVM + MD raid, for example.
Related blog post: https://daltondur.st/syno_btrfs_1/
Isn't the feature in question (array expansion) precisely one which btrfs already had for a long time? Does ZFS have the opposite feature (shrinking the array), which AFAIK btrfs also already had for a long time?
(And there's one feature which is important to many, "being in the upstream Linux kernel", that ZFS most likely will never have.)
And shrinking no, that is a big missing feature in ZFS IMO. Understandable considering its heritage (large scale datacenters) but nevertheless an issue for home use.
But raidz is rock-solid. Btrfs' raid is not.
I've used it on multi-drive arrays but I never had the need for expansion.
https://openzfs.github.io/openzfs-docs/Getting%20Started/index.html
ZFS runs on all major Linux distros, the source is compiled locally, and there is no meaningful license problem. In datacenter and "enterprise" environments we compile ZFS "statically" with other kernel modules all the time.
For over six years now, there has been an "experimental" option presented by the graphical Ubuntu installer to install the root filesystem on ZFS. Almost everyone I personally know (just my anecdote) chooses this "experimental" option. There has been an occasion here and there of ZFS snapshots taking up too much space, but other than that there have not been any problems.
I statically compile ZFS into a kernel that intentionally does not support loading modules on some of my personal laptops. My experience has been great, others' mileage may (certainly will) vary.
However, ext4 and XFS are much simpler and more performant than BTRFS and ZFS as root drives on personal systems and small servers.
I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is.
You might also find ZFS outperforms both of them in read workloads on single disks where ARC minimizes cold cache effects. When I began using ZFS for my rootfs, I noticed my desktop environment became more responsive and I attributed that to ARC.
https://www.percona.com/blog/mysql-zfs-performance-update/
In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:
https://www.percona.com/blog/about-zfs-performance/
This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.
That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.
The article provides great insight into optimizing ZFS, but using an EBS volume as the store (with pretty poor IOPS) and then giving the NVMe to ZFS as a metadata cache only feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we set up XFS with metadata and log on a RAMFS it will beat ZFS :)
If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at price point that XFS cannot match is competition, not cheating.
ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:
https://www.percona.com/blog/zfs-for-mongodb-backups/
Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?
About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.
> If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.
What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.
Edit: P4800x, actually. The flash disk are D5-P5530.
That makes sense.
> The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.
It is a balancing act. It is a feature ZFS has that XFS does not, but it is ridiculous to use a device that can fit the entire database as L2ARC, since in that case you could just use that device directly, and keeping it as a cache for ZFS does not make for a fair or realistic comparison. Fast devices that can be used with tiered storage are generally too small to be used as main storage, since if you could use them as main storage, you would.
With the caveat that the higher tier should be too small to be used as main storage, you can get a huge boost from being able to use it as cache in tiered storage, and that is why ZFS has L2ARC.
> What does proper testing here mean? And what does "if you scale it" mean?
Let me preface my answer by saying that doing good benchmarks is often hard, so I can't give a simple answer here. However, I can give a long answer.
First, small databases that can fit entirely in RAM cache (be it the database's own userland cache or a kernel cache) are pointless to benchmark. In general, anything can run that well (since it is really running out of RAM as you pointed out). The database needs to be significantly larger than RAM.
Second, when it comes to using tiered storage, the purpose of doing tiering is that the faster tier is either too small or too expensive to use for the entire database. If the database size is small enough that it is inexpensive to use the higher tier for general storage, then a test where ZFS gets the higher tiered storage for use as cache is neither fair nor realistic. Thus, we need to scale the database to a larger size such that the higher tier being only usable as cache is a realistic scenario. This is what I had in mind when I said "if you scale it".
Third, we need to test workloads that are representative of real things. This part is hard and the last time I did it was 2015 (I had previously said 2016, but upon recollection, I realized it was likely 2015). When I did, I used a proprietary workload simulator that was provided by my job. It might have been from SPEC, but I am not sure.
Fourth, we need to tune things properly. I wrote the following documentation years ago describing correct tuning for ZFS:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
At the time I wrote that, I omitted that tuning the I/O elevator can also improve performance, since there is no one size fits all advice for how to do it. Here is some documentation for that which someone else wrote:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
If you are using SSDs, you could probably just get away with setting each of the maximum asynchronous queue depth limits to something like 64 (or even 256) and benchmark that.
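On Linux that would look something like the following (a sketch; these are the OpenZFS module parameters for the async queues, and the right values are workload-dependent):

    echo 64 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
    echo 64 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active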
> From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting.
In 2015 when I did database benchmarks, ZFS and XFS were given equal hardware. The hardware was a fairly beefy EC2 instance with 4x high end SSDs. MD RAID 0 was used under XFS while ZFS was given the devices in what was effectively a RAID 0 configuration. With proper tuning (what I described earlier in this reply), I was able to achieve 85% of XFS performance in that configuration. This was considered a win due to the previously stated reason of performance under database backups. ZFS has since had performance improvements done, which would probably narrow the gap. It now uses B-Trees internally to do operations faster and also now has redundant_metadata=most, which was added for database workloads.
Anyway, on equal hardware in a general performance comparison, I would expect ZFS to lose to XFS, but not by much. ZFS' ability to use tiered storage and do low overhead backups is what would put it ahead.
> What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.
You need to have a database whose size is so big that optane storage is not practical to use for main storage. Then you need to setup ZFS with Optane storage as L2ARC. You can give regular flash drives to ZFS and XFS on MD RAID in a comparable configuration (RAID 0 to make life easier, although in practice you probably want to use RAID 10). You will want to follow best practices for tuning the database and filesystems (although from what I know, XFS has remarkably few knobs). You could give XFS the optane devices to use for metadata and its journal for fairness, although I do not expect it to help XFS enough. In this situation, ZFS should win on performance.
You would need to pick a database for this. One option would be PostgreSQL, which is probably the main open source database that people would scale to such levels. The pgbench tool likely could be used for benchmarking.
https://www.postgresql.org/docs/current/pgbench.html
You would need to pick a scaling factor that will make the database big enough and do a workload simulating a large number of clients (what is large is open to interpretation).
Finally, I probably should add that the default script used by pgbench probably is not very realistic for a database workload. A real database will have a good proportion of reads from select queries (at least 50%) while the script that is being used does a write mostly workload. It probably should be changed. How is probably an exercise best left for a reader. That is not the answer you probably want to hear, but I did say earlier in this reply that doing proper benchmarks is hard, and I do not know offhand how to adjust the script to be more representative of real workloads. That said, there is definite utility in benchmarking write mostly workloads too, although that utility is probably more applicable for the database developers than as a way to determine which of two filesystems is better for running the database.
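As a concrete starting point, a pgbench sketch along those lines (database name, scale, and client counts are arbitrary placeholders; pick a scale so the database dwarfs RAM):

    createdb benchdb
    pgbench -i -s 10000 benchdb            # initialize; scale 10000 is roughly 150 GB
    pgbench -c 64 -j 8 -T 1800 benchdb     # 64 clients, 8 worker threads, 30 minutes
    # swap in a custom script (-f) once you have one that better matches a read-heavy mix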
> "I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is." (emphasis mine)
We are no strangers to filesystems. I personally benchmarked a ZFS7320 extensively, writing a characterization report, plus we have had a ZFS7420 for a very long time, complete with separate log SSDs for read and write on every box.
However, ZFS is not saturation proof, plus it is nowhere near a Lustre cluster performance-wise, when scaled.
What kills ZFS and BTRFS on desktop systems is write performance, esp. on heavy workloads like system updates. If I needed a desktop server (performance-wise), I'd configure it accordingly and use these, but I'd never use BTRFS or ZFS on a single root disk due to their overhead, to reiterate myself thrice.
I am one of the OpenZFS contributors (although I have been less active of late). If you bring some deficiency to my attention, there is a chance I might spend the time needed to improve upon it.
By the way, ZFS limits the outstanding IO queue depth to try to keep latencies down as a type of QoS, but you can tune it to allow larger IO queue depths, which should improve write performance. If your issue is related to that, it is an area that is known to be able to use improvement in certain situations:
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
I'll be installing a 2-disk OpenZFS RAID1 volume on an SBC for high-value files soon-ish, and I might do some tests on that when it's up. Honestly, I don't expect stellar performance since I'll already be putting it on constrained hardware, but I'll let you know if I experience anything that doesn't feel right.
Thanks for the doc links, I'll be devouring them when my volume is up and running.
Where do you prefer your (bug and other) reports? GitHub? E-mail? IP over Avian Carriers?
In the case of APT, it should install all of the files and then call sync() once. This is equivalent to calling fsync on every file like APT currently does, but aggregates it for efficiency. The reason APT does not use sync() is probably a portability thing, because the standard does not require sync() to be blocking, but on Linux it is:
https://www.man7.org/linux/man-pages/man2/sync.2.html
From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package. Thus it does not really matter for power loss protection if you are using fsync() on all files or sync() once for all files, since what must happen next to fix it is the same. However, from a performance perspective, it really does matter.
That said, slow fsync performance generally is not an issue for desktop workloads because they rarely ever use fsync. APT is the main exception. You are the first to complain about APT performance in years as far as I know (there were fixes to improve APT performance 10 years ago, when its performance was truly horrendous).
You can file bug reports against ZFS here:
https://github.com/openzfs/zfs
I suggest filing a bug report against APT. There is no reason for it to be doing fsync calls on every file it installs in the filesystem. It is inefficient.
> From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package.
Yes, but at least you have all the files; otherwise you can have 0-length files, which can prevent you from booting your system. In this case, your system boots, all files are in place, but some packages are in a semi-configured state. Believe me, apt can recover from many nasty corners without any ill effects as long as all the files are there. I used to be a tech lead for a Debian derivative back in the day, so I lived in the trenches in Debian for a long time and I have seen things.
Again, it has been decided that the massive sync will stay in place for now, because the risks involved in the wild don't justify the performance difference yet. If you prefer to be reckless, there are "eatmydata" and "--force-unsafe-io" options baked in already.
Thanks for the links, I'll let you know if I find something. I just need to build the machine from the parts I have, then I'll be off to the races.
[0]: https://lists.debian.org/debian-devel/2024/12/msg00533.html [warning, long thread]
https://lists.debian.org/debian-devel/2024/12/msg00540.html
It claims that the fsync is needed to avoid the file appearing at the final location with a zero length after a power loss. This is not true on ZFS.
ZFS puts every filesystem operation into a transaction group that is committed atomically about every 5 seconds by default. On power loss, the transaction group either succeeds or never happens. The result is that even without using fsync, there will never be a zero length file at the final location because the rename being part of a successful transaction group commit implies that the earlier writes also were part of a successful transaction group commit.
The result is that you can use --force-unsafe-io with dpkg on ZFS, things will run faster and there should be no issues for power loss recovery as far as zero length files go.
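For reference, one way to do that (a sketch; the drop-in file name is arbitrary):

    # one-off, via apt
    apt-get -o DPkg::Options::="--force-unsafe-io" install some-package
    # or persistently for dpkg
    echo force-unsafe-io > /etc/dpkg/dpkg.cfg.d/unsafe-io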
The following email mentions that sync() had been used at one point but caused problems when flash drives were connected, so it was dropped:
https://lists.debian.org/debian-devel/2024/12/msg00597.html
The timeline is unclear, but I suspect this happened before Linux 2.6.29 introduced syncfs(), which would have addressed that. Unfortunately, it would have had problems for systems with things like a separate /usr mount, which requires the package manager to realize multiple syncfs calls are needed. It sounds like dpkg was calling sync() per file, which is even worse than calling fsync() per file, although it would have ensured that the directory entries for prior files were there following a power loss event.
The email also mentions that fsync is not called on directories. The result is that a power loss event (on any Linux filesystem, not just ZFS) could have the files missing from multiple packages marked as installed in the package database, which is said to use fsync to properly record installations. I find this situation weird since I would use sync() to avoid this, but if they are comfortable having systems have multiple “installed” packages missing files in the filesystem after a power loss, then there is no need to use sync().
Also, given that you seem quite knowledgeable of the topic, what is your go-to backup solution?
I initially thought about storing `zfs send` files on Backblaze (as a backup at a different location), but without recv-ing them, I don't think the usual checksumming works properly. I can checksum the whole thing before and after updating, but I'm not convinced this is the best solution.
DPKG will write each file to a temporary location and then rename it to the final location. On ext4, without fsync, when a power loss event occurs, it is possible for the rename to the final location to be done, without any of the writes such that you have a zero length file. On ZFS, the rename being done after the writes means that the rename being done implies the writes were done due to the sequential nature of ZFS’ transaction group commit, so the file will never appear in the final location without the file contents following a power loss event, which is why ZFS does not need the fsync there.
https://openzfsonosx.org/wiki/Main_Page
It was actually the NetApp lawsuit that caused problems for Apple’s adoption of ZFS. Apple wanted indemnification from Sun because of the lawsuit, Sun’s CEO did not sign the agreement before Oracle’s acquisition of Sun happened and Oracle had no interest in granting that, so the official Apple port was cancelled.
I heard this second hand years later from people who were insiders at Sun.
While third-party ports are great, they lack deep integration that first-party support would have brought (non-kludgy Time Machine which is technically fixed with APFS).
It was killed because Apple and Sun couldn't agree on a 'support contract'. From Jeff Bonwick, one of the co-creators of ZFS:
>> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.
> I cannot disclose details, but that is the essence of it.
* https://archive.is/http://mail.opensolaris.org/pipermail/zfs...
Apple took DTrace, licensed via CDDL—just like ZFS—and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.
Nah, it was Jobs' ego, not the license:
>>Only one person at Steve Jobs' company announces new products: Steve Jobs.
https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
But I think these things are usually a combination. When a business relationship sours, agreements are suddenly much harder to work out. The negotiators are still people and they have feelings that will affect their decisionmaking.
What's "new" about the "new" API? Its symbols are GPL2 only to deny it's use to non-GPL2 modules (like ZFS). Guess that's an easy way to make sure that BTRFS is faster than ZFS or set yourself up as the (to be) injured party.
Of course a reimplementation of the old API in terms of the new is an evil "GPL condom" violating the kernel license, right? Why can't you see ZFS's CDDL license is the real problem here for being the wrong flavour of copyleft license. Way to claim the moral high ground, you short-sighted, bigoted pricks. sigh
ZFS modules are not in the official repos. You either have to compile them on each machine or use unofficial repos, which is not exactly ideal and can break things if those repos are not up to date. And I guess it also needs some additional steps for a Secure Boot setup on some distros?
I really want to try zfs because btrfs has some issues with RAID5 and RAID6 (it is not recommended so I don't use it) but I am not sure I want to risk the overall system stability, I would not want to end up in a situation where my machines don't boot and I have to fix it manually.
I have never had data corruption issues with ZFS, but I have had both xfs and ext4 destroy entire discs.
Why use zfs for a boot partition? Unless you're using every disk mounting point and nvme slot for a single large raid array, you can use a cheap 512GB nvme drive or old spare 2.5" ssd for the boot volume. Or two, in btrfs raid1 if you absolutely must... but do you even need redundancy or datasum (which can hurt performance) to protect OS files? Do you really care if static package files get corrupted? Those are easily reinstalled, and modern quality brand SSDs are quite reliable.
I want to use raid 5 for the large storage mount point that holds non-OS files. I want both space and redundancy. Currently I have several separate raid1 btrfs mounts since it is recommended against raid5.
The reason for this is not just to piss off non-GPL module developers. GPL-only internal APIs are subject to change without notice, even more so than the rest of the kernel. And because the licence may not allow the Linux kernel developers to make the necessary changes to the module when it happens, there is a good chance it breaks without warning.
And even with that, all internal APIs may change, it is just a bit less likely than for the GPL-only ones, and because ZFS on Linux is a separate module, there is no guarantee for it to not break with successive Linux versions, in fact, it is more like a guarantee that it will break.
Linux is proudly monolithic, and as constantly evolving a monolithic kernel, developers need to have control over the entire project. It is also community-driven. Combined, you need rules to have the community work together, or everything will break down, and that's what the GPL is for.
Fedora packages zfs-fuse. I think some distros have arrangements to make sure their kernels have ZFS support. It may be less of a headache on those.
In-tree filesystems don't break that way.
Btrfs also supports async/offline dedupe
You can also layer it on top of mdadm. IIRC, ZFS strongly discourages using anything but directly attached physical disks.
Simple. Because most of the burden is taken by the (enterprise) storage hardware hosting the FS. Snapshots, block level deduplication, object storage technologies, RAID/Resiliency, size changes, you name it.
Modern storage appliances are black magic, and you don't need much more features from NTFS. You either transparently access via NAS/SAN or store your NTFS volumes on capable disk boxes.
On the Linux world, at the higher end, there's Lustre and GPFS. ZFS is mostly for resilient, but not performance critical needs.
Los Alamos disagrees ;)
https://www.lanl.gov/media/news/0321-computational-storage
But yes, in general you are right, Cern for example uses Ceph:
https://indico.cern.ch/event/1457076/attachments/2934445/515...
CERN's Ceph is also for their "General IT" needs. Their clusters are independent from that. Also, most of CERN's processing is distributed across Europe. We are part of that network.
Many, if not all, of the HPC centers we talk with use Lustre as their "immediate" storage. Also, there's Weka now, a closed-source storage system supporting insane speeds and tons of protocols at the same time. It's mostly used for and by GPU clusters around the world. You casually connect terabits to such a cluster. It's all flash, and flat out fast.
I have all my backups on a NAS in the next room. This covers the vast majority of use cases for backups, but if my house burns down everything is lost. I know I'm taking that risk, but really I should have better. Just paying someone to do it all in the cloud should be better for me as well and I keep thinking I should do this.
Of course paying someone assumes they will do their job. There are always incompetent companies out there to take your money.
I lost faith in most paid operators. Whoops, this thing that absolutely can happen to home users and we're supposed to protect them from now actually happened to us and we were not prepared. We're so sorry!
Nah. Give me access to 5-15 cloud storage accounts, I'll handle it myself. Have done so for years.
I trust my own backups much more than any subscription, not essentially from a technical point of view, but from an access point of view (e.g. losing access to your Google account).
EDIT: You have to enable check-summing and/or compression for data on ReFS manually
https://learn.microsoft.com/en-us/windows-server/storage/ref...
I personally use cloud storage extensively, but I keep a local version with periodic rclone/borg. It allows me access from everywhere and lets me sleep well at night.
Just because someone is a private user doesn't mean that the data is less important, often it's quite the opposite, for example a family album vs your cloned git repository.
Your average grandma would use ext4 when using Linux. Android phones don't do it either, and I don't know about iOS, but apparently APFS only does metadata checksumming.
So, I think that your answer for this question is "unfortunately, yes".
Not that I support the situation.
ZFS is not in the same market with AWS S3.
The mainline Linux kernel doesn't either, and I think the answer is because it's hard and high risk with a return mostly measured in technical respect?
But considering it's had two drama events within a year of getting merged... I think we can safely confirm your conclusion that it is really hard.
bcachefs doesn't implement its erasure coding/RAID yet? Doesn't implement send/receive. Doesn't implement scrub/fsck. See: https://bcachefs.org/Roadmap, https://bcachefs.org/Wishlist/
btrfs is still more of a legit competitor to ZFS these days and it isn't close to touching ZFS where it matters. If the perpetually half-finished bcachefs and btrfs are the "answer" to ZFS that seems like too little, too late to me.
It most definitely does have fsck and has since the beginning, and it's a much more robust and dependable fsck than btrfs's. Scrub isn't quite done - I actually was going to have it ready for this upcoming merge window except for a nasty bout of salmonella :)
Send/recv is a long ways off, there might be some low level database improvements needed before that lands.
Short term (next year or two) priorities are finishing off online fsck, more scalability work (upcoming version for this merge window will do 50PB, but now we need to up the limit on number of drives), and quashing bugs.
In general, due to the scope of the project, I've been prioritizing the functionality that's needed to validate the design and the parts that are needed for getting the relationships between different components correct.
e.g. recently I've been doing a bunch of work on backpointers scalability, and that plus scrub are leading to more back and forth iteration on minor interactions with erasure coding.
So: erasure coding is complete enough to know that it works and for people to torture test it, but yes you shouldn't be running it in production yet (and it's explicitly marked as such). What's remaining is trivial but slightly tedious stuff that's outside the critical path of the rest of the design.
Some of the code I've been writing for scrub is turning out to also be what we want for reconstruct, so maybe we'll get there sooner rather than later...
Did the Linux Foundation send you some "free" sushi? ;)
However, keep the good work rolling; I'm super happy about a good, usable, and modern filesystem native to Linux.
Hope that's coming this year. I have a bunch of old HDDs and SSDs and I could very easily assemble a spare storage server with about 4TB capacity. Already tested bcachefs with most of the drives and it performed very well.
Also, the lack of the ability to reconstruct seems like another worrying omission.
That should be a completely trivial for bcachefs to support, it'll mostly just be a matter of finding or writing the tests.
If you or others can get it done I'm absolutely starting to use bcachefs the month after. I do need fast storage servers in my home office.
As for ZFS, I'll be buying some extra drives later this year and will make use of direct_io so I can use another NVMe spare for faster access.
When I can spend some $3000 or so I'll absolutely buy several 20 TB drives and just nail the whole thing -- and will use ZFS -- but for now the several spare HDDs that I want dedicated to my data are set up exactly as you mentioned: root vdevs with no redundancy. ZFS is mostly handling it fine even though the drives have vastly different speeds (and one of them is actually an SSD).
So yep ZFS can still do quite a lot, it's just still not flexible enough in a manner that f.ex. bcachefs is. But the latter is still missing important features so I am sticking with ZFS for a while still.
* If the hard drive / SSD corrupted blocks, the corruption would be identified.
* Ditto blocks allow for self healing. Usually, this only applies to metadata, but if you set copies=2, you can get this on data too. It is a poor man’s RAID.
* ARC made the desktop environment very responsive since unlike the LRU cache, ARC resists cold cache effects from transient IO workloads.
* Transparent compression allowed me to store more on the laptop than otherwise possible.
* Snapshots and rollback allowed me to do risky experiments and undo them as if nothing happened.
* Backups were easy via send/receive of snapshots.
* If the battery dies while you are doing things, you can boot without any damage to the filesystem.
That said, I use a MacBook these days when I need to go outside. While I miss ZFS on it, I have not felt motivated to try to get a ZFS rootfs on it since, the last I checked, Apple hardcoded the assumption that the rootfs is one of its own filesystems into the XNU kernel and other parts of the system.
For other stuff, let that nerdy CorpIT handle your system.
Also snapshots are great regardless.
Every ZFS block pointer has room for 3 disk addresses; by default, the extras are used only for redundant metadata, but they can also be used for user data.
When you turn on ditto blocks for data (zfs set copies=2 rpool/foo), ZFS can fix corruption even on single-drive systems, at the cost of using double or triple the space. Note that (like compression) this only affects blocks written after the setting is in place, but (if you can pause writes to the filesystem) you can use zfs send | zfs recv to rewrite all blocks and ensure everything is redundant.
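Spelled out as commands (a sketch reusing the placeholder dataset name from above):

    zfs set copies=2 rpool/foo                              # new writes get two copies from here on
    zfs snapshot rpool/foo@ditto                            # existing blocks still carry a single copy
    zfs send rpool/foo@ditto | zfs recv rpool/foo-copies2   # rewriting them (here into a new dataset) duplicates them too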
Windows has its own extended filesystem through Storage Spaces, with many ZFS features added as lesser used Storage Spaces options, especially when combined with ReFS.
Rot means bits changing without those bits being accessed, and that's ~not possible with ZFS. Additionally, you can enable checksumming IN the ARC (disabled by default), and with that you can say that ECC and "enterprise"-quality hardware are even more important for non-ZFS systems.
>Correct data with a faulty checksum will be treated the same as incorrect data with a correct checksum.
There is no such thing as "correct" data, only a block with a correct checksum, if the checksum is not correct, the block is not ok.
No. Bad HDDs/SSDs or bad SATA cables/ports cause a lot more data corruption than bad RAM. And ZFS will correct these cases even without ECC memory. It's a myth that the data healing properties of ZFS are useless without ECC memory.
The key feature for me, which I miss, is the snapshotting integrated into the package manager.
ZFS allows snapshots more or less for free (due to copy-on-write), including cron-based snapshotting every 15 minutes. So if I made a mistake anywhere, there was a way to recover.
And that, integrated with the update manager and boot manager, means that on an update a snapshot is created, and during boot one can switch between states. I never had a broken update, but it gave a good feeling.
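For the 15-minute cadence, a minimal cron sketch (pool name and snapshot prefix are illustrative; tools like zfs-auto-snapshot or sanoid package this up more nicely):

    */15 * * * * /usr/sbin/zfs snapshot -r rpool@auto-$(date +\%Y\%m\%d-\%H\%M)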
On my home server I like the RAID features, and on Solaris it was nicely integrated with NFS etc., so that one can easily create volumes, export them, and set restrictions (max size etc.) on them.
Some Linux distros have that by default with btrfs. And usually it's a package install away if you're already on btrfs.
(I have never used dedup, but it's there if you want I guess)
Reading any file will tell you with 100% guarantee if it is corrupt or not.
Snapshots that you can `cd` into, so you can compare any prior version of your FS with the live version of your FS.
Block level compression.
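The snapshot browsing mentioned above goes through the hidden .zfs directory at the root of each dataset; a quick illustration with hypothetical paths:

    ls /tank/data/.zfs/snapshot/
    diff -r /tank/data/.zfs/snapshot/before-upgrade/etc /tank/data/etc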
Only possible if it was not corrupted in RAM before it was written to disk.
Using ECC memory is important, irrespective of ZFS.
Honestly I really like ReFS, particularly in context of storage spaces, but I don't think it's relevant to Microsoft's consumer desktop OS where users don't have 6 drives they need to pool together. Don't get me wrong, I use ZFS because that's what I can get running on a Linux server and I'm not going to go run Windows Server just for the storage pooling... but ReFS + Storage Spaces wins my heart with the 256 MB slab approach. This means you can add+remove mixed sized drives and get the maximum space utilization for the parity settings of the pool. Here ZFS is still getting to online adds of same or larger drives 10 years later.
For example, there are numerous new file systems people use: OneDrive, Google Drive, iCloud Storage. Do you get it?
https://github.com/openzfsonwindows/openzfs
Note that it is a beta.
CoW, advanced parity, and checksumming are the big ones NTFS lacks. CoW is just inherently not how NTFS is designed and checksumming isn't there. Anything else (encryption, compression, snapshots, ACLs, large scale, virtual devices, basic parity) is done through NTFS on Windows.
And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it were as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.
VSS is also used by backup software to quiet the filesystem by the way.
But yeah the others are great features. My main point was though that almost all the features of ZFS are very beneficial even on a single drive. You don't need an array to take advantage of Snapshots, the crash reliability that CoW offers, and checksumming (though you will lack the repair option obviously)
ZFS on Windows, as a first-class supported-by-Microsoft option would be killer. It won't ever happen, but it would be great. (NTFS / VSS with filesystem/snapshot send/receive would "scratch" a lot of that "itch", too.)
> And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it were as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.
VSS on NTFS is handy and useful but in my experience brittle compared to ZFS snapshots. Sometimes VSS just doesn't work. I've had repeated cases over the years where accessing a snapshot failed (with traditional unhelpful Microsoft error messages) until the host machine was rebooted. Losing VSS snapshots on a volume is much easier than trashing a ZFS volume.
VSS straddles the filesystem and application layers in a way that ZFS doesn't. I think that contributes to some of the jank (VSS writers becoming "unstable", for example). It also straddles hardware interfaces in a novel way that ZFS doesn't (using hardware snapshot functionality-- somewhat like using a GPU versus "software rendering"). I think that also opens up a lot of opportunity for jank, as compared to ZFS treating storage as dumb blocks.
Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.
That said, there is one tiny caveat people should be aware of:
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
1. Delete the snapshots and rewrite the files in place like how people do when they want to rebalance a pool.
2. Use send/receive inside the pool.
Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
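For option 2, a minimal sketch (pool and dataset names are hypothetical):

    zfs snapshot -r tank/media@rewrite
    zfs send -R tank/media@rewrite | zfs recv tank/media-new   # data lands at the new data:parity ratio
    # after verifying the copy, swap the datasets
    zfs destroy -r tank/media
    zfs rename tank/media-new tank/media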
Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do! you're creating massive fragmentation upwards of ~85% utilization), efficiency in the real world will be better.
Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Dc3 Pa3 Pb3
___ ___ ___ Pa4 Pb4
___ represents free space. After expansion by one disk you would logically expect something like:
Da1 Db1 Dc1 Da2 Pa1 Pb1
Db2 Dc2 Da3 Db3 Pa2 Pb2
Dc3 ___ ___ ___ Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
But as I understand it, it would actually expand to:
Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
Da3 Db3 Dc3 Dd3 Pa3 Pb3
___ ___ ___ ___ Pa4 Pb4
Where the Dd1-3 blocks are just wasted. Meaning by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of usable storage originally, and you have 4TB free before expansion; you would have 5TB free after expansion.
Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.
The slides here explain how it works:
https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.
I've been of two minds on the persistent myth of "parity disks" but I usually ignore it, because it's a convenient lie to understand your data is safe, at least. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.
You can see this in the presentation[1] slides[2].
The reason this is sub-optimal post-expansion is because, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then
Da1 Db1 Dc1 Pa1 Pb1
Da2 Db2 Dc2 Pa2 Pb2
Da3 Db3 Pa3 Pb3 ___
would after RAID-Z expansion become:
Da1 Db1 Dc1 Pa1 Pb1 Da2
Db2 Dc2 Pa2 Pb2 Da3 Db3
Pa3 Pb3 ___ ___ ___ ___
I.e. you added a disk with 3 new blocks, and so total free space after is 1+3 = 4 blocks.
However, if the same data was written in the post-expanded vdev configuration, it would have become:
Da1 Db1 Dc1 Dd1 Pa1 Pb1
Da2 Db2 Dc2 Dd2 Pa2 Pb2
___ ___ ___ ___ ___ ___
I.e., you'd have 6 free blocks, not just 4. Of course this doesn't account for writes which end up taking less than the maximal stripe width.
[1]: https://www.youtube.com/watch?v=tqyNHyq0LYM
[2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but when written in the 4+2 setup they only need 2 full stripes, leading to more total free space.
> Unraid saves data to individual drives rather than spreading single files out over multiple drives
https://en.wikipedia.org/wiki/Unraid#Software-defined_NAS
At least, it is not one in the sense of the original RAID paper that coined the term:
https://web.mit.edu/6.033/2015/wwwdocs/papers/Patterson88.pd...
However, if you replace the smallest disk with a new, larger drive, and resilver, then it'll now use the new smallest disk as the baseline, and use that extra space on the other drives.
(Someone please correct me if I'm wrong.)
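If I understand it right, that is the standard replace-and-autoexpand path; a rough sketch with hypothetical pool/device names:

    zpool set autoexpand=on tank
    zpool replace tank sdb sdj    # swap the smallest disk for a larger one and let it resilver
    # usable capacity grows once the smallest disk remaining in the vdev is larger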
This might be misleading; however, it may only be my understanding of the word "array".
You can use 2x10TB mirrors as vdev0, and 6x12TB in RAIDZ2 as vdev1 in the same pool/array. You can also stack as many unevenly sized disks as you want in a pool. The actual problem is when you want a different drive topology within a pool or vdev, or you want to mismatch, say, 3 oddly sized drives to create some synthetic redundancy level (2x4TB and 1x8TB to achieve two copies on two disks) like btrfs does/tries to do.
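For what it's worth, a pool mixing differently shaped vdevs is just a matter of listing them at creation (or adding them later); device names here are hypothetical:

    # one pool, two top-level vdevs of different shapes
    zpool create tank \
      mirror sda sdb \
      raidz2 sdc sdd sde sdf sdg sdh
    zpool status tank   # shows mirror-0 and raidz2-1 as separate top-level vdevs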
This new operation seems somewhat more sophisticated.
btrfs and the new bcachefs can do RAID with mixed drives, but I can’t trust either of them with my data yet.
You can do borderline insane single-vdev setups like RAID-Z3 with 4 disks (3 disks' worth of redundancy) of the most expensive and highest-density hard drives money can buy right now, for an initial effective space usage of 25%, and then keep buying and expanding disk by disk as the space demand grows, up to something like 12ish disks. Disk prices drop as time goes on, and you get a spread-out failure chance from disks being added at different times.
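A rough sketch of that growth path (hypothetical device names; raidz expansion needs OpenZFS 2.3+):

    zpool create tank raidz3 sda sdb sdc sdd   # 4 disks, 3 disks' worth of parity
    # later, grow the same raidz vdev one disk at a time:
    zpool attach tank raidz3-0 sde
    zpool status tank                          # reports expansion progress while it runs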
When you expand your array, your existing data will not be stored any more efficiently.
To get the new parity/data ratios, you would have to force copies of the data and delete the old, inefficient versions, e.g. with something like this [1]
My personal take is that it's a much better idea to buy individual complete raid-z configurations and add new ones / replace old ones (disk by disk!) as you go.
If btrfs had been in better shape, I would have been a btrfs contributor. Unfortunately for btrfs, it not only was in bad shape back then, but other btrfs issues continued to bite me every time I tried it over the years for anything serious (e.g. frequent ENOSPC errors when there is still space). ZFS on the other hand just works. Myself and many others did a great deal of work to ensure it works well.
The main reason for the difference is that ZFS had a very solid foundation, which was achieved by having some fantastic regression testing facilities. It has a userland version that randomly exercises the code to find bugs before they occur in production and a test suite that is run on every proposed change to help shake out bugs.
ZFS also has more people reviewing proposed changes than other filesystems. The Btrfs developers will often state that there is a significant manpower difference between the two file systems. I vaguely recall them claiming the difference was a factor of 6.
Anyway, few people who use ZFS regret it, so I think you will find you like it too.
Some simple use-cases are arguably production ready with BTRFS, YMMV.
I use ZFS on Arch Linux and overall have had no problems with it so far. There's more customization and methods to optimize performance. My one suggestion is to do a lot of research and testing with ZFS. There is a bit of a learning curve, but it's been worth the switch for me.
As far as I know shrinking a pool is still not possible though. So if you have a pool with 5 drives and add a 6th, you can't go back to 5 drives even if there is very little data in it.
Checksums: this is even more important in home usage as the hardware is usually of lower quality. Faulty controllers, crappy cables, hard disks stored in a higher than advised temperature... many reasons for bogus data to be saved, and zfs handles that well and automatically (if you have redundancy)
Snapshots: very useful to make backups and quickly go back to an older version of a file when mistakes are made
Peace of mind: compared to the alternatives, I find that zfs is easier to use and makes it harder to make a mistake that could bring data loss (e.g. remove by mistake the wrong drive when replacing a faulty one, pool becomes unusable, "oops!", put the disk back, pool goes back to work as if nothing happened). Maybe it is different now with mdadm, but when I used it years ago I was always worried about making a destructive mistake.
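For reference, the checksum/self-healing part mentioned above is exercised by a scrub, which walks every block, verifies checksums, and repairs from redundancy where it can (pool name is hypothetical):

    zpool scrub tank
    zpool status tank   # shows scrub progress and any checksum errors found/repaired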
Piling on here: Sending snapshots to remote machines (or removable drives) is very easy. That makes snapshots viable as a backup mechanism (because they can exist off-site and offline).
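A minimal sketch of that workflow (dataset, host, and snapshot names are made up):

    zfs snapshot tank/data@monday
    zfs send tank/data@monday | ssh backuphost zfs receive -u backup/data
    # later snapshots only need to send the delta:
    zfs snapshot tank/data@tuesday
    zfs send -i @monday tank/data@tuesday | ssh backuphost zfs receive -u backup/data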
A friend uses ZFS to store his Steam games on a couple of hard drives. He gave ZFS a SSD to use as L2ARC. ZFS automatically caches the games he likes to run on the SSD so that they load quickly. If he changes which games he likes to run, ZFS will automatically adapt to cache those on the SSD instead.
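Presumably that setup is just a cache vdev added to the pool, something like this (device name hypothetical):

    zpool add tank cache nvme0n1
    zpool iostat -v tank   # the L2ARC device shows up under its own "cache" section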
Snapshots ("boot environments") are also supported by Btrfs (my Linux installations use that so I don't have to worry about having the 3rd party kernel module to read my rootfs). Performance isn't that great either and, assuming Linux, XFS is a better choice if that is your main concern.
Automated snapshots are super easy and fast. They're also easy to access: if you deleted a file, you don't have to restore the whole snapshot, you can just cp it from the hidden .zfs/ folder.
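For example (dataset and snapshot names are made up):

    ls /tank/home/.zfs/snapshot/
    cp /tank/home/.zfs/snapshot/autosnap_2025-01-14/important.txt /tank/home/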
I've been running it on 6x 8TB disks for a couple of years now, in a raidz2, which means up to 2 disks can die. Would I use it on a single disk on a desktop? Probably not.
I do. Snapshots and replication and checksumming are awesome.
I do occasionally use snapshots and the compression feature is handy on quite a lot of my data set but I don't use the user and group limitations or remote send and receive etc. ZFS does a lot more than I need but it also works really well and I wouldn't move away from a checksumming filesystem now.
- Confidence in my long-term storage of some data I care about, as zpool scrub protects against bit rot
- Cheap snapshots that provide both easy checkpoints for work saved to my network share, and resilience against ransomware attacks against my other computers' backups to my NAS
- Easy and efficient (zfs send) replication to external hard drives for storage pool backup
- Built-in and ergonomic encryption
And it's really pretty easy to use. I started with FreeNAS (now TrueNAS), but eventually switched to just running FreeBSD + ZFS + Samba on my file server because it's not that complicated.
- a single solution that covers the entire storage domain (I don't have to learn multiple layers, like logical volume manager vs. ext4 vs. physical partitions)
- cheap/free snapshots. I have been glad to have been able to revert individual files or entire file systems to an earlier state. E.g., create a snapshot before doing a major distro update (sketched below).
- easy to configure/well documented
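The pre-upgrade snapshot mentioned above, as a rough sketch (dataset name is hypothetical):

    zfs snapshot rpool/ROOT/default@pre-upgrade
    # ...run the distro upgrade; if it goes wrong:
    zfs rollback rpool/ROOT/default@pre-upgrade   # add -r if newer snapshots exist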
Like others have said, at this point I would need a good reason, NOT to use ZFS on a system.
Mostly because it's there, but also the snapshots have a `diff` feature that's occasionally useful.
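For those who haven't seen it, it compares a snapshot against the live dataset (names here are made up):

    zfs diff tank/home@yesterday tank/home
    # each path is marked +, -, M, or R (created, removed, modified, renamed)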
For those not in the know:
You don't need to use enterprise quality disks. There is nothing in the ZFS design that requires enterprise quality disks any more than any other file system. In fact, ZFS has saved my data through multiple consumer-grade HDD failures over the years thanks to raidz.
The 1 gig per TB figure is ONLY for when using the ZFS dedup feature, which is widely regarded as a bad idea except in VERY specific use cases. 99.9% of ZFS users should not and will not use dedup, and therefore they do not need ridiculous piles of RAM.
There is nothing in the design of ZFS that makes it any more dangerous to run without ECC than any other filesystem. ECC is a good idea regardless of filesystem, but it's certainly not a requirement.
And you don't need x5 disks of redundancy. It runs great and has benefits even on single-disk systems like laptops. Naturally, having parity drives is better in case a drive fails but on single disk systems you still benefit from the checksumming, snapshotting, boot environments, transparent compression, incremental zfs send/recv, and cross-platform native encryption.
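A single-disk setup with those features is only a couple of commands; the names and the zstd/passphrase choices below are just illustrative:

    zpool create -O compression=zstd -O encryption=on -O keyformat=passphrase \
        -O atime=off rpool nvme0n1p3
    zfs snapshot rpool@before-experiment   # cheap checkpoints even without redundancy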
I'm in the process of setting up a server at home and was testing a few different file systems. I was doing a test where I had a program continuously synchronously writing just a single byte every second (like might happen for some programs that are writing logs fairly continuously). For most of my tests I was just using the default settings for each file system. When using ext4 this resulted in 28 KB/s of actual writes being done to the drive, which seems reasonable due to 4 KB blocks needing to be written, journaling, writing metadata, etc... BTRFS generated 68 KB/s of actual writes, which still isn't too bad. When using ZFS, about the best I could get it to do after trying various settings for volblocksize, ashift, logbias, atime, and compression still resulted in 312 KB/s of actual writes being done to the drive, which I was not pleased with. At the rate ZFS was writing data, over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive, which is about a quarter of what my SSD is rated for.
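For anyone wanting to reproduce something like this, here's a rough approximation of the workload described (paths are made up; GNU coreutils dd assumed), with the actual device traffic observed separately:

    # append one byte per second with a synchronous write
    while true; do
      printf 'x' | dd of=/tank/test/app.log oflag=append,sync conv=notrunc status=none
      sleep 1
    done
    # in another shell, watch what actually hits the disks:
    zpool iostat -v tank 1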
> like might happen for some programs that are writing logs fairly continuously
On Linux, I think journald would be aggregating your logs from multiple services so at least you wouldn't be incurring that cost on a per-program basis. On FreeBSD with syslog we're doomed to separate log files.
> over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive which is about a quarter of what my SSD is rated for
I sure hope I've upgraded SSDs by the year 2065.
I don't believe that zfs_txg_timeout setting would make much of a difference for the test I described where I was doing synchronous writes.
> On Linux, I think journald would be aggregating your logs from multiple services so at least you wouldn't be incurring that cost on a per-program basis.
The server I'm setting up will be hosting several VMs running a mix of OSes and distros and running many types of services and apps. Some of the logging could be aggregated, but there will be multiple types of I/O (various types of databases, app updates, file server, etc...) and I wanted to get an idea of how much file system overhead there might be in a worst-case kind of scenario.
> I sure hope I've upgraded SSDs by the year 2065.
Since I'll be running a lot of stuff on the server, I'll probably have quite a bit more writing going on than the test I described so if I used ZFS I believe the SSD could reach its rated endurance in just several years.
My mind jumped at that too when I first read parent's comment. But presumably he's writing other files to disk too. Not just that one file. :)
Yes, there will be much more going on than the simple test I was doing. The server will be hosting several VMs running a mix of OSes and distros and running many types of services and apps.
You also really don't need 1GB of RAM per TB unless you have a very high write volume. YMMV, but my experience is that it's closer to 1GB per 10TB.
- Excellent filesystem level backup facility. I can transfer snapshots to a spare drive, or send/receive to a remote (at present a spare computer, but rsync.net looks better every year I have to fix up the spare).
- Unlike other fs-level backup solutions, the flexibility of zvols means I can easily expand or shrink the scope of what's backed up.
- It's incredibly easy to test (and restore) backups. Pointing my to-be-backed-up volume, or my backup volume, to a previous backup snapshot is instant, and provides a complete view of the filesystem at that point in time. No "which files do you want to restore" hassles or any of that, and then I can re-point back to latest and keep stacking backups. Only Time Machine has even approached that level of simplicity in my experience, and I have tried a lot of backup tools. In general, backup tools/workflows that uphold "the test process is the restoration process, so we made the restoration process as easy and reversible as possible" are the best ones.
- Dedup occasionally comes in useful (if e.g. I'm messing around with copies of really large AI training datasets or many terabytes of media file organization work). It's RAM-expensive, yes, but what's often not mentioned is that you can turn it on and off for a volume--if you rewrite data. So if I'm looking ahead to a week of high-volume file wrangling, I can turn dedup on where I need it, start a snapshot-and-immediately-restore of my data (or if it's not that many files, just cp them back and forth), and by the next day or so it'll be ready. Turning it off when I'm done is even simpler. I imagine that the copy cost and unpredictable memory usage mean that this kind of "toggled" approach to dedup isn't that useful for folks driving servers with ZFS, but it's outstanding on a workstation.
- Using ZFSBootMenu outside of my OS means I can be extremely cavalier with my boot volume. Not sure if an experimental kernel upgrade is going to wreck my graphics driver? Take a snapshot and try it! Not sure if a curl | bash invocation from the internet is going to rm -rf /? Take a snapshot and try it! If my boot volume gets ruined, I can roll it back to a snapshot in the bootloader from outside of the OS. For extra paranoia I have a ZFSBootMenu EFI partition on a USB drive if I ever wreck the bootloader as well, but the odds are that if I ever break the system that bad the boot volume is damaged at the block level and can't restore local snapshots. In that case, I'd plug in the USB drive and restore a snapshot from the adjacent data volume, or my backup volume ... all without installing an OS or leaving the bootloader. The benefits of this to mental health are huge; I can tend towards a more "college me" approach to trying random shit from StackOverflow for tweaking my system without having to worry about "adult professional me" being concerned that I don't know what running some random garbage will do to my system. Being able to experiment first, and then learn what's really going on once I find what works, is very relieving and makes tinkering a much less fraught endeavor.
- Being able to per-dataset enable/disable ARC and ZIL means that I can selectively make some actions really fast. My Steam games, for example, are in a high-ARC-bias dataset that starts prewarming (with throttled IO) in the background on boot. Game load times are extremely fast--sometimes at better than single-ext4-SSD levels--and I'm storing all my game installs on spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)! (A rough sketch of this kind of per-dataset tuning, and the dedup toggling mentioned above, follows below.)
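The per-dataset toggles referred to above are ordinary properties; dataset names here are hypothetical:

    zfs set dedup=on tank/scratch          # enable dedup only where it pays off...
    zfs set dedup=off tank/scratch         # ...and off again when done (old blocks stay deduped until rewritten)
    zfs set primarycache=all tank/games    # let ARC cache both data and metadata
    zfs set primarycache=metadata tank/vm  # keep this dataset's bulk data out of ARC
    zfs set sync=disabled tank/scratch     # skip the ZIL for throwaway data (riskier on power loss)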
One thing that you might not be aware of is that you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab. Keep in mind though that you can only have one checkpoint at a time, they keep growing and growing, and a rollback is for EVERYTHING on the pool.
> you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab
Good to know! Snapshots meet most of my needs at present (since my boot volume is a single fast drive, snapshots ~~ checkpoints in this case), but I could see this coming in useful for future scenarios where I need to do complex or risky things with data volumes or SAN layout changes.
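For completeness, the generic (non-ZFSBootMenu) checkpoint workflow looks roughly like this (pool name hypothetical):

    zpool checkpoint tank                      # take the single allowed checkpoint
    # ...do the risky thing; if it goes badly:
    zpool export tank
    zpool import --rewind-to-checkpoint tank   # rolls back EVERYTHING on the pool
    # if all went well instead, free the space the checkpoint was holding:
    zpool checkpoint --discard tank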
https://github.com/kaspergrubbe/fedora-kernel-compilation/bl...
It builds on top of the exploded fedora kernel tree, adds zfs and spits out a .rpm that you can install with rpm -ivh.
It doesn't play well with dkms because it tries to interfere, so I disable it on my system.
ARC is based in RAM, so how could it reduce performance when used with NVMe devices? They are fast, but they aren't RAM-fast ...
By the way, comments such as yours seem to assume that Oracle is somehow involved with OpenZFS. Oracle has no connection with OpenZFS outside of owning copyright on the original OpenSolaris sources and a few tiny commits their employees contributed before Oracle purchased Sun. Oracle has its own internal ZFS fork and they have zero interest in bringing it to Linux. They want people to either go on their cloud or buy this:
The CDDL was designed to ensure that if Sun Microsystems were acquired by a company hostile to OSS, people could still use Sun’s open source software. In particular, the CDDL has an explicit software patent grant. Some consider that to have been invaluable in preempting lawsuits from a certain company that would rather have ZFS be closed source software.
Oracle only owns the copyright to the original Sun Microsystems code. It doesn’t apply to all ZFS implementations (probably not OracleZFS, perhaps not IllumosZFS) but in the specific case of OpenZFS the majority of the code is no longer Sun code.
Don’t forget that SunZFS was open sourced in 2005 before Oracle bought Sun Microsystems in 2009. Oracle have created their own closed source version of ZFS but outside some Oracle shops nobody uses it (some people say Oracle has stopped working on OracleZFS all together some time ago).
Considering the forks (first from Sun to the various open source implementations and later the fork from open source into Oracle's closed source version) were such a long time ago, there is not that much original code left. A lot of storage tech, or even entire storage concepts, did not exist when Sun open sourced ZFS. Various ZFS implementations developed their own support for TRIM, or Sequential Resilvering, or Zstd compression, or Persistent L2ARC, or Native ZFS Encryption, or Fusion Pools, or Allocation Classes, or dRAID, or RAIDZ expansion long after 2005. That is why the majority of the code in OpenZFS 2 is from long after the fork from Sun code twenty years ago.
Modern OpenZFS contains new code contributions from Nexenta Systems, Delphix, Intel, iXsystems, Datto, Klara Systems and a whole bunch of other companies that have voluntarily offered their code when most of the non-Oracle ZFS implementations merged to become OpenZFS 2.0.
If you'd want to relicense OpenZFS you could get Oracle to agree for the bit under Sun copyright but for the majority of the code you'd have to get a dozen or so companies to agree to relicensing their contributions (probably not that hard) and many hundreds of individual contributors over two decades (a big task and probably not worth it).
The whole thing smells of some FSF agenda to me. If you ship a CDDL file in your GPL project, it is still a GPL-licensed project and the CDDL file is still a CDDL-licensed file.