379 points by scrp 4 days ago | 15 comments
  • scrp 4 days ago
    After years in the making, ZFS raidz expansion is finally here.

    Major features added in release:

      - RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.
    
      - Fast Dedup: A major performance upgrade to the original OpenZFS deduplication functionality.
    
      - Direct IO: Allows bypassing the ARC for reads/writes, improving performance in scenarios like NVMe devices where caching may hinder efficiency.
    
      - JSON: Optional JSON output for the most used commands.
    
      - Long names: Support for file and directory names up to 1023 characters.
    • eatbitseveryday 3 days ago
      > RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.

      More specifically:

      > A new device (disk) can be attached to an existing RAIDZ vdev
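
      A minimal sketch of what that attach looks like, assuming the `zpool attach POOL raidzP-N NEW_DEVICE` form from the 2.3 release notes. The helper `raidz_expand_argv` and the names `tank`, `raidz1-0` and `/dev/sdf` are hypothetical; the code only builds the command line and does not run it.

```python
import shlex


def raidz_expand_argv(pool, raidz_vdev, new_device):
    """Build (without running) the argv that attaches a disk to a RAIDZ vdev."""
    for label, value in (("pool", pool), ("raidz_vdev", raidz_vdev), ("new_device", new_device)):
        # Reject empty values and anything that would be parsed as an option.
        if not value or value.startswith("-"):
            raise ValueError(f"invalid {label}: {value!r}")
    return ["zpool", "attach", pool, raidz_vdev, new_device]


# "tank", "raidz1-0" and "/dev/sdf" are hypothetical example names.
argv = raidz_expand_argv("tank", "raidz1-0", "/dev/sdf")
print(shlex.join(argv))
```

      The point of the sketch is that the target is the raidz vdev inside the pool, not the pool itself.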

    • cromka 3 days ago
      So if I’m running Proxmox on ZFS with NVMe drives, will I be better off enabling Direct IO when 2.3 gets rolled out? What are the use cases for it?
      • 0x457 a day ago
        Direct IO is useful for databases and other applications that use their own disk caching layer. Without knowing what you run in Proxmox, no one will be able to tell you if it's beneficial or not.
      • Saris 2 days ago
        I would guess for very high performance NVMe drives.
    • jdboyd 4 days ago
      The first 4 seem like really big deals.
      • snvzz 4 days ago
        The fifth is also, once you consider non-ASCII names.
        • GeorgeTirebiter 3 days ago
          Could someone show a legit reason to use 1000-character filenames? Seems to me, when filenames are long like that, they are actually capturing several KEYS that can be easily searched via ls & re's. e.g.

          2025-Jan-14-1258.93743_Experiment-2345_Gas-Flow-375.3_etc_etc.dat

          But to me this stuff should be in metadata. It's just that we don't have great tools for grepping the metadata.
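
          A rough sketch of pulling those keys back out of such a name, assuming the convention in the example above (underscore-separated fields, with the value after the last hyphen). The function name `parse_filename_keys` is made up for illustration.

```python
from pathlib import PurePath


def parse_filename_keys(filename):
    """Split an underscore-separated stem into a timestamp plus KEY-value fields."""
    stem = PurePath(filename).stem  # drops only the final ".dat" extension
    timestamp, *fields = stem.split("_")
    keys = {"timestamp": timestamp}
    for field in fields:
        key, sep, value = field.rpartition("-")
        if sep:  # fields without a "-" (like "etc") carry no value and are skipped
            keys[key] = value
    return keys


name = "2025-Jan-14-1258.93743_Experiment-2345_Gas-Flow-375.3_etc_etc.dat"
for key, value in parse_filename_keys(name).items():
    print(f"{key} = {value}")
```

          This only works as long as every writer sticks to the naming convention, which is part of the argument for real metadata.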

          Heck, the original Macintosh File System (MFS) had no true hierarchical subdirectories. Instead, the illusion of subdirectories was created by embedding folder-like names into the filenames themselves in a flat filesystem.

          This was done by using colons (:) as separators in filenames. A file named Folder:Subfolder:File would appear to belong to a subfolder within a folder. This was entirely a user interface convention managed by the Finder. Internally, MFS stored all files in a flat namespace, with no actual directory hierarchy in the filesystem structure.

          So, there is 'utility' in "overloading the filename space". But...

          • p_l 3 days ago
            > Could someone show a legit reason to use 1000-character filenames?

            1023-byte names can mean fewer than 250 characters due to the use of Unicode and UTF-8. Add to that Unicode normalization, which might "expand" some characters into two or more combining characters, deliberate use of combining characters, emoji, and rare characters, and you might end up with many "characters" taking more than 4 bytes. A single "country flag" character will usually be 8 bytes, most emoji will be at least 4 bytes, skin tone modifiers will add 4 bytes, etc.

            this ' ' takes 27 bytes in my terminal, '󠁧󠁢󠁳󠁣󠁴󠁿' takes 28, another combo I found is 35 bytes.

            And that's on top of just getting a long title using, let's say, one of the CJK or other less common scripts: an early manuscript of a somewhat successful Japanese novel has a non-normalized filename of 119 bytes, and it's nowhere close to actually long titles, something that someone might reasonably have on disk. A random find on the internet easily points to a book title that takes over 300 bytes in non-normalized UTF-8.

            P.S. proper title of "Robinson Crusoe" if used as filename takes at least 395 bytes...
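
            A small sketch for checking these byte counts yourself. The sample strings are illustrative picks (written as escapes so they survive copy-paste), and `utf8_len` and `max_repeats` are hypothetical helper names; the 1023-byte limit is the one from the release notes.

```python
import unicodedata

SAMPLES = {
    "ASCII letter": "a",
    "CJK character": "\u672c",
    "emoji": "\U0001F600",
    "emoji + skin tone modifier": "\U0001F44D\U0001F3FD",
    "country flag (two regional indicators)": "\U0001F1EF\U0001F1F5",
    "e-acute, precomposed (NFC)": unicodedata.normalize("NFC", "\u00e9"),
    "e-acute, decomposed (NFD)": unicodedata.normalize("NFD", "\u00e9"),
}


def utf8_len(text):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def max_repeats(text, limit=1023):
    """How many copies of `text` fit into a name of at most `limit` bytes."""
    return limit // utf8_len(text)


for label, text in SAMPLES.items():
    print(f"{label}: {len(text)} code point(s), {utf8_len(text)} byte(s), "
          f"{max_repeats(text)} fit in 1023 bytes")
```

            Note how normalization alone changes the size: the decomposed form of the same visible character is one byte longer than the precomposed one.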

            • p_l 3 days ago
              hah. Apparently HN eradicated the carefully pasted complex unicode emojis.

              The first was "man+woman kissing" with skin tone modifier, then there were a few flags

    • cm2187 4 days ago
      But I presume it is still not possible to remove a vdev.
      • ryao 4 days ago
        That was added a while ago:

        https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...

        It works by making a readonly copy of the vdev being removed inside the remaining space. The existing vdev is then removed. Data can still be accessed from the copy, but new writes will go to an actual vdev, while space on the copy is gradually reclaimed as the old data is no longer needed.

        • lutorm 4 days ago
          Although "Top-level vdevs can only be removed if the primary pool storage does not contain a top-level raidz vdev, all top-level vdevs have the same sector size, and the keys for all encrypted datasets are loaded."
          • ryao 4 days ago
            I forgot we still did not have that last bit implemented. However, it is less important now that we have expansion.
            • justinclift 3 days ago
              > However, it is less important now that we have expansion.

              Not really sure if that's true. They seem like two different/distinct use cases, though there's probably some small overlap.

          • cm2187 4 days ago
            And in my case all the vdevs are raidz.
      • mustache_kimono 4 days ago
        Is this possible elsewhere (re: other filesystems)?
        • cm2187 4 days ago
          It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.
          • lloeki 4 days ago
            IIUC the ask (I have a hard time wrapping my head around zfs vernacular), btrfs allows this at least in some cases.

            If you can convince btrfs balance to not use the device you want to remove, it will simply rebalance data to the other devices, and then you can btrfs device remove.

          • mustache_kimono 4 days ago
            > It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.

            Perhaps I am misunderstanding you, but you can offline and remove drives from a ZFS pool.

            Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?

            • cm2187 4 days ago
              So for instance I have a ZFS pool with 3 HDD data vdevs, and 2 SSD special vdevs. I want to convert the two SSD vdevs into a single one (or possibly remove one of them). From what I read the only way to do that is to destroy the entire pool and recreate it (it's in a server in a datacentre, don't want to reupload that much data).

              In Windows, you can set a disk for removal, and as long as the other disks have enough space and are compatible with the virtual disks (e.g. you need at least 5 disks if you have parity with number of columns=5), it will rebalance the blocks onto the other disks until you can safely remove the disk. If you use thin provisioning, you can also change your mind about the settings of a virtual disk, create a new one on the same pool, and move the data from one to the other.

              Mdadm/lvm will do the same, albeit with more of a pain in the arse, as RAID requires resilvering not just the occupied space but also the free space, so it takes a lot more time and IO than it should.

              It's one of my beefs with ZFS: there are lots of no-return decisions. That, and I ran into some race conditions with loading a ZFS array on boot with NVMe drives on Ubuntu. They seem to not be ready, resulting in randomly degraded arrays. Fixed by loading the pool with a delay.

              • formerly_proven 4 days ago
                My understanding is that ZFS does virtual <-> physical translation in the vdev layer, i.e. all block references in ZFS contain a (vdev, vblock) tuple, and the vdev knows how to translate that virtual block offset into actual on-disk block offset(s).

                This kinda implies that you can't actually remove data vdevs, because in practice you can't rewrite all references. You also can't do offline deduplication without rewriting references (i.e. actually touching the files in the filesystem). And that's why ZFS can't deduplicate snapshots after the fact.

                On the other hand, reshaping a vdev is possible, because that "just" requires shuffling the vblock -> physical block associations inside the vdev.

                • ryao 4 days ago
                  There is a clever trick that is used to make top level removal work. The code will make the vdev readonly. Then it will copy its contents into free space on other vdevs (essentially, the contents will be stored behind the scenes in a file). Finally, it will redirect reads on that vdev into the stored vdev. This indirection allows you to remove the vdev. It is not implemented for raid-z at present though.
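
                  A toy model of that indirection, only to illustrate the idea. It is not how the ZFS code is structured: `ToyPool`, its methods, and the one-block-per-offset layout are all made up for the sketch. Block pointers are `(vdev, offset)` tuples that are never rewritten; removal copies the blocks elsewhere and leaves behind a mapping that old pointers are redirected through.

```python
class ToyPool:
    """Toy model of top-level vdev removal via indirection (not real ZFS code)."""

    def __init__(self, vdev_ids):
        self.vdevs = {vid: {} for vid in vdev_ids}       # vdev id -> {offset: data}
        self.next_offset = {vid: 0 for vid in vdev_ids}  # next free offset per vdev
        self.indirect = {}  # removed vdev id -> {old offset: (new vdev, new offset)}

    def write(self, vdev_id, data):
        """Store data on a live vdev and return its (vdev, offset) block pointer."""
        if vdev_id not in self.vdevs:
            raise KeyError(f"{vdev_id} is not a live vdev")
        offset = self.next_offset[vdev_id]
        self.vdevs[vdev_id][offset] = data
        self.next_offset[vdev_id] += 1
        return (vdev_id, offset)

    def read(self, pointer):
        """Resolve a block pointer, following mappings left by removed vdevs."""
        vdev_id, offset = pointer
        while vdev_id in self.indirect:
            vdev_id, offset = self.indirect[vdev_id][offset]
        return self.vdevs[vdev_id][offset]

    def remove_vdev(self, victim):
        """Copy the victim's blocks to the remaining vdevs and record the mapping."""
        targets = [vid for vid in self.vdevs if vid != victim]
        if not targets:
            raise ValueError("cannot remove the last vdev")
        mapping = {}
        for i, (offset, data) in enumerate(sorted(self.vdevs[victim].items())):
            mapping[offset] = self.write(targets[i % len(targets)], data)
        del self.vdevs[victim]
        del self.next_offset[victim]
        self.indirect[victim] = mapping


pool = ToyPool(["vdev0", "vdev1"])
old_pointer = pool.write("vdev1", b"hello")
pool.remove_vdev("vdev1")
print(pool.read(old_pointer))  # the old pointer still resolves after removal
```

                  In this model the removed vdev survives only as a lookup table, which is why no existing block pointer has to be rewritten.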
                  • formerly_proven 4 days ago
                    Though the vdev itself still exists after doing that? It just happens to be backed by, essentially, a "file" in the pool, instead of the original physical block devices, right?
              • ryao 4 days ago
                The man page says that your example is doable with zpool remove:

                https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...

            • Sesse__ 4 days ago
              > Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure of the drive topo?

              mdadm can convert RAID-5 to a larger or smaller RAID-5, RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or the other way around, RAID-0 to a degraded RAID-5, and many other fairly reasonable operations, while the array is online, resistant to power loss and the like.

              I wrote the first version of this md code in 2005 (against kernel 2.6.13), and Neil Brown rewrote and mainlined it at some point in 2006. ZFS is… a bit late to the party.

              • ryao 4 days ago
                Doing this with the on-disk data in a Merkle tree is much harder than doing it on more conventional forms of storage.

                By the way, what does MD do when there is corrupt data on disk that makes it impossible to know what the correct reconstruction is during a reshape operation? ZFS will know what file was damaged and proceed with the undamaged parts. ZFS might even be able to repair the damaged data from ditto blocks. I don’t know what the MD behavior is, but its options for handling this are likely far more limited.

                • Sesse__ 4 days ago
                  Well, then they made a design choice in their RAID implementation that made fairly reasonable things hard.

                  I don't know what md does if the parity doesn't match up, no. (I've never ever had that happen, in more than 25 years of pretty heavy md use on various disks.)

                  • ryao 4 days ago
                    I am not sure if reshaping is a reasonable thing. It is not so reasonable in other fields. In architecture, if you build a bridge and then want more lanes, you usually build a new bridge, rather than reshape the bridge. The idea of reshaping a bridge while cars are using it would sound insane there, yet that is what people want from storage stacks.

                    Reshaping traditional storage stacks does not consider all of the ways things can go wrong. Handling all of them well is hard, if not impossible to do in traditional RAID. There is a long history of hardware analogs to MD RAID killing parity arrays when they encounter silent corruption that makes it impossible to know what is supposed to be stored there. There is also the case where things are corrupted such that there is a valid reconstruction, but the reconstruction produces something wrong silently.

                    Reshaping certainly is easier to do with MD RAID, but the feature has the trade off that edge cases are not handled well. For most people, I imagine that risk is fine until it bites them. Then it is not fine anymore. ZFS made an effort to handle all of the edge cases so that they do not bite people and doing that took time.

                    • Sesse__ 4 days ago
                      > I am not sure if reshaping is a reasonable thing.

                      Yet people are celebrating when ZFS adds it. Was it all for nothing?

                      • ryao 4 days ago
                        People wanted it, but it was very hard to do safely. While ZFS now can do it safely, many other storage solutions cannot.

                        Those corruption issues I mentioned, where the RAID controller has no idea what to do, affect far more than just reshaping. They affect traditional RAID arrays when disks die and when patrol scrubs are done. I have not tested MD RAID on edge cases lately, but the last time I did, I found MD RAID ignored corruption whenever possible. It would not detect corruption in normal operation because it assumed all data blocks are good unless SMART said otherwise. Thus, it would randomly serve bad data from corrupted mirror members and always serve bad data from RAID 5/6 members whenever the data blocks were corrupted. This was particularly tragic on RAID 6, where MD RAID is hypothetically able to detect and correct the corruption if it tried. Doing that would come with such a huge performance overhead that it is clear why it was not done.

                        Getting back to reshaping, while I did not explicitly test it, I would expect that unless a disk is missing or disappears during a reshape, MD RAID would ignore any corruption that can be detected using parity and assume all data blocks are good just like it does in normal operation. It does not make sense for MD RAID to look for corruption during a reshape operation, since not only would it be slower, but even if it finds corruption, it has no clue how to correct the corruption unless RAID 6 is used, there are no missing/failed members and the affected stripe does not have any read errors from SMART detecting a bad sector that would effectively make it as if there was a missing disk.

                        You could do your own tests. You should find that ZFS gracefully handles edge cases where the wrong thing is in a spot where something important should be, while MD RAID does not. MD RAID is a reimplementation of a technology from the 1960s. If 1960s storage technology handled these edge cases well, Sun Microsystems would not have made ZFS to get away from older technologies.
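
                        A simplified sketch of the difference being described, under loud assumptions: three tiny data blocks with one XOR parity block stand in for a RAID 5 stripe, and SHA-256 stands in for whatever checksum a checksumming filesystem keeps apart from the data. None of this is MD or ZFS code. Parity alone shows that the stripe is inconsistent but not which block is wrong; a separately stored checksum points at the bad block, after which parity can rebuild it.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single parity, RAID 5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(block):
    """Stand-in for a checksum stored separately from the data it covers."""
    return hashlib.sha256(block).digest()


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stored_checksums = [checksum(block) for block in data]

# Silent corruption: the second disk returns wrong bytes without any I/O error.
on_disk = [data[0], b"BxBB", data[2]]

# Parity only says the stripe is inconsistent, not which member is bad.
stripe_inconsistent = xor_blocks(on_disk) != parity

# Checksums identify the bad member directly.
bad = [i for i, block in enumerate(on_disk) if checksum(block) != stored_checksums[i]]

# With the bad member known, it can be rebuilt from the others plus parity.
survivors = [block for i, block in enumerate(on_disk) if i != bad[0]]
repaired = xor_blocks(survivors + [parity])

print(stripe_inconsistent, bad, repaired)
```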

                        • justinclift 3 days ago
                          > While ZFS now can do it safely ...

                          It's the first release with the code, so "safely" might not be the right description until a few point releases happen. ;)

                          • ryao 2 days ago
                            It was in development for 8 years. I think it is safe, but time will tell.
                  • amluto 3 days ago
                    I’ve experienced bit rot on md. It was not fun, and the tooling was of approximately no help recovering.
            • TiredOfLife 4 days ago
              Storage Spaces doesn't dedicate a drive to a single purpose. It operates in chunks (256 MB, I think). So one drive can, at the same time, be part of a mirror and raid-5 and raid-0. This allows fully using drives of various sizes. And choosing to remove a drive will cause it to redistribute the chunks to other available drives, without going offline.
              • cm2187 4 days ago
                And as a user it seems to me to be the most elegant design. The quality of the implementation (parity write performance in particular) is another matter.
        • pantalaimon 3 days ago
          btrfs has supported online adding and removing of devices to the pool from the start.
        • c45y 4 days ago
          Bcachefs allows it
          • eptcyka 4 days ago
            Cool, just have to wait before it is stable enough for daily use of mission critical data. I am personally optimistic about bcachefs, but incredibly pessimistic about changing filesystems.
            • ryao 4 days ago
              It seems easier to copy data to a new ZFS pool if you need to remove RAID-Z top level vdevs. Another possibility is to just wait for someone to implement it in ZFS. ZFS already has top level vdev removal for other types of vdevs. Support for top level raid-z vdev removal just needs to be implemented on top of that.
        • unixhero 4 days ago
          Btrfs
          • tw04 4 days ago
            Except you shouldn’t use btrfs for any parity-based raid if you value your data at all. In fact, I’m not aware of any vendor that has implemented btrfs with parity-based raid; they all resort to btrfs on md.
    • BodyCulture 3 days ago
      How well tested is this in combination with encryption?

      Is the ZFS team handling encryption as a first class priority at all?

      ZFS on Linux inherited a lot of fame from ZFS on Solaris, but everyone using it in production should study the issue tracker very well for a realistic impression of the situation.

      • p_l 3 days ago
        Main issue with encryption is occasional attempts by certain (specific) Linux kernel developer to lock ZFS out of access to advanced instruction set extensions (far from the only weird idea of that specific developer).

        The way ZFS encryption is layered, the features should be pretty much orthogonal to each other, but I'll admit that ZFS native encryption is a bit lacking (though mainly in upper layer tooling in my experience, rather than the actual on-disk encryption parts).

        • ryao 3 days ago
          These are actually wrappers around CPU instructions, so what ZFS does is implement its own equivalents. This does not affect encryption (beyond the inconvenience that we did not have SIMD acceleration for a while on certain architectures).
        • snvzz 3 days ago
          >occasional attempts by certain (specific) Linux kernel developer

          Can we please refer to them by the actual name?

      • ryao 3 days ago
        The new features should interact fine with encryption. They are implemented at different parts of ZFS' internal stack.

        There have been many man hours put into investigating bug reports involving encryption and some fixes were made. Unfortunately, something appears to be going wrong when non-raw sends of encrypted datasets are received by another system:

        https://github.com/openzfs/zfs/issues/12014

        I do not believe anyone has figured out what is going wrong there. It has not been for lack of trying. Raw sends from encrypted datasets appear to be fine.

  • poisonborz 4 days ago
    I just don't get how the Windows world - by far the largest PC platform per userbase - still doesn't have any answer to ZFS. Microsoft had WinFS and then ReFS, but it's on the back burner, and while there is active development (Win11 ships some bits from time to time), a release is nowhere in sight. There are some lone warriors taking on the giant task of creating a ZFS compatibility layer with some projects, but they are far from being mature/usable.

    How come Windows still uses a 32-year-old file system?

    • GuB-42 4 days ago
      To be honest, the situation with Linux is barely better.

      ZFS has license issues with Linux, preventing full integration, and Btrfs is 15 years in the making and still doesn't match ZFS in features and stability.

      Most Linux distros still use ext4 by default, which is 19 years old, but ext4 is little more than a series of extensions on top of ext2, which is the same age as NTFS.

      In all fairness, there are few OS components that are as critical as the filesystem, and many wouldn't touch filesystems that have less than a decade of proven track record in production.

      • mogoh 4 days ago
        ZFS might be better than any other FS on Linux (I don't judge that).

        But you must admit that the situation on Linux is quite a bit better than on Windows. Linux has so many FS in the main branch. There is a lot of development. BTRFS had a rocky start, but it got better.

      • stephen_g 4 days ago
        I’m interested to know what ‘full integration’ does look like, I use ZFS in Proxmox (Debian-based) and it’s really great and super solid, but I haven’t used ZFS in more vanilla Linux distros. Does Proxmox have things that regular Linux is missing out on, or are there shortcomings and things I just don’t realise about Proxmox?
        • whataguy 4 days ago
          The difference is that the ZFS kernel module is included by default with Proxmox, whereas with e.g. Debian, you would need to install it manually.
          • pimeys 4 days ago
            And you can't follow the latest kernel before the ZFS module supports it.
            • ryao 3 days ago
              There is a trick for this:

                * Step 1: Make friends with a ZFS developer.
                * Step 2: Guilt him into writing patches to add support as soon as a new kernel is released.
                * Step 3: Enjoy
              
              Adding support for a new kernel release to ZFS is usually only a few hours of work. I have done it in the past more than a dozen times.
            • BSDobelix 4 days ago
              Try CachyOS https://cachyos.org/ , you can even swap from an existing Arch installation:

              https://wiki-dev.cachyos.org/sk/cachyos_repositories/how_to_...

            • gf000 3 days ago
              I use NixOS, and it simply updates to the latest kernel that supports zfs, with a single, declarative option.
            • blibble 4 days ago
              for Debian that's not exactly a problem
              • oarsinsync 4 days ago
                Unless you’re using Debian backports, and they backport a new kernel a week before the zfs backport package update happens.

                Happened to me more than once. I ended up manually changing the kernel version limitations the second time just to get me back online, but I don’t recall if that ended up hurting me in the long run or not.

        • BodyCulture 3 days ago
          You probably don’t realise how important encryption is.

          It’s still not supported by Proxmox. Yes, you can do it yourself somehow, but then you are on your own and miss features, and people report problems with double or triple file system layers.

          I do not understand why they do not have encryption out of the box; this seems to be a problem.

          • kevinmgranger 3 days ago
            I'm not sure about Proxmox, but ZFS on Linux does have encryption.
      • lousken 4 days ago
        as far as stability goes, btrfs is used by Meta, Synology and many others, so I wouldn't say it's not stable, but some features are lacking
        • azalemeth 4 days ago
          My understanding is that single-disk btrfs is good, but raid is decidedly dodgy; https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5... states that:

          > The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6.

          > There are some implementation and design deficiencies that make it unreliable for some corner cases and *the feature should not be used in production, only for evaluation or testing*.

          > The power failure safety for metadata with RAID56 is not 100%.

          I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive. I've used either mdadm + ext4 (for /) or zfs (for large /data mounts) ever since. Zfs is fantastic and I genuinely don't understand why it's not used more widely.

          • crest 4 days ago
            One problem with your setup is that ZFS by design can't use a traditional *nix filesystem buffer cache. Instead it has to use its own ARC (adaptive replacement cache) with end-to-end checksumming, transparent compression, and copy-on-write semantics. This can lead to annoying performance problems when the two types of file system caches contest for available memory. There is a back pressure mechanism, but it effectively pauses other writes while evicting dirty cache entries to release memory.
            • ryao 4 days ago
              Traditionally, you have the page cache on top of the FS and the buffer cache below the FS, with the two being unified such that double caching is avoided in traditional UNIX filesystems.

              ZFS goes out of its way to avoid the buffer cache, although Linux does not give it the option to fully opt out of it since the block layer will buffer reads done by userland to disks underneath ZFS. That is why ZFS began to purge the buffer cache on every flush 11 years ago:

              https://github.com/openzfs/zfs/commit/cecb7487fc8eea3508c3b6...

              That is how it still works today:

              https://github.com/openzfs/zfs/blob/fe44c5ae27993a8ff53f4cef...

              If I recall correctly, the page cache is also still above ZFS when mmap() is used. There was talk about fixing it by having mmap() work out of ARC instead, but I don’t believe it was ever done, so there is technically double caching done there.

              • taskforcegemini 3 days ago
                what's the best way to deal with this then? disable filecache of linux? I've tried disabling/minimizing arc in the past to avoid the oom reaper, but the arc was stubborn and its RAM usage remained as is
                • ryao 3 days ago
                  These days, ZFS frees memory fast enough when Linux requests memory to be freed that you generally do not see OOM because of ZFS, but if you have a workload where it is not fast enough, you can limit the maximum arc size to try to help:

                  https://openzfs.github.io/openzfs-docs/Performance%20and%20T...
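
                  A small sketch of producing such a limit, assuming the usual Linux setup where `zfs_arc_max` is a module parameter given in bytes and set through a modprobe options line (conventionally in `/etc/modprobe.d/zfs.conf`). The helper name and the 3 GiB figure are only illustrative.

```python
GIB = 1024 ** 3


def arc_max_modprobe_line(limit_gib):
    """Return a modprobe 'options' line capping the ARC at limit_gib GiB (in bytes)."""
    if limit_gib <= 0:
        raise ValueError("limit must be positive")
    return f"options zfs zfs_arc_max={int(limit_gib * GIB)}"


# Illustrative value only; pick a limit that suits your workload and RAM.
line = arc_max_modprobe_line(3)
print(line)
```

                  The same byte value can also be written to `/sys/module/zfs/parameters/zfs_arc_max` to change the limit at runtime.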

                • ssl-3 3 days ago
                  I didn't have any trouble limiting zfs_arc_max to 3GB on one system where I felt that it was important. I ran it that way for a fair number of years and it always stayed close to that bound (if it was ever exceeded, it wasn't by a noteworthy amount at any time when I was looking).

                  At the time, I had it this way because I had fear of OOM events causing [at least] unexpected weirdness.

                  A few months ago I discovered weird issues with a fairly big, persistent L2ARC being ignored at boot due to insufficient ARC. So I stopped arbitrarily limiting zfs_arc_max and just let it do its default self-managed thing.

                  So far, no issues. For me. With my workload.

                  Are you having issues with this, or is it a theoretical problem?

          • lousken 4 days ago
            I was assuming OP wants to highlight filesystem use on a workstation/desktop, not for a file server/NAS. I had a similar experience a decade ago, but these days single drives just work, same with mirroring. For such setups btrfs should be stable. I've never seen a workstation with a raid5/6 setup. Secondly, filesystems and volume managers are different things, even if e.g. btrfs and ZFS are essentially both.

            For a NAS setup I would still prefer ZFS with TrueNAS SCALE (or Proxmox if virtualization is needed), just because all these scenarios are supported as well. And as far as ZFS goes, encryption is still something I am not sure about, especially since I want to send snapshots as backups to a remote machine.

          • hooli_gan 4 days ago
            RAID5/6 is not needed with btrfs. One should use RAID1, which supports striping the same data onto multiple drives in a redundant way.
            • johnmaguire 3 days ago
              How can you achieve 2-disk fault tolerance using btrfs and RAID 1?
              • Dalewyn 3 days ago
                By using three drives.

                RAID1 is just making literal copies, so each additional drive in a RAID1 is a self-sufficient copy. You want two drives of fault tolerance? Use three drives, so if you lose two copies you still have one left.

                This is of course hideously inefficient as you scale larger, but that is not the question posed.

                • johnmaguire 3 days ago
                  > This is of course hideously inefficient as you scale larger, but that is not the question posed.

                  It's not just inefficient, you literally can't scale larger. Mirroring is all that RAID 1 allows for. To scale, you'd have to switch to RAID 10, which doesn't allow two-disk fault tolerance (you can get lucky if they are in different stripes, but this isn't fault tolerance.)

                  But you're right - RAID 1 also scales terribly compared to RAID 6, even before introducing striping. Imagine you have 6 x 16 TB disks:

                  With RAID 6, usable space of 64 TB, two-drive fault tolerance.

                  With RAID 1, usable space of 16 TB, five-drive fault tolerance.

                  With RAID 10 (two-way mirrors), usable space of 48 TB, one-drive fault tolerance.
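
                  The arithmetic behind figures like these can be sketched as follows, assuming identical disks, ignoring filesystem overhead, and counting only the number of failures that is guaranteed to be survivable. The function names and the `mirror_width` parameter are made up for the sketch.

```python
def raid6(disks, size_tb):
    """Usable TB and guaranteed failures tolerated: two disks' worth of parity."""
    return (disks - 2) * size_tb, 2


def raid1(disks, size_tb):
    """Classic RAID 1: every disk holds a full copy."""
    return size_tb, disks - 1


def raid10(disks, size_tb, mirror_width=2):
    """Stripe across mirrors of mirror_width disks each."""
    if disks % mirror_width:
        raise ValueError("disk count must be a multiple of mirror_width")
    return (disks // mirror_width) * size_tb, mirror_width - 1


for label, result in [
    ("RAID 6", raid6(6, 16)),
    ("RAID 1", raid1(6, 16)),
    ("RAID 10, two-way mirrors", raid10(6, 16)),
    ("RAID 10, three-way mirrors", raid10(6, 16, mirror_width=3)),
]:
    usable, tolerated = result
    print(f"{label}: {usable} TB usable, survives any {tolerated} drive failure(s)")
```

                  Under these assumptions, RAID 10 built from two-way mirrors gives 48 TB with one guaranteed drive failure, and three-way mirrors give 32 TB with two.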

                • ryao 3 days ago
                  Btrfs did not support that until Linux 5.5, when it added RAID1c3. Instead of mirroring across all of its devices, plain RAID1 just stores 2 copies, no matter how many mirror members you have.
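
                  A rough sketch of what that means for capacity and fault tolerance, assuming identical disks and ignoring metadata overhead. The copy counts for raid1, raid1c3 and raid1c4 are the btrfs profile definitions; the function name is hypothetical.

```python
BTRFS_COPIES = {"raid1": 2, "raid1c3": 3, "raid1c4": 4}


def btrfs_mirror_profile(profile, disks, size_tb):
    """Approximate usable TB and guaranteed failures tolerated for a copies-based profile."""
    copies = BTRFS_COPIES[profile]
    if disks < copies:
        raise ValueError(f"{profile} needs at least {copies} devices")
    # Every block is stored `copies` times, however many devices are in the pool.
    return disks * size_tb / copies, copies - 1


for profile in BTRFS_COPIES:
    usable, tolerated = btrfs_mirror_profile(profile, disks=6, size_tb=16)
    print(f"{profile}: about {usable:.0f} TB usable, survives any {tolerated} drive failure(s)")
```

                  So adding more devices to a plain btrfs raid1 grows capacity, not redundancy; two-drive fault tolerance needs raid1c3.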
          • brian_cunnie 4 days ago
            > I have personally been bitten once (about 10 years ago) by btrfs just failing horribly on a single desktop drive.

            Me, too. The drive was unrecoverable. I had to reinstall from scratch.

          • worthless-trash 3 days ago
            Licensing incompatibilities.
        • jeltz 3 days ago
          It is possible to corrupt the file system from user space as a normal user with Btrfs. The PostgreSQL devs found that when working on async IO. And as far as I know that issue has not been fixed.

          https://www.postgresql.org/message-id/CA%2BhUKGL-sZrfwcdme8j...

        • _joel 4 days ago
          I'm similar to some other people here, I guess once they've been bitten by data loss due to btrfs, it's difficult to advocate for it.
          • lousken 4 days ago
            I am assuming almost everybody at some point experienced data loss because they pulled out a flash drive too early. Is it safe to assume that we stopped using flash drives because of it?
            • _joel 4 days ago
              I'm not sure we have stopped using flash, judging by the pile of USB sticks on my desk :) In relation to the fs analogy if you used a flash drive that you know corrupted your data, you'd throw it away for one you know works.
              • ryao 3 days ago
                I once purchased a bunch of flash drives from Google’s online swag store and just unplugging them was often enough to put them in a state where they claimed to be 8MB devices and nothing I wrote to them was ever possible to read back in my limited tests. I stopped using those fast.
        • fourfour3 4 days ago
          Does Synology actually use the multi-device options of btrfs, or are they using Linux softraid + lvm underneath?

          I know Synology Hybrid RAID is a clever use of LVM + MD raid, for example.

          • phs2501 3 days ago
            I believe Synology runs btrfs on top of regular mdraid + lvm, possibly with patches to let btrfs checksum failures reach into the underlying layers to find the right data to recover.

            Related blog post: https://daltondur.st/syno_btrfs_1/

            • fourfour3a day ago
              That was very interesting reading, thanks!
      • cesarb4 days ago
        > Btrfs [...] still doesn't match ZFS in features [...]

        Isn't the feature in question (array expansion) precisely one which btrfs already had for a long time? Does ZFS have the opposite feature (shrinking the array), which AFAIK btrfs also already had for a long time?

        (And there's one feature which is important to many, "being in the upstream Linux kernel", that ZFS most likely will never have.)

        • wkat42424 days ago
          ZFS also had expansion for a long time but it was offline expansion. I don't know if btrfs has also had online for a long time?

          And shrinking no, that is a big missing feature in ZFS IMO. Understandable considering its heritage (large scale datacenters) but nevertheless an issue for home use.

          But raidz is rock-solid. Btrfs' raid is not.

          • unsnap_biceps3 days ago
            Raidz wasn't able to be expanded in place before this. You were able to add to a pool that included a raidz vdev, but that raidz vdev was immutable.
            • wkat42423 days ago
              Oh ok, I've never done this, but I thought it was already there. Maybe this was the original ZFS from Sun? But maybe I just remember it incorrectly, sorry.

              I've used it on multi-drive arrays but I never had the need for expansion.

              • ryao3 days ago
                You could add top level raidz vdevs or replace the members of a raid-z vdev with larger disks to increase storage space back then. You still have those options now.
      • honestSysAdmin3 days ago

          https://openzfs.github.io/openzfs-docs/Getting%20Started/index.html
        
        ZFS runs on all major Linux distros, the source is compiled locally and there is no meaningful license problem. In datacenter and "enterprise" environments we compile ZFS "statically" with other kernel modules all the time.

        For over six years now, there has been an "experimental" option presented by the graphical Ubuntu installer to install the root filesystem on ZFS. Almost everyone I personally know (just my anecdote) chooses this "experimental" option. There has been an occasion here and there of ZFS snapshots taking up too much space, but other than this there have not been any problems.

        I statically compile ZFS into a kernel that intentionally does not support loading modules on some of my personal laptops. My experience has been great, others' mileage may (certainly will) vary.

      • bayindirh4 days ago
        > Most Linux distros still use ext4 by default, which is 19 years old, but ext4 is little more than a series of extensions on top of ext2, which is the same age as NTFS.

        However, ext4 and XFS are much simpler and more performant than BTRFS & ZFS as root drives on personal systems and small servers.

        I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is.

        • ryao3 days ago
          ZFS will outscale ext4 in parallel workloads with ease. XFS will often scale better than ext4, but if you use L2ARC and SLOG devices, it is no contest. On top of that, you can use compression for an additional boost.

          You might also find ZFS outperforms both of them in read workloads on single disks where ARC minimizes cold cache effects. When I began using ZFS for my rootfs, I noticed my desktop environment became more responsive and I attributed that to ARC.

          • jeltz3 days ago
            Not on most database workloads. There zfs does not scale very well.
            • ryao3 days ago
              Percona and many others who benchmarked this properly would disagree with you. Percona found that ext4 and ZFS performed similarly when given identical hardware (with proper tuning of ZFS):

              https://www.percona.com/blog/mysql-zfs-performance-update/

              In this older comparison where they did not initially tune ZFS properly for the database, they found XFS to perform better, only for ZFS to outperform it when tuning was done and a L2ARC was added:

              https://www.percona.com/blog/about-zfs-performance/

              This is roughly what others find when they take the time to do proper tuning and benchmarks. ZFS outscales both ext4 and XFS, since it is a multiple block device filesystem that supports tiered storage while ext4 and XFS are single block device filesystems (with the exception of supporting journals on external drives). They need other things to provide them with scaling to multiple block devices and there is no block device level substitute for supporting tiered storage at the filesystem level.

              That said, ZFS has a killer feature that ext4 and XFS do not have, which is low cost replication. You can snapshot and send/recv without affecting system performance very much, so even in situations where ZFS is not at the top in every benchmark such as being on equal hardware, it still wins, since the performance penalty of database backups on ext4 and XFS is huge.

              • LtdJorge3 days ago
                There is no way that a CoW filesystem with parity calculations or striping is gonna beat XFS on multiple disks, especially on high speed NVMe.

                The article provides great insight into optimizing ZFS, but using an EBS volume as store (with pretty poor IOPS) and then giving the NVMe as metadata cache only for ZFS feels like cheating. At the very least, metadata for XFS could have been offloaded to the NVMe too. I bet if we set up XFS with its metadata and log on a RAMFS it will beat ZFS :)

                • ryao3 days ago
                  L2ARC is a cache. Cache is actually part of its full name, which is Level 2 Adaptive Replacement Cache. It is intended to make fast storage devices into extensions of the in memory Adaptive Replacement Cache. L2ARC functions as a victim cache. While L2ARC does cache metadata, it caches data too. You can disable the data caching, but performance typically suffers when you do. While you can put ZFS metadata on a special device if you want, that was not the configuration that Percona evaluated.
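To make the "victim cache" idea concrete, here is a toy sketch. It is not how OpenZFS implements ARC or L2ARC: both tiers are plain LRU lists here, a hit in the second tier simply moves the block back up, and the class name `TwoTierCache` and the `load_from_pool` callback are hypothetical. It only shows the general shape: blocks pushed out of the small first tier land in a larger second tier instead of being dropped.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: a small LRU "ARC" whose evicted blocks fall into a larger "L2ARC"."""

    def __init__(self, arc_size, l2arc_size):
        self.arc = OrderedDict()    # stands in for the in-memory cache
        self.l2arc = OrderedDict()  # stands in for the fast cache device
        self.arc_size = arc_size
        self.l2arc_size = l2arc_size

    def _insert_arc(self, key, value):
        self.arc[key] = value
        self.arc.move_to_end(key)
        # Anything evicted from the first tier becomes a "victim" kept in the second.
        while len(self.arc) > self.arc_size:
            victim_key, victim_value = self.arc.popitem(last=False)
            self.l2arc[victim_key] = victim_value
            self.l2arc.move_to_end(victim_key)
            while len(self.l2arc) > self.l2arc_size:
                self.l2arc.popitem(last=False)

    def read(self, key, load_from_pool):
        """Return (value, tier), where tier is 'arc', 'l2arc' or 'pool'."""
        if key in self.arc:
            self.arc.move_to_end(key)
            return self.arc[key], "arc"
        if key in self.l2arc:
            value = self.l2arc.pop(key)
            self._insert_arc(key, value)
            return value, "l2arc"
        value = load_from_pool(key)  # slowest path: go to main storage
        self._insert_arc(key, value)
        return value, "pool"
```

With `arc_size=2`, reading blocks 1, 2, 3 and then 1 again serves the last read from the second tier rather than from main storage, which is the effect being described.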

                  If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do. Using a feature ZFS has to improve performance at a price point that XFS cannot match is competition, not cheating.

                  ZFS cleverly uses CoW in a way that eliminates the need for a journal, which is overhead for XFS. CoW also enables ZFS' best advantage over XFS, which is that database backups on ZFS via snapshots and (incremental) send/recv affect system performance minimally where backups on XFS are extremely disruptive to performance. Percona had high praise for database backups on ZFS:

                  https://www.percona.com/blog/zfs-for-mongodb-backups/

                  Finally, there were no parity calculations in the configurations that Percona tested. Did you post a preformed opinion without taking the time to actually understand the configurations used in Percona's benchmarks?

                  • LtdJorge10 hours ago
                    No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes. The point about the second paragraph was that using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

                    About the inherent advantages of ZFS like send/recv, I have nothing to say. I know how good they are. It's one reason I use ZFS.

                    > If you do proper testing, you will find ZFS does beat XFS if you scale it. Its L2ARC devices are able to improve IOPS of storage cheaply, which XFS cannot do.

                    What does proper testing here mean? And what does "if you scale it" mean? Genuinely. From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting. What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

                    Edit: P4800x, actually. The flash disk are D5-P5530.

                    • ryao8 hours ago
                      > No I didn't. I separated my thoughts in two paragraphs, the first doesn't have anything to do with the articles, it was just about the general use case for ZFS, which is using it with redundant hardware. I also conflated the L2ARC with metadata device, yes.

                      That makes sense.

                      > The point about the second paragraph was that the using a much faster device just on one of the comparisons doesn't seem fair to me. Of course, if you had a 1TB ZFS HDD and 1TB of RAM as ARC the "HDD" would be the fastest on earth, lol.

                      It is a balancing act. It is a feature ZFS has that XFS does not, but it is ridiculous to use a device that can fit the entire database as L2ARC, since in that case, you can just use that device directly and keeping it as a cache for ZFS does not make for a fair or realistic comparison. Fast devices that can be used with tiered storage are generally too small to be used as main storage, since if you could use them as main storage, you would.

                      With the caveat that the higher tier should be too small to be used as main storage, you can get a huge boost from being able to use it as cache in tiered storage, and that is why ZFS has L2ARC.

                      > What does proper testing here mean? And what does "if you scale it" mean?

                      Let me preface my answer by saying that doing good benchmarks is often hard, so I can't give a simple answer here. However, I can give a long answer.

                      First, small databases that can fit entirely in RAM cache (be it the database's own userland cache or a kernel cache) are pointless to benchmark. In general, anything can run that well (since it is really running out of RAM as you pointed out). The database needs to be significantly larger than RAM.

                      Second, when it comes to using tiered storage, the purpose of doing tiering is that the faster tier is either too small or too expensive to use for the entire database. If the database size is small enough that it is inexpensive to use the higher tier for general storage, then a test where ZFS gets the higher tiered storage for use as cache is neither fair nor realistic. Thus, we need to scale the database to a larger size such that the higher tier being only usable as cache is a realistic scenario. This is what I had in mind when I said "if you scale it".

                      Third, we need to test workloads that are representative of real things. This part is hard and the last time I did it was 2015 (I had previously said 2016, but upon recollection, I realized it was likely 2015). When I did, I used a proprietary workload simulator that was provided by my job. It might have been from SPEC, but I am not sure.

                      Fourth, we need to tune things properly. I wrote the following documentation years ago describing correct tuning for ZFS:

                      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

                      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

                      At the time I wrote that, I omitted that tuning the I/O elevator can also improve performance, since there is no one size fits all advice for how to do it. Here is some documentation for that which someone else wrote:

                      https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

                      If you are using SSDs, you could probably just get away with setting each of the maximum asynchronous queue depth limits to something like 64 (or even 256) and benchmark that.
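As a rough illustration of that suggestion, the sketch below renders a `modprobe.d`-style options line. It assumes the limits meant are the OpenZFS module parameters `zfs_vdev_async_read_max_active` and `zfs_vdev_async_write_max_active`; whether other queue limits should be raised too is not stated above, and the helper name `zfs_modprobe_options` is made up for the example.

```python
def zfs_modprobe_options(async_max_active=64):
    """Build an 'options zfs ...' line raising the async queue depth limits.

    Assumption: only the async read/write max_active parameters are touched.
    """
    params = {
        "zfs_vdev_async_read_max_active": async_max_active,
        "zfs_vdev_async_write_max_active": async_max_active,
    }
    settings = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options zfs {settings}"


if __name__ == "__main__":
    # The output could go into a file such as /etc/modprobe.d/zfs.conf.
    print(zfs_modprobe_options(64))
```

Any value chosen this way is a starting point to benchmark, not a recommendation.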

                      > From my basic testing and what I've got from online benchmarks, ZFS tends to be a bit slower than XFS in general. Of course, my testing is not thorough because there are many things to tune and documentation is scattered around and sometimes conflicting.

                      In 2015 when I did database benchmarks, ZFS and XFS were given equal hardware. The hardware was a fairly beefy EC2 instance with 4x high end SSDs. MD RAID 0 was used under XFS while ZFS was given the devices in what was effectively a RAID 0 configuration. With proper tuning (what I described earlier in this reply), I was able to achieve 85% of XFS performance in that configuration. This was considered a win due to the previously stated reason of performance under database backups. ZFS has since had performance improvements done, which would probably narrow the gap. It now uses B-Trees internally to do operations faster and also now has redundant_metadata=most, which was added for database workloads.

                      Anyway, on equal hardware in a general performance comparison, I would expect ZFS to lose to XFS, but not by much. ZFS' ability to use tiered storage and do low overhead backups is what would put it ahead.

                      > What would you say is a configuration where ZFS will beat XFS on flash? I have 4x Intel U.2 drives with 2x P5800X empty as can be, I could test on them right now. I wanna make clear, that I'm not saying it's 100% impossible ZFS beats XFS, just that I find it very unlikely.

                      You need to have a database whose size is so big that optane storage is not practical to use for main storage. Then you need to setup ZFS with Optane storage as L2ARC. You can give regular flash drives to ZFS and XFS on MD RAID in a comparable configuration (RAID 0 to make life easier, although in practice you probably want to use RAID 10). You will want to follow best practices for tuning the database and filesystems (although from what I know, XFS has remarkably few knobs). You could give XFS the optane devices to use for metadata and its journal for fairness, although I do not expect it to help XFS enough. In this situation, ZFS should win on performance.

                      You would need to pick a database for this. One option would be PostgreSQL, which is probably the main open source database that people would scale to such levels. The pgbench tool likely could be used for benchmarking.

                      https://www.postgresql.org/docs/current/pgbench.html

                      You would need to pick a scaling factor that will make the database big enough and do a workload simulating a large number of clients (what is large is open to interpretation).

                      Finally, I probably should add that the default script used by pgbench probably is not very realistic for a database workload. A real database will have a good proportion of reads from select queries (at least 50%) while the script that is being used does a write mostly workload. It probably should be changed. How is probably an exercise best left for a reader. That is not the answer you probably want to hear, but I did say earlier in this reply that doing proper benchmarks is hard, and I do not know offhand how to adjust the script to be more representative of real workloads. That said, there is definite utility in benchmarking write mostly workloads too, although that utility is probably more applicable for the database developers than as a way to determine which of two filesystems is better for running the database.
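One possible starting point, sketched under stated assumptions: the helpers below only build `pgbench` command lines. The flags used (`-i`, `-s`, `-c`, `-j`, `-T`, and `-b name@weight` with the built-in `select-only` and `tpcb-like` scripts) are documented pgbench options, but the 50/50 weighting, the function names, and the bytes-per-scale-unit figure are illustrative guesses that should be checked against a real run.

```python
import math

# Documented: each unit of -s adds 100,000 rows to pgbench_accounts.
ROWS_PER_SCALE_UNIT = 100_000


def scale_for_size(target_bytes, bytes_per_scale_unit=16 * 1024**2):
    """Estimate a -s value for a target database size.

    bytes_per_scale_unit is a rough assumption; measure it on your own system.
    """
    return max(1, math.ceil(target_bytes / bytes_per_scale_unit))


def init_command(dbname, scale):
    """Command line that initializes the pgbench tables at the given scale."""
    return ["pgbench", "-i", "-s", str(scale), dbname]


def run_command(dbname, clients, threads, seconds, select_weight=5, write_weight=5):
    """Command line mixing read-only and write-heavy built-in scripts by weight."""
    return [
        "pgbench",
        "-c", str(clients),
        "-j", str(threads),
        "-T", str(seconds),
        "-b", f"select-only@{select_weight}",
        "-b", f"tpcb-like@{write_weight}",
        dbname,
    ]


if __name__ == "__main__":
    scale = scale_for_size(1024**4)  # aim for roughly 1 TiB
    print(" ".join(init_command("bench", scale)))
    print(" ".join(run_command("bench", clients=256, threads=16, seconds=3600)))
```

The weights only control how often each script is picked per transaction, so this is a crude read/write mix rather than a model of any particular application.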

              • menaerus3 days ago
                Refuting the "it doesn't scale" argument with data from a blog that showcases a single workload (TPC-C) with 200G+10tables dataset (small to medium) at 2vCPU (wtf) machine with 16 connections (no thread pool so overprovisioned) is not quite a definition of a scale at all. It's a lost experiment if anything.
                • ryao2 days ago
                  The guy did not have any data to justify his claims of not scaling. Percona’s data says otherwise. If you don’t like how they got their data, then I advise you to do your own benchmarks.
                  • jeltz2 days ago
                    It is based on data from internal benchmarks. Zfs is fine for database workloads but scales worse than Xfs based on my personal experience. They are unpublished benchmarks and I do not have access to any farm to win a discussion on the internet.
                    • ryao2 days ago
                      I did internal benchmarks at ClusterHQ in 2016. Those benchmarks showed that a tuned ZFS FS of the time had 85% the performance of XFS on equal hardware (a beefy EC2 instance with 4 SSDs, with XFS using MD RAID 0), but it was considered a win for ZFS because of the performance difference when running backups. L2ARC was not considered since the underlying storage was already SSD based and there was nothing faster, but in practice, you often can use it with a faster tier of storage and that puts ZFS ahead even without considering the substantial performance dips of backups.
                  • menaerus2 days ago
                    I don't have anything to like or not to like. I'm not a user of ZFS filesystem. I'm just dismissing your invalid argumentation. Percona's data is nothing about the scale for reasons I already mentioned.
                    • ryao2 days ago
                      The argument he made was invalid without data to back it up. I at least cited something. The remarks on the performance when backups are made and the benefits of L2ARC were really the most important points, and are far from invalid.
          • bayindirh3 days ago
            No doubt. I want to reiterate my point. Citing myself:

            > "I personally won't use either on a single disk system as root FS, regardless of how fast my storage subsystem is." (emphasis mine)

            We are no strangers to filesystems. I personally benchmarked a ZFS7320 extensively, writing a characterization report, plus we have had a ZFS7420 for a very long time, complete with separate log SSDs for read and write on every box.

            However, ZFS is not saturation proof, plus is nowhere near a Lustre cluster performance wise, when scaled.

            What kills ZFS and BTRFS on desktop systems is write performance, esp. on heavy workloads like system updates. If I need a desktop server (performance-wise), I'd configure it accordingly and use these, but I'd never use BTRFS or ZFS on a single root disk due to their overhead, to reiterate myself thrice.

            • ryao3 days ago
              I am generally happy with the write performance of ZFS. I have not noticed slow system updates on ZFS (although I run Gentoo, so slow is relative here). In what ways is the write performance bad?

              I am one of the OpenZFS contributors (although I am less active as of late). If you bring some deficiency to my attention, there is a chance I might spend the time needed to improve upon it.

              By the way, ZFS limits the outstanding IO queue depth to try to keep latencies down as a type of QoS, but you can tune it to allow larger IO queue depths, which should improve write performance. If your issue is related to that, it is an area that is known to be able to use improvement in certain situations:

              https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

              https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

              https://openzfs.github.io/openzfs-docs/Performance%20and%20T...

              • bayindirh3 days ago
                What I see with CoW filesystems is, when you force the FS to sync a lot (like apt does to keep immunity against power losses to a maximum), the write performance slouches visibly. This also means that when you're writing a lot of small files with a lot of processes and flood the FS with syncs, you get the same slouching, making everything slower in the process. This effect is better controlled in simpler filesystems, namely XFS and EXT4. This is why I keep backups elsewhere and keep my single disk rootfs on "simple" filesystems.

                I'll be installing a 2 disk OpenZFS RAID1 volume on a SBC for high value files soon-ish, and I might be doing some tests on that when it's up. Honestly, I don't expect stellar performance since I'll be already putting it on constrained hardware, but I'll let you know if I experience anything that doesn't feel right.

                Thanks for the doc links, I'll be devouring them when my volume is up and running.

                Where do you prefer your (bug and other) reports? GitHub? E-mail? IP over Avian Carriers?

                • ryao2 days ago
                  Heavy synchronous IO from incredibly frequent fsync is a weak point. You can make it better using SLOG devices. I realize what I am about to say is not what you want to hear, but any application doing excessive fsync operations is probably doing things wrong. This is a view that you will find prevalent among all filesystem developers (i.e. the ext4 and XFS guys will have this view too). That is because all filesystems run significantly faster when fsync() is used sparingly.

                  In the case of APT, it should install all of the files and then call sync() once. This is equivalent to calling fsync on every file like APT currently does, but aggregates it for efficiency. The reason APT does not use sync() is probably a portability thing, because the standard does not require sync() to be blocking, but on Linux it is:

                  https://www.man7.org/linux/man-pages/man2/sync.2.html

                  From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package. Thus it does not really matter for power loss protection if you are using fsync() on all files or sync() once for all files, since what must happen next to fix it is the same. However, from a performance perspective, it really does matter.
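The two strategies being compared can be sketched roughly as follows. This is not APT's or dpkg's actual code; the function names and the flat `files` mapping are invented for illustration, and `os.sync()` is Python's wrapper around the system-wide sync() call.

```python
import os


def install_with_fsync_per_file(files, dest_dir):
    """Flush every file to stable storage individually (one fsync per file)."""
    for name, data in files.items():
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # blocks until this one file is on disk


def install_with_single_sync(files, dest_dir):
    """Write everything first, then flush once at the end."""
    for name, data in files.items():
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(data)
    os.sync()  # one system-wide flush covering all of the files above
```

Both variants end with the data on stable storage. The difference is how many times the filesystem is forced to stop and flush along the way.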

                  That said, slow fsync performance generally is not an issue for desktop workloads because they rarely ever use fsync. APT is the main exception. You are the first to complain about APT performance in years as far as I know (there were fixes to improve APT performance 10 years ago, when its performance was truly horrendous).

                  You can file bug reports against ZFS here:

                  https://github.com/openzfs/zfs

                  I suggest filing a bug report against APT. There is no reason for it to be doing fsync calls on every file it installs in the filesystem. It is inefficient.

                  • bayindirh2 days ago
                    Actually this was discussed recently [0]. While everybody knows it's not efficient, it's required to keep the update process resilient against unwanted shutdowns (like power losses which corrupt the filesystem due to uncommitted work left on the filesystem).

                    > From a power loss perspective, if power is lost when installing a package into the filesystem, you need to repair the package.

                    Yes, but at least you have all the files, otherwise you can have 0 length files which can prevent you from booting your system. In this case, your system boots, all files are in place, but some packages are in semi-configured state. Believe me, apt can recover from many nasty corners without any ill effects as long as all files are there. I used to be a tech-lead for a Debian derivative back in the day, so I lived in the trenches in Debian for a long time, so I have seen things.

                    Again it's decided that the massive sync will stay in place for now, because the risks involved in the wild don't justify the performance difference yet. If you prefer to be reckless, there's "eatmydata" and "--force-unsafe-io" options baked in already.

                    Thanks for the links, I'll let you know if I find something. I just need to build the machine from the parts I have, then I'll be off to the races.

                    [0]: https://lists.debian.org/debian-devel/2024/12/msg00533.html [warning, long thread]

                    • ryao2 days ago
                      This email mentions a bunch of operations that are done per file to ensure the file put in the final location always has the correct contents:

                      https://lists.debian.org/debian-devel/2024/12/msg00540.html

                      It claims that the fsync is needed to avoid the file appearing at the final location with a zero length after a power loss. This is not true on ZFS.

                      ZFS puts every filesystem operation into a transaction group that is committed atomically about every 5 seconds by default. On power loss, the transaction group either succeeds or never happens. The result is that even without using fsync, there will never be a zero length file at the final location because the rename being part of a successful transaction group commit implies that the earlier writes also were part of a successful transaction group commit.
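The ordering argument can be shown with a toy model. This is only an illustration of the reasoning above, not ZFS internals: the `ToyPool` class is hypothetical, files are dictionary entries, and a "crash" simply discards whatever has not been committed yet.

```python
class ToyPool:
    """Toy model of atomic transaction groups: all-or-nothing, applied in order."""

    def __init__(self):
        self.on_disk = {}   # state as of the last committed transaction group
        self.open_txg = []  # operations issued since then, in order

    def write(self, path, data):
        self.open_txg.append(("write", path, data))

    def rename(self, src, dst):
        self.open_txg.append(("rename", src, dst))

    def commit(self):
        """Apply every pending operation in order, then publish the result at once."""
        new_state = dict(self.on_disk)
        for op, first, second in self.open_txg:
            if op == "write":
                new_state[first] = second
            else:
                new_state[second] = new_state.pop(first)
        self.on_disk = new_state
        self.open_txg = []

    def crash(self):
        """Power loss: the open group never happened."""
        self.open_txg = []
        return dict(self.on_disk)
```

In this model a write followed by a rename ends up in one of three states after a crash: neither visible, only the temporary file visible with its contents, or the final name visible with its contents. The final name with empty contents is not reachable, because the rename is always issued after the write.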

                      The result is that you can use --force-unsafe-io with dpkg on ZFS, things will run faster and there should be no issues for power loss recovery as far as zero length files go.

                      The following email mentions that sync() had been used at one point but caused problems when flash drives were connected, so it was dropped:

                      https://lists.debian.org/debian-devel/2024/12/msg00597.html

                      The timeline is unclear, but I suspect this happened before Linux 2.6.39 introduced syncfs(), which would have addressed that. Unfortunately, it would have had problems for systems with things like a separate /usr mount, which requires the package manager to realize multiple syncfs calls are needed. It sounds like dpkg was calling sync() per file, which is even worse than calling fsync() per file, although it would have ensured that the directory entries for prior files were there following a power loss event.

                      The email also mentions that fsync is not called on directories. The result is that a power loss event (on any Linux filesystem, not just ZFS) could have the files missing from multiple packages marked as installed in the package database, which is said to use fsync to properly record installations. I find this situation weird since I would use sync() to avoid this, but if they are comfortable having systems have multiple “installed” packages missing files in the filesystem after a power loss, then there is no need to use sync().

                • gf0003 days ago
                  Hi! I am quite a beginner when it comes to file systems. Would this sync effect not be helped by direct IO in ZFS's case?

                  Also, given that you seem quite knowledgeable of the topic, what is your go-to backup solution?

                  I initially thought about storing `zfs send` files into backblaze (as backup at a different location), but without recv-ing these, I don't think the usual checksumming works properly. I can checksum the whole before and after updating, but I'm not convinced if this is the best solution.
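For reference, the "checksum the whole file" idea could look roughly like the sketch below. It assumes the `zfs send` stream has already been saved to a local file; the function names are made up. It only shows that two copies are byte-identical and says nothing about whether the stream itself would be accepted by `zfs recv`.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a (possibly very large) file without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original_path, downloaded_path):
    """True if the downloaded stream file matches the one that was uploaded."""
    return sha256_of_file(original_path) == sha256_of_file(downloaded_path)
```

In practice the hash of the original would be recorded at upload time and compared against a fresh hash of whatever comes back from the remote store.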

                  • ryao2 days ago
                    No, it will not. It would be helped by APT switching to using a single sync/syncfs call after installing all files, which is the performant way to do what it wants on Linux:

                    https://www.man7.org/linux/man-pages/man2/sync.2.html

                    • ryao2 days ago
                      After studying the DPKG developers’ reasoning for using fsync excessively, it turns out that there is no need for them to use fsync on a ZFS rootfs. When the rootfs is ZFS, you can use --force-unsafe-io to skip the fsync operations for a speed improvement and there will be no safety issues due to how ZFS is designed.

                      DPKG will write each file to a temporary location and then rename it to the final location. On ext4, without fsync, when a power loss event occurs, it is possible for the rename to the final location to be done, without any of the writes such that you have a zero length file. On ZFS, the rename being done after the writes means that the rename being done implies the writes were done due to the sequential nature of ZFS’ transaction group commit, so the file will never appear in the final location without the file contents following a power loss event, which is why ZFS does not need the fsync there.
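The write-then-rename pattern described here looks roughly like this. It is a simplified sketch rather than dpkg's real implementation: `install_file` is a hypothetical helper, the `.dpkg-new` suffix is used only for illustration, and the `use_fsync` flag stands in for the behaviour that --force-unsafe-io turns off.

```python
import os


def install_file(final_path, data, use_fsync=True):
    """Write to a temporary name, optionally fsync, then rename into place."""
    tmp_path = final_path + ".dpkg-new"  # illustrative temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        if use_fsync:
            # Force the contents to disk before the rename can become visible.
            f.flush()
            os.fsync(f.fileno())
    # Atomic replace: readers see either the old file or the complete new one.
    os.replace(tmp_path, final_path)
```

The argument above is that on ZFS the `use_fsync=False` path is still safe against zero length files, because the rename cannot be committed ahead of the writes that preceded it.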

      • xattt4 days ago
        ZFS on OS X was killed because of Oracle licensing drama. I don’t expect anything better on Windows either.
        • ryao4 days ago
          There is a third party port here:

          https://openzfsonosx.org/wiki/Main_Page

          It was actually the NetApp lawsuit that caused problems for Apple’s adoption of ZFS. Apple wanted indemnification from Sun because of the lawsuit, Sun’s CEO did not sign the agreement before Oracle’s acquisition of Sun happened and Oracle had no interest in granting that, so the official Apple port was cancelled.

          I heard this second hand years later from people who were insiders at Sun.

          • xattt4 days ago
            That’s a shame re: NetApp/ZFS.

            While third-party ports are great, they lack deep integration that first-party support would have brought (non-kludgy Time Machine which is technically fixed with APFS).

        • throw0101a4 days ago
          > ZFS on OS X was killed because of Oracle licensing drama.

          It was killed because Apple and Sun couldn't agree on a 'support contract'. From Jeff Bonwick, one of the co-creators of ZFS:

          >> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.

          > I cannot disclose details, but that is the essence of it.

          * https://archive.is/http://mail.opensolaris.org/pipermail/zfs...

          Apple took DTrace, licensed via CDDL just like ZFS, and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.

        • BSDobelix4 days ago
          >ZFS on OS X was killed because of Oracle licensing drama.

          Nah, it was Jobs' ego, not the license:

          >>Only one person at Steve Jobs' company announces new products: Steve Jobs.

          https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...

          • bolognafairy4 days ago
            It’s a cute story that plays into the same old assertions about Steve Jobs, but the conclusion is mostly baseless. There are many other, more credible, less conspiratorial, possible explanations.
            • wkat42424 days ago
              It could have played into it though, but I agree the support contract that couldn't be worked out mentioned elsewhere in the thread is more likely.

              But I think these things are usually a combination. When a business relationship sours, agreements are suddenly much harder to work out. The negotiators are still people and they have feelings that will affect their decisionmaking.

            • BSDobelix4 days ago
              [flagged]
      • nabla94 days ago
        License is not a real issue. It must be just distributed in separate module. No big hurdle.
        • crest4 days ago
          The main hurdle is hostile Linux kernel developers who aren't held accountable for intentionally breaking ZFS for their own petty ideological reasons, e.g. removing the in-kernel FPU/SIMD register save/restore API and replacing it with a "new" API to do the same.

          What's "new" about the "new" API? Its symbols are GPL2 only to deny its use to non-GPL2 modules (like ZFS). Guess that's an easy way to make sure that BTRFS is faster than ZFS or set yourself up as the (to be) injured party.

          Of course a reimplementation of the old API in terms of the new is an evil "GPL condom" violating the kernel license, right? Why can't you see ZFS's CDDL license is the real problem here for being the wrong flavour of copyleft license. Way to claim the moral high ground you short-sighted, bigoted pricks. sigh

        • Jnr4 days ago
          From my point of view it is a real usability issue.

          zfs modules are not in the official repos. You either have to compile it on each machine or use unofficial repos, which is not exactly ideal and can break things if those repos are not up to date. And I guess it also needs some additional steps for secureboot setup on some distros?

          I really want to try zfs because btrfs has some issues with RAID5 and RAID6 (it is not recommended so I don't use it) but I am not sure I want to risk the overall system stability, I would not want to end up in a situation where my machines don't boot and I have to fix it manually.

          • chillfox4 days ago
            I have been using ZFS on Mint and Alpine Linux for years for all drives (including root) and have never had an issue. It's been fantastic and is super fast. My linux/zfs laptop loads games much faster than an identical machine running Windows.

            I have never had data corruption issues with ZFS, but I have had both xfs and ext4 destroy entire discs.

          • harshreality4 days ago
            Why are you considering raid5/6? Are you considering building a large storage array? If the data will fit comfortably (50-60% utilization) on one drive, all you need is raid1. Btrfs is fine for raid1 (raid1c3 for extra redundancy); it might have hidden bugs, but no filesystem is immune from those; zfs had a data loss bug (it was rare, but it happened) a year ago.

            Why use zfs for a boot partition? Unless you're using every disk mounting point and nvme slot for a single large raid array, you can use a cheap 512GB nvme drive or old spare 2.5" ssd for the boot volume. Or two, in btrfs raid1 if you absolutely must... but do you even need redundancy or datasum (which can hurt performance) to protect OS files? Do you really care if static package files get corrupted? Those are easily reinstalled, and modern quality brand SSDs are quite reliable.

            • Jnr3 days ago
              I am already using ext4 for /boot and / on nvme, and I am happy with that.

              I want to use raid 5 for the large storage mount point that holds non-OS files. I want both space and redundancy. Currently I have several separate raid1 btrfs mounts, since raid5 on btrfs is recommended against.

        • GuB-424 days ago
          It is a problem because most of the internal kernel APIs are GPL-only, which limits the abilities of the ZFS module. It is a common source of argument between the Linux guys and the ZFS on Linux guys.

          The reason for this is not just to piss off non-GPL module developers. GPL-only internal APIs are subject to change without notice, even more so than the rest of the kernel. And because the licence may not allow the Linux kernel developers to make the necessary changes to the module when it happens, there is a good chance it breaks without warning.

          And even with that, all internal APIs may change, it is just a bit less likely than for the GPL-only ones, and because ZFS on Linux is a separate module, there is no guarantee for it to not break with successive Linux versions, in fact, it is more like a guarantee that it will break.

          Linux is proudly monolithic, and with a constantly evolving monolithic kernel, developers need to have control over the entire project. It is also community-driven. Combined, you need rules to have the community work together, or everything will break down, and that's what the GPL is for.

        • nijave3 days ago
          I remember it being a pain in the ass on Fedora which tracks closely to mainline. Frequently a new kernel version would come out that zfs module didn't support so you'd have to downgrade and hold back the package until support was added.

          Fedora packages zfs-fuse. I think some distros have arrangements to make sure kernels have zfs support. It may be less of a headache on those.

          In tree fs don't break that way

      • nijave3 days ago
        You've been able to add and remove devices at will for a long time with btrfs (only recently supported in zfs with lots of caveats)

        Btrfs also supports async/offline dedupe

        You can also layer it on top of mdadm. Iirc zfs strongly discourages using anything but direct attached physical disks.

      • BSDobelix4 days ago
        >ZFS has license issues with Linux, preventing full integration

        No one wants that, openZFS is much healthier without Linux and its "Foundation/Politics".

        • bhaney4 days ago
          > No one wants that

          I want that

          • BSDobelix4 days ago
            Then let me tell you that FreeBSD or OmniOS is what you really want ;)
            • bhaney4 days ago
              You're now 0 for 2 at telling me what I want
              • BSDobelix4 days ago
                The customer is not always right, however a good/modern Filesystem really would be something for Linux ;)
                • ruthmarx3 days ago
                  > The customer is not always right,

                  An uninvited door-to-door salesman is rarely, if ever right.

                  • BSDobelix3 days ago
                    HN is more like a Tupperware party. ;)
                    • ruthmarx2 days ago
                      Well then you ought to go somewhere more appreciative of your pitches ;)
      • LtdJorge3 days ago
        XFS is 22 and still the best in-tree FS there is :)
    • bayindirh4 days ago
      > How come that Windows still uses a 32 year old file system?

      Simple. Because most of the burden is taken by the (enterprise) storage hardware hosting the FS. Snapshots, block level deduplication, object storage technologies, RAID/Resiliency, size changes, you name it.

      Modern storage appliances are black magic, and you don't need much more features from NTFS. You either transparently access via NAS/SAN or store your NTFS volumes on capable disk boxes.

      On the Linux world, at the higher end, there's Lustre and GPFS. ZFS is mostly for resilient, but not performance critical needs.

      • BSDobelix4 days ago
        >ZFS is mostly for resilient, but not performance critical needs.

        Los Alamos disagrees ;)

        https://www.lanl.gov/media/news/0321-computational-storage

        But yes, in general you are right, Cern for example uses Ceph:

        https://indico.cern.ch/event/1457076/attachments/2934445/515...

        • bayindirh4 days ago
          I think what LLNL did predates GPUDirect and other new technologies came after 2022, but that's a good start.

          CERN's Ceph is also for their "General IT" needs. Their clusters are independent from that. Also, most of CERN's processing is distributed across Europe. We are part of that network.

          Many, if not all, of the HPC centers we talk with use Lustre as their "immediate" storage. Also, there's Weka now, a closed source storage system supporting insane speeds and tons of protocols at the same time. Mostly used for and by GPU clusters around the world. You connect terabits to that cluster casually. It's all flash, and flat out fast.

          • ryao3 days ago
            Did you confuse LANL for LLNL?
            • bayindirh3 days ago
              It's just a typo, not a confusion, and I'm well beyond the edit window.
      • poisonborz4 days ago
        So private consumers should just pay cloud subscription if they want safer/modern data storage for their PC? (without NAS)
        • shrubble4 days ago
          No, private consumers have a choice, since Linux and FreeBSD run well on their hardware. Microsoft is too busy shoveling their crappy AI and convincing OEMs to put a second Windows button (the CoPilot button) on their keyboards.
        • bluGill4 days ago
          Probably. There are levels of backups, and a cloud subscription SHOULD give you copies in geographically separate locations with someone to help you (who probably isn't into computers and doesn't want to learn the complex details) restore when (NOT IF!) needed.

          I have all my backups on a NAS in the next room. This covers the vast majority of use cases for backups, but if my house burns down everything is lost. I know I'm taking that risk, but really I should have better. Just paying someone to do it all in the cloud should be better for me as well and I keep thinking I should do this.

          Of course paying someone assumes they will do their job. There are always incompetent companies out there to take your money.

          • pdimitar4 days ago
            My setup is similar to yours, but I also distribute my most important data in compressed (<5GB) encrypted backups to several free-tier cloud storage accounts. I could restore it by copying one key and running one script.

            I lost faith in most paid operators. Whoops, this thing that absolutely can happen to home users and we're supposed to protect them from now actually happened to us and we were not prepared. We're so sorry!

            Nah. Give me access to 5-15 cloud storage accounts, I'll handle it myself. Have done so for years.

        • BSDobelix4 days ago
          If you need Windows, you can use something like restic (checksums and compression) and external drives (more than one, stored in more than one place) to make a backup. Plus "maybe" but not needed ReFS (on your non-Windows partition), which is included in the Workstation/Enterprise editions of Windows.

          I trust my own backups much more than any subscription, not essentially from a technical point of view, but from an access point of view (e.g. losing access to your Google account).

          EDIT: You have to enable check-summing and/or compression for data on ReFS manually

          https://learn.microsoft.com/en-us/windows-server/storage/ref...

          • bayindirh4 days ago
            > I trust my own backups much more than any subscription, not from a technical standpoint but from an access one (for example, losing access to your google account).

            I personally use cloud storage extensively, but I keep a local version with periodic rclone/borg. It allows me access from everywhere and sleep well at night.

          • qwertox4 days ago
            NTFS has Volume Shadow Copy, which is "good enough" for private users if they want to create image backups while their system is running.
            • BSDobelix4 days ago
              First of all, that's not a backup, that's a snapshot, and NO, that's not "good enough", tell your grandma that all her digitised pictures are gone because her hard drive exploded, or that one most important jpeg is now unwatchable because of bitrot.

              Just because someone is a private user doesn't mean that the data is less important, often it's quite the opposite, for example a family album vs your cloned git repository.

              • tjoff4 days ago
                ... VSS is used to create backups. Re-read parent.
                • BSDobelix3 days ago
                  Not good enough, you can make 10000 backups of bitrotten data, if you don't have check-sums on your block (zfs) or files (restic) nothing can help you. That's the same integrity as copying stuff onto your thumb drive.
                  • qwertox3 days ago
                    The same applies to those filesystems on Linux which don't check for bit-rottenness, which will be the majority of installs.

                    Your average grandma would use ext4 when using Linux. Android phones don't do that either and I don't know about iOS, but apparently APFS only does metadata checksumming.

                    • BSDobelix3 days ago
                      >> if you don't have check-sums on your block (zfs) or files (restic)

                      ....

        • bayindirh4 days ago
          I think Microsoft has discontinued Windows 7 backup to force people to buy OneDrive subscriptions. They also forcefully enabled the feature when they first introduced it.

          So, I think that your answer for this question is "unfortunately, yes".

          Not that I support the situation.

        • NoMoreNicksLeft3 days ago
          Having a NAS is life-changing. Doesn't have to be some large 20-bay monstrosity, just something that will give you redundancy and has an ethernet jack.
        • j16sdiz3 days ago
          No, if they need ZFS-like function, they just pay for NAS.

          ZFS is not in the same market with AWS S3.

    • mustache_kimono4 days ago
      > I just don't get it how the Windows world - by far the largest PC platform per userbase - still doesn't have any answer to ZFS.

      The mainline Linux kernel doesn't either, and I think the answer is because it's hard and high risk with a return mostly measured in technical respect?

      • ffsm84 days ago
        Technically speaking, bcachefs has been merged into the Linux Kernel - that makes your initial assertion wrong.

        But considering it's had two drama events within 1 year of getting merged... I think we can safely confirm your conclusion of it being really hard

        • mustache_kimono4 days ago
          > Technically speaking, bcachefs has been merged into the Linux Kernel - that makes your initial assertion wrong.

          bcachefs doesn't implement its erasure coding/RAID yet? Doesn't implement send/receive. Doesn't implement scrub/fsck. See: https://bcachefs.org/Roadmap, https://bcachefs.org/Wishlist/

          btrfs is still more of a legit competitor to ZFS these days and it isn't close to touching ZFS where it matters. If the perpetually half-finished bcachefs and btrfs are the "answer" to ZFS that seems like too little, too late to me.

          • koverstreet4 days ago
            Erasure coding is almost done; all that's missing is some of the device evacuate and reconstruct paths, and people have been testing it and giving positive feedback (especially w.r.t. performance).

            It most definitely does have fsck and has since the beginning, and it's a much more robust and dependable fsck than btrfs's. Scrub isn't quite done - I actually was going to have it ready for this upcoming merge window except for a nasty bout of salmonella :)

            Send/recv is a long ways off, there might be some low level database improvements needed before that lands.

            Short term (next year or two) priorities are finishing off online fsck, more scalability work (upcoming version for this merge window will do 50PB, but now we need to up the limit on number of drives), and quashing bugs.

            • ryao4 days ago
              Hearing that it is missing some code for reconstruction makes it sound like it is missing something fairly important. The original purpose of parity RAID is to support reconstruction.
              • koverstreet3 days ago
                We can do reconstruct reads, what's missing is the code to rewrite missing blocks in a stripe after a drive dies.

                In general, due to the scope of the project, I've been prioritizing the functionality that's needed to validate the design and the parts that are needed for getting the relationships between different components correct.

                e.g. recently I've been doing a bunch of work on backpointers scalability, and that plus scrub are leading to more back and forth iteration on minor interactions with erasure coding.

                So: erasure coding is complete enough to know that it works and for people to torture test it, but yes you shouldn't be running it in production yet (and it's explicitly marked as such). What's remaining is trivial but slightly tedious stuff that's outside the critical path of the rest of the design.

                Some of the code I've been writing for scrub is turning out to also be what we want for reconstruct, so maybe we'll get there sooner rather than later...

            • BSDobelix4 days ago
              >except for a nasty bout of salmonella

              Did the Linux Foundation send you some "free" sushi? ;)

              However keep the good work rolling, super happy about a good, usable and modern Filesystem native to Linux.

            • pdimitar4 days ago
              FYI: the main reason I gave up on bcachefs is that I can't use devices with native 16K blocks.

              Hope that's coming this year. I have a bunch of old HDDs and SSDs and I could very easily assemble a spare storage server with about 4TB capacity. Already tested bcachefs with most of the drives and it performed very well.

              Also lack of ability to reconstruct seems like another worrying omission.

              • koverstreet4 days ago
                I wasn't aware there were actual users needing bs > ps yet. Cool :)

                That should be completely trivial for bcachefs to support, it'll mostly just be a matter of finding or writing the tests.

                • pdimitar3 days ago
                  Seriously? But... NVMe drives! I stopped testing because I only have one spare NVMe and couldn't use it with bcachefs.

                  If you or others can get it done I'm absolutely starting to use bcachefs the month after. I do need fast storage servers in my home office.

                  • ryao3 days ago
                    You can do this on ZFS today with `zpool create -o ashift=14 ...`.
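For anyone wondering where the 14 comes from: `ashift` is the base-2 logarithm of the sector size the pool should assume, so 16K blocks mean 2^14. A minimal sketch of that arithmetic; `ashift_for` is a made-up helper name for illustration, not part of any ZFS tool.

```python
def ashift_for(sector_bytes: int) -> int:
    """Return log2(sector_bytes), rejecting sizes that are not powers of two."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


# 512-byte, 4K and 16K sectors map to ashift 9, 12 and 14 respectively.
for size in (512, 4096, 16384):
    print(size, "->", ashift_for(size))
```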
                    • pdimitar3 days ago
                      Yeah I know, thanks. But ZFS still mostly requires drives with the same sizes. My main NAS is like that but I can't expand it even though I want to, with drives of different sizes I have lying around, and I am not keen on spending for new HDDs right now. So I thought I'll make a secondary NAS with bcachefs and all the spare drives I have.

                      As for ZFS, I'll be buying some extra drives later this year and will make use of direct_io so I can use another NVMe spare for faster access.

                      • ryao3 days ago
                        If you don’t care about redundancy, you could add all of them as top level vdevs and then ZFS will happily use all of the space on them until one fails. Performance should be great until there is a failure. Just have good backups.
                        • pdimitar9 hours ago
                          Yep, that's sadly my current setup. Most of my data are not super critical.

                          When I can spend some $3000 or so I'll absolutely buy several 20 TB drives and just nail the whole thing -- and will use ZFS -- but for now the several spare HDDs that I want dedicated to my data are set up exactly as you mentioned: root vdevs with no redundancy. ZFS is mostly handling it fine even though the drives have vastly different speeds (and one of them is actually an SSD).

                          So yep ZFS can still do quite a lot, it's just still not flexible enough in a manner that f.ex. bcachefs is. But the latter is still missing important features so I am sticking with ZFS for a while still.

            • mafuy4 days ago
              Thank you, looking forward to it!
    • kwanbix4 days ago
      Honest question. As an end user that uses Windows and Linux and does not use ZFS, what am I missing?
      • poisonborz4 days ago
        Way better data security, resilience against file rotting. This goes for both HDDs or SSDs. Copy-on-write, snapshots, end to end integrity. Also easier to extend the storage for safety/drive failure (and SSDs corrupt in a more sneaky way) with pools.
        • wil4214 days ago
          How many of us are using single disks on our laptops? I have a NAS and use all of the above but that doesn’t help people with single drive systems. Or help me understand why I would want it on my laptop.
          • ryao4 days ago
            My thinkpad from college uses ZFS as its rootfs. The benefits are:

              * If the hard drive / SSD corrupted blocks, the corruption would be identified.
              * Ditto blocks allow for self healing. Usually, this only applies to metadata, but if you set copies=2, you can get this on data too. It is a poor man’s RAID.
              * ARC made the desktop environment very responsive since unlike the LRU cache, ARC resists cold cache effects from transient IO workloads.
              * Transparent compression allowed me to store more on the laptop than otherwise possible.
              * Snapshots and rollback allowed me to do risky experiments and undo them as if nothing happened.
              * Backups were easy via send/receive of snapshots.
              * If the battery dies while you are doing things, you can boot without any damage to the filesystem.
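To make the ARC point in the list above concrete, here is a toy comparison of a plain LRU cache with a two-segment cache that only protects keys seen more than once. This is a simplified segmented LRU, not the real ARC algorithm (which also keeps ghost lists and adapts the split between segments); the class and function names are invented for the illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Plain least-recently-used cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        hit = key in self.items
        if hit:
            self.items.move_to_end(key)
        else:
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict the oldest entry
        return hit


class SegmentedCache:
    """Keys seen once sit in 'probation'; a second access promotes them."""

    def __init__(self, capacity):
        self.seg_cap = capacity // 2
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def access(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return True
        if key in self.probation:
            del self.probation[key]
            self.protected[key] = True
            if len(self.protected) > self.seg_cap:
                # Demote the oldest protected key instead of dropping it.
                demoted, _ = self.protected.popitem(last=False)
                self.probation[demoted] = True
            hit = True
        else:
            self.probation[key] = True
            hit = False
        if len(self.probation) > self.seg_cap:
            self.probation.popitem(last=False)
        return hit


def hot_hits_after_scan(cache):
    """Warm four hot keys, run a one-off scan, then count surviving hot keys."""
    hot = ["a", "b", "c", "d"]
    for _ in range(2):
        for key in hot:
            cache.access(key)
    for i in range(100):  # transient workload, e.g. a backup or a big copy
        cache.access(f"scan-{i}")
    return sum(cache.access(key) for key in hot)


print("LRU hot hits:", hot_hits_after_scan(LRUCache(8)))
print("Segmented hot hits:", hot_hits_after_scan(SegmentedCache(8)))
```

The scan only churns the probation segment, so the frequently used keys survive it, which is the "cold cache" effect the plain LRU suffers from.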
            
            That said, I use a MacBook these days when I need to go outside. While I miss ZFS on it, I have not felt motivated to try to get a ZFS rootfs on it since the last I checked, Apple hardcoded the assumption that the rootfs is one of its own filesystems into the XNU kernel and other parts of the system.
            • rabf4 days ago
              Not ever having to deal with partitions and instead using data sets each of which can have their own properties such as compression, size quota, encryption etc is another benefit. Also using zfsbootmenu instead of grub enables booting from different datasets or snapshots as well as mounting and fixing data sets all from the bootloader!
              • artificialLimbs3 days ago
                Alright that's a bit mind blowing. TIL. Thank you. =)
            • CoolCold4 days ago
              NTFS has had compression since I'm not even sure when.

              For other stuff, let that nerdy CorpIT handle your system.

              • ryao4 days ago
                NTFS compression is slow and has a low compression ratio. ZFS has both zstd and lz4.
              • adgjlsfhk14 days ago
                yes but NTFS is bad enough that no one needs to be told how bad it is.
          • yjftsjthsd-h4 days ago
            If the single drive in your laptop corrupts data, you won't know. ZFS can't fix corruption without extra copies, but it's still useful to catch the problem and notify the user.

            Also snapshots are great regardless.

            • Polizeiposaune3 days ago
              In some circumstances it can.

              Every ZFS block pointer has room for 3 disk addresses; by default, the extras are used only for redundant metadata, but they can also be used for user data.

              When you turn on ditto blocks for data (zfs set copies=2 rpool/foo), zfs can fix corruption even on single-drive systems at the cost of using double or triple the space. Note that (like compression), this only affects blocks written after the setting is in place, but (if you can pause writes to the filesystem) you can use zfs send|zfs recv to rewrite all blocks to ensure all blocks are redundant.
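A toy model of why an extra copy turns detection into repair, assuming nothing about ZFS internals: every block is stored `copies` times alongside one checksum, and a read returns the first copy that verifies and rewrites the ones that do not. `DittoStore` is a hypothetical class, and SHA-256 is just a stand-in checksum.

```python
import hashlib


class DittoStore:
    """Keeps `copies` replicas of each block plus a single checksum."""

    def __init__(self, copies=2):
        self.replicas = [dict() for _ in range(copies)]  # block_id -> bytes
        self.checksums = {}                              # block_id -> digest

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for replica in self.replicas:
            replica[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        good = None
        for replica in self.replicas:
            if hashlib.sha256(replica[block_id]).digest() == expected:
                good = replica[block_id]
                break
        if good is None:
            # Corruption is detected, but there is nothing left to heal from.
            raise IOError(f"block {block_id}: all copies fail the checksum")
        for replica in self.replicas:
            if replica[block_id] != good:
                replica[block_id] = good  # self-heal the bad copy
        return good


store = DittoStore(copies=2)
store.write("b0", b"holiday.jpg bytes")
store.replicas[0]["b0"] = b"holiday.jpg bytez"  # simulate on-disk corruption
print(store.read("b0"))                         # served from the good copy
print(store.replicas[0]["b0"])                  # the bad copy was repaired
```

With `copies=1` the same corruption raises the `IOError` instead, which is the single-drive "detect but can't fix" case from the parent comment.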

          • ekianjo4 days ago
            It provides encryption by default without having to deal with LUKS. And no need to ever do fsck again.
            • Twey4 days ago
              Except that swap on OpenZFS still deadlocks 7 years later (https://github.com/openzfs/zfs/issues/7734) so you're still going to need LUKS for your swap anyway.
              • ryao4 days ago
                Another option is to go without swap. I avoid swap on my machines unless I want hibernation support.
        • jeroenhd4 days ago
          The data security and rot resilience only goes for systems with ECC memory. Correct data with a faulty checksum will be treated the same as incorrect data with a correct checksum.

          Windows has its own extended filesystem through Storage Spaces, with many ZFS features added as lesser used Storage Spaces options, especially when combined with ReFS.

          • _factor4 days ago
            This has nothing to do with ZFS as a filesystem. It has integrity verification on duplicated raid configurations. If the system memory flips a bit, it will get written to disk like all filesystems. If a bit flips on a disk, however, it can be detected and repaired. Without ECC, your source of truth can corrupt, but this is true of any system.
          • abrookewood4 days ago
            Please stop repeating this, it is incorrect. ECC helps with any system, but it isn't necessary for ZFS checksums to work.
          • BSDobelix4 days ago
            On zfs there is the ARC (adaptive replacement cache), on non-zfs systems this "read cache" is called buffer, both reside in memory, so ECC is equally important for both systems.

            Rot means changing bits without accessing those bits, and that's ~not possible with zfs, additionally you can enable check-summing IN the ARC (disabled by default), and with that you can say that ECC and "enterprise" quality hardware is even more important for non-ZFS systems.

            >Correct data with a faulty checksum will be treated the same as incorrect data with a correct checksum.

            There is no such thing as "correct" data, only a block with a correct checksum, if the checksum is not correct, the block is not ok.

          • mrb4 days ago
            "data security and rot resilience only goes for systems with ECC memory."

            No. Bad HDDs/SSDs or bad SATA cables/ports cause a lot more data corruption than bad RAM. And ZFS will correct these cases even without ECC memory. It's a myth that the data healing properties of ZFS are useless without ECC memory.

            • elseless3 days ago
              Precisely this. And don’t forget about bugs in virtualization layers/drivers — ZFS can very often save your data in those cases, too.
              • ryao3 days ago
                I once managed to use ZFS to detect a bit flip on a machine that did not have ECC RAM. All python programs started crashing in libpython.so on my old desktop one day. I thought it was a bug in ZFS, so I started debugging. I compared the in-memory buffer from ARC with the on-disk buffer for libpython.so and found a bit flip. At the time, accessing a snapshot through .zfs would duplicate the buffer in ARC, which made it really easy to compare the in-memory buffer against the on-disk buffer. I was in shock as I did not expect to ever see one in person. Since then, I always insist on my computers having ECC.
      • johannes12343214 days ago
        For a while I ran Open Solaris with ZFS as root filesystem.

        The key feature for me, which I miss, is the snapshotting integrated into the package manager.

        ZFS allows snapshots more or less for free (due to copy on write) including cron based snapshotting every 15 minutes. So if I made a mistake anywhere there was a way to recover.

        And that integrated with the update manager and boot manager means that on an update a snapshot is created and during boot one can switch between states. Never had a broken update, but gave a good feeling.

        On my home server I like the raid features and on Solaris it was nicely integrated with NFS etc so that one can easily create volumes and export them and set restrictions (max size etc.) on it.

        • attentive3 days ago
          > is the snapshotting integrated into the package manager.

          some linux distros have that by default with btrfs. And usually it's a package install away if you're already on btrfs.

      • chillfox4 days ago
        Much faster launch of applications/files you use regularly. Ability to always rollback updates in seconds if they cause issues thanks to snapshots. Fast backups with snapshots + zfs send/receive to a remote machine. Compressed disks, this both lets you store more on a drive and makes accessing files faster. Easy encryption. Ability to mirror 2 large usb disks so you never have your data corrupted or lose it from drive failures. Can move your data or entire os install to a new computer easily by using a live disk and just doing a send/receive to the new pc.

        (I have never used dedup, but it's there if you want I guess)

      • hoherd3 days ago
        Online filesystem checking and repair.

        Reading any file will tell you with 100% guarantee if it is corrupt or not.

        Snapshots that you can `cd` into, so you can compare any prior version of your FS with the live version of your FS.

        Block level compression.
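Because snapshots show up as ordinary read-only directories under `.zfs/snapshot/<name>` at the dataset's mountpoint, comparing one against the live filesystem needs nothing ZFS-specific. A rough sketch using only the standard library; `changed_since_snapshot` is a hypothetical helper, and it only looks at the top-level directory rather than recursing.

```python
import filecmp
import os


def changed_since_snapshot(mountpoint, snapshot):
    """Compare the top level of a dataset with one of its snapshots."""
    snap_root = os.path.join(mountpoint, ".zfs", "snapshot", snapshot)
    # Ignore the .zfs control directory in case it is visible in listings.
    listing = filecmp.dircmp(mountpoint, snap_root, ignore=[".zfs"])
    # shallow=False forces a byte-for-byte comparison of files on both sides.
    _, mismatch, errors = filecmp.cmpfiles(
        mountpoint, snap_root, listing.common_files, shallow=False
    )
    return {
        "modified": sorted(mismatch + errors),
        "added": sorted(listing.left_only),     # only in the live filesystem
        "deleted": sorted(listing.right_only),  # only in the snapshot
    }
```

Calling it as `changed_since_snapshot("/tank/home", "before-upgrade")` would use a made-up dataset path and snapshot name; substitute your own.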

        • snvzz3 days ago
          >Reading any file will tell you with 100% guarantee if it is corrupt or not.

          Only possible if it was not corrupted in RAM before it was written to disk.

          Using ECC memory is important, irrespective of ZFS.
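A small sketch of the distinction being argued here, with SHA-256 standing in for whatever checksum the filesystem uses and all function names invented for the example: a bit that flips on disk after the write fails verification, while a bit that flips in RAM before the checksum is computed gets checksummed along with everything else and verifies cleanly.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def write_block(buffer: bytes):
    """Checksum whatever is in memory at write time and 'store' both."""
    return buffer, checksum(buffer)


def read_ok(stored: bytes, stored_sum: bytes) -> bool:
    return checksum(stored) == stored_sum


original = b"family photo bytes"

# Case 1: clean write, then the medium flips a bit. Verification fails.
on_disk, csum = write_block(original)
disk_rot_detected = not read_ok(flip_bit(on_disk, 3), csum)

# Case 2: RAM flips a bit before the write. The checksum covers the bad data.
on_disk2, csum2 = write_block(flip_bit(original, 3))
ram_flip_detected = not read_ok(on_disk2, csum2)

print(disk_rot_detected, ram_flip_detected)  # True False
```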

      • e12e4 days ago
        Cross platform native encryption with sane fs for removable media.
        • lazide4 days ago
          Who would that help?

          MacOS also defaults to a non-portable FS for likely similar reasons, if one was being cynical.

          • e12e2 days ago
            It would help users using USB sticks, external drives?

            Couple it with encrypted zfs send/receive for cross platform secure backups.

            • lazide2 days ago
              I meant, why would they prioritize cross platform when it doesn’t help them?
      • madeofpalk4 days ago
        I'm missing file clones/copy-on-write.
      • wkat42424 days ago
        Snapshots (Note: NTFS does have this in the way of Volume Shadow Copy but it's not as easily accessible as a feature to the end user as it is in ZFS). Copy on Write for reliability under crashes. Block checksumming for data protection (bitrot)
    • zamadatix4 days ago
      NTFS was able to be extended in various ways over the years to the point that what you could do with an NTFS drive 32 years ago will feel like talking about a completely different filesystem than what you can do with it on current Windows.

      Honestly I really like ReFS, particularly in context of storage spaces, but I don't think it's relevant to Microsoft's consumer desktop OS where users don't have 6 drives they need to pool together. Don't get me wrong, I use ZFS because that's what I can get running on a Linux server and I'm not going to go run Windows Server just for the storage pooling... but ReFS + Storage Spaces wins my heart with the 256 MB slab approach. This means you can add+remove mixed sized drives and get the maximum space utilization for the parity settings of the pool. Here ZFS is still getting to online adds of same or larger drives 10 years later.
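To illustrate why small fixed-size slabs help with mixed drive sizes, here is a toy allocator. It assumes a plain two-way mirror where each slab pair must land on two different drives, and it greedily picks the two drives with the most free slabs. This is only a sketch of the general idea, not how Storage Spaces actually places data, and `mirrored_slabs` is an invented name.

```python
import heapq

SLAB_MB = 256  # slab size mentioned above


def mirrored_slabs(drive_sizes_mb):
    """Count slab pairs that fit when each pair spans two distinct drives."""
    # Max-heap (via negation) of free slab counts per drive.
    free = [-(size // SLAB_MB) for size in drive_sizes_mb if size >= SLAB_MB]
    heapq.heapify(free)
    placed = 0
    while len(free) >= 2:
        a = -heapq.heappop(free)
        b = -heapq.heappop(free)
        placed += 1
        if a > 1:
            heapq.heappush(free, -(a - 1))
        if b > 1:
            heapq.heappush(free, -(b - 1))
    return placed


# One 1 GB drive and two 512 MB drives: every slab ends up mirrored.
pairs = mirrored_slabs([1024, 512, 512])
print(pairs, "pairs =", pairs * SLAB_MB, "MB usable")
```

With drives of 4, 2 and 2 slabs, all 8 slabs are used for 4 mirrored pairs. Space is only stranded when one drive is larger than all the others combined.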

    • nickdothutton3 days ago
      OS development pretty much stopped around 2000. ZFS is from 2001. I don't count a new way to organise my photos or integrate with a search engine as "OS" though.
    • MauritsVB4 days ago
      There is occasional talk of moving the Windows implementation of OpenZFS (https://github.com/openzfsonwindows/openzfs/releases) into an officially supported tier, though that will probably come after the MacOS version (https://github.com/openzfsonosx) is officially supported.
    • doctorpangloss3 days ago
      The same reason file deduplication is not enabled for client Windows: greed.

      For example, there are numerous new file systems people use: OneDrive, Google Drive, iCloud Storage. Do you get it?

    • ryao4 days ago
      What do you mean by a ZFS compatibility layer? There is a Windows port:

      https://github.com/openzfsonwindows/openzfs

      Note that it is a beta.

    • badgersnake4 days ago
      NTFS is good enough for most people, who have a laptop with one SSD in it.
      • wkat42424 days ago
        The benefits of ZFS don't need multiple drives to be useful. I've been running ZFS on root for years now and snapshots have saved my bacon several times. Also with block checksums you can at least detect bitrot. And COW is always useful.
        • zamadatix4 days ago
          Windows manages volume snapshots on NTFS through VSS. I think ZFS snapshots are a bit "cleaner" of a design, and the tooling is a bit friendlier IMO, but the functionality to snapshot, rollback, and save your bacon is there regardless. Outside of the automatically enabled "System Restore" (which only uses VSS to snapshot specific system files during updates) I don't think anyone bothers to use it though.

          CoW, advanced parity, and checksumming are the big ones NTFS lacks. CoW is just inherently not how NTFS is designed and checksumming isn't there. Anything else (encryption, compression, snapshots, ACLs, large scale, virtual devices, basic parity) is done through NTFS on Windows.

          • wkat42423 days ago
            Yes I know that NTFS has snapshots, I mentioned that in another comment. I don't think NTFS is as relevant in comparison though. People who choose windows will have no interest in ZFS and vice versa (someone considering ZFS will not pick Windows).

            And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it were as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.

            VSS is also used by backup software to quiet the filesystem by the way.

            But yeah the others are great features. My main point was though that almost all the features of ZFS are very beneficial even on a single drive. You don't need an array to take advantage of Snapshots, the crash reliability that CoW offers, and checksumming (though you will lack the repair option obviously)

            • EvanAnderson3 days ago
              > I don't think NTFS is as relevant in comparison though. People who choose windows will have no interest in ZFS and vice versa (someone considering ZFS will not pick Windows).

              ZFS on Windows, as a first-class supported-by-Microsoft option would be killer. It won't ever happen, but it would be great. (NTFS / VSS with filesystem/snapshot send/receive would "scratch" a lot of that "itch", too.)

              > And I don't think anyone bothers to use it due to the lack of user-facing tooling around it. If it would be as easy to create snapshots as it is on ZFS, more people would use it, I'm sure. It's just so amazing to try something out, screw up my system and just revert :P But VSS is more of a system API than a user-facing feature.

              VSS on NTFS is handy and useful but in my experience brittle compared to ZFS snapshots. Sometimes VSS just doesn't work. I've had repeated cases over the years where accessing a snapshot failed (with traditional unhelpful Microsoft error messages) until the host machine was rebooted. Losing VSS snapshots on a volume is much easier than trashing a ZFS volume.

              VSS straddles the filesystem and application layers in a way that ZFS doesn't. I think that contributes to some of the jank (VSS writers becoming "unstable", for example). It also straddles hardware interfaces in a novel way that ZFS doesn't (using hardware snapshot functionality-- somewhat like using a GPU versus "software rendering"). I think that also opens up a lot of opportunity for jank, as compared to ZFS treating storage as dumb blocks.

  • uniqueuid4 days ago
    It's good to see that they were pretty conservative about the expansion.

    Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.

    That said, there is one tiny caveat people should be aware of:

    > After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).

    • chungy4 days ago
      I'm not sure that's really a caveat, it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.
      • crote4 days ago
        I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.

        For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
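
        The 55% vs 66% figures can be reproduced with a rough sketch like the one below. It assumes equal-size disks, full-width stripes only, and a vdev that was completely full before the expansion; the function name and the `fill` parameter are made up for illustration.

```python
from fractions import Fraction

def efficiency_after_expansion(old_width, new_width, parity=1, fill=Fraction(1)):
    """Usable fraction of raw capacity once the expanded vdev is full.

    Capacity is counted in whole disks. `fill` is the share of the old vdev's
    raw space that was already written at the old data:parity ratio.
    """
    old_raw = old_width * fill                 # raw space holding old-ratio blocks
    new_raw = new_width - old_raw              # raw space left for new-ratio blocks
    old_data = old_raw * Fraction(old_width - parity, old_width)
    new_data = new_raw * Fraction(new_width - parity, new_width)
    return (old_data + new_data) / new_width

full = efficiency_after_expansion(2, 3)        # expand a completely full 2-disk raidz1
ideal = Fraction(3 - 1, 3)                     # everything written at the 3-wide ratio
print(f"{float(full):.1%} vs {float(ideal):.1%}")   # 55.6% vs 66.7%
```

        Passing a smaller `fill` (expanding before the vdev is full) moves the result closer to the ideal ratio.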

        • ryao4 days ago
          You have a couple options:

          1. Delete the snapshots and rewrite the files in place like how people do when they want to rebalance a pool.

          2. Use send/receive inside the pool.

          Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
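
          For option 2, a minimal sketch of the command sequence, printed rather than executed. The dataset name `tank/data`, the snapshot name and the `_new` suffix are hypothetical, and this assumes the pool has room for a second copy and that the dataset has no children or older snapshots you want to keep.

```python
import shlex

def send_receive_plan(dataset, snap="relayout"):
    """Return shell command lines that copy `dataset` within its own pool.

    Nothing is executed here; the caller decides what to run and when.
    """
    tmp = f"{dataset}_new"                      # hypothetical temporary name
    src = shlex.quote(f"{dataset}@{snap}")
    return [
        f"zfs snapshot {src}",
        f"zfs send {src} | zfs receive {shlex.quote(tmp)}",
        f"zfs destroy -r {shlex.quote(dataset)}",           # only after verifying the copy
        f"zfs rename {shlex.quote(tmp)} {shlex.quote(dataset)}",
    ]

for line in send_receive_plan("tank/data"):
    print(line)
```

          The received copy is written with the current data-to-parity ratio; the destroy step is the point of no return, so check the copy first.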

          • pdimitar11 hours ago
            Can you give sample commands on how to achieve both options that you gave?
        • bmicraft4 days ago
          Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.

          Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do! you're creating massive fragmentation upwards of ~85%) efficiency in the real world will be better.

        • chungy4 days ago
          It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and create it new, or create it with the ideal layout from the beginning.

          Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.

      • stavros4 days ago
        Is that the case? What if I expand a 3-1 array to 3-2? Won't the old blocks remain 3-1?
        • Timshel4 days ago
          I don't believe it supports adding parity drives, only data drives.
          • stavros4 days ago
            Ahh interesting, thanks.
            • bmicraft4 days ago
              Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.
    • wjdp4 days ago
      Caveat is very much expected, you should expect ZFS features to not rewrite blocks. Changes to settings only apply to new data for example.
    • rekoil4 days ago
      Yeah it's a pretty huge caveat to be honest.

          Da1 Db1 Dc1 Pa1 Pb1
          Da2 Db2 Dc2 Pa2 Pb2
          Da3 Db3 Dc3 Pa3 Pb3
          ___ ___ ___ Pa4 Pb4
      
      ___ represents free space. After expansion by one disk you would logically expect something like:

          Da1 Db1 Dc1 Da2 Pa1 Pb1
          Db2 Dc2 Da3 Db3 Pa2 Pb2
          Dc3 ___ ___ ___ Pa3 Pb3
          ___ ___ ___ ___ Pa4 Pb4
      
      But as I understand it it would actually expand to:

          Da1 Db1 Dc1 Dd1 Pa1 Pb1
          Da2 Db2 Dc2 Dd2 Pa2 Pb2
          Da3 Db3 Dc3 Dd3 Pa3 Pb3
          ___ ___ ___ ___ Pa4 Pb4
      
      Where the Dd1-3 blocks are just wasted. Meaning by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of storage free originally, and you have 4TB free before expansion, you would have 5TB free after expansion.

      Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.

      • ryao4 days ago
        ZFS RAID-Z does not have parity disks. The parity and data is interleaved to allow data reads to be done from all disks rather than just the data disks.

        The slides here explain how it works:

        https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

        Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.

        • chungy3 days ago
          To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.

          I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.

          I've been of two minds on the persistent myth of "parity disks" but I usually ignore it, because it's a convenient lie to understand your data is safe, at least. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.

      • magicalhippo4 days ago
        Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works like you note you would logically expect. You added a drive with four blocks of free space, and you end up with four blocks more of free space afterwards.

        You can see this in the presentation[1] slides[2].

        The reason this is sub-optimal post-expansion is because, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.

        Your example is a bit unfortunate in terms of allocated blocks vs layout, but if we tweak it slightly, then

            Da1 Db1 Dc1 Pa1 Pb1
            Da2 Db2 Dc2 Pa2 Pb2
            Da3 Db3 Pa3 Pb3 ___
        
        would, after RAID-Z expansion, become

            Da1 Db1 Dc1 Pa1 Pb1 Da2
            Db2 Dc2 Pa2 Pb2 Da3 Db3 
            Pa3 Pb3 ___ ___ ___ ___
        
        Ie you added a disk with 3 new blocks, and so total free space after is 1+3 = 4 blocks.

        However if the same data was written in the post-expanded vdev configuration, it would have become

            Da1 Db1 Dc1 Dd1 Pa1 Pb1
            Da2 Db2 Dc2 Dd2 Pa2 Pb2
            ___ ___ ___ ___ ___ ___
        
        Ie, you'd have 6 free blocks not just 4 blocks.

        Of course this doesn't count for writes which end up taking less than the maximal stripe width.
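
        The block counts in the three diagrams above can be checked with a small sketch. It assumes full-width stripes with a partial last stripe and ignores where the parity actually lands; `cells_used` is a made-up helper, not anything from ZFS.

```python
import math

def cells_used(data_blocks, width, parity=2):
    """Data plus parity cells when written in stripes of `width` columns."""
    stripes = math.ceil(data_blocks / (width - parity))
    return data_blocks + parity * stripes

rows, data = 3, 8                              # 3 rows per disk, 8 data blocks
old_used = cells_used(data, width=5)           # written as 3+2
print(rows * 5 - old_used)                     # 1 free before expansion
print(rows * 6 - old_used)                     # 4 free after expansion, old layout kept
print(rows * 6 - cells_used(data, width=6))    # 6 free if rewritten as 4+2
```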

        [1]: https://www.youtube.com/watch?v=tqyNHyq0LYM

        [2]: https://openzfs.org/w/images/5/5e/RAIDZ_Expansion_2023.pdf

        • ryao4 days ago
          Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10 disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there is no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.

          There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
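
          A rough sketch of the arithmetic in the 4K and 72K examples. It treats each file as a single block, ignores padding and compression, and assumes the expansion adds one disk (10-wide to 11-wide), which is my assumption rather than something stated above.

```python
import math

def parity_sectors(file_bytes, width, parity=2, ashift=12):
    """Return (data sectors, parity sectors) for one block of `file_bytes`."""
    sector = 1 << ashift                       # ashift=12 means 4K sectors
    data = math.ceil(file_bytes / sector)
    rows = math.ceil(data / (width - parity))  # each row carries `parity` parity sectors
    return data, parity * rows

print(parity_sectors(4 * 1024, width=10))      # (1, 2)
print(parity_sectors(72 * 1024, width=10))     # (18, 6)
print(parity_sectors(72 * 1024, width=11))     # (18, 4) if rewritten after expansion
```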

          • magicalhippo4 days ago
            What are the errors? I tried to show exactly what you talk about.

            edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.

            The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but when written in the 4+2 setup only needs 2 full stripes, leading to more total free space.

  • cgeier4 days ago
    This is huge news for ZFS users (probably mostly those in the hobbyist/home use space, but still). raidz expansion has been one of the most requested features for years.
    • jfreax4 days ago
      I'm not yet familiar with zfs and couldn't find it in the release notes: Does expansion only work with disks of the same size? Is adding bigger/smaller disks possible, or do all disks need to have the same size?
      • ryao4 days ago
        You can use different sized disks, but RAID-Z will truncate the space it uses to the lowest common denominator. If you increase the lowest common denominator, RAID-Z should auto-expand to use the additional space. All parity RAID technologies truncate members to the lowest common denominator, rather than just ZFS.
        • wrboyce2 days ago
          Is it definitely the LCD? Given drive of size 15 and 20 the LCD would be 1, no? I had assumed it would just use the size of the smallest drive on every drive (so 15+20->15+15=30). When I first read your comment I was thinking of GCF but even that would be fairly inefficient (GCF(15,20) = 5, so 15+20->5+5=10).
        • GauntletWizard3 days ago
          That's not entirely true, Unraid has mechanisms for unbalanced disks, but they come at a high cost in terms of usability by standard workloads.
      • shiroiushi4 days ago
        As far as I understand, ZFS doesn't work at all with disks of differing sizes (in the same array). So if you try it, it just finds the size of the smallest disk, and uses that for all disks. So if you put an 8TB drive in an array with a bunch of 10TB drives, they'll all be treated as 8TB drives, and the extra 2TB will be ignored on those disks.

        However, if you replace the smallest disk with a new, larger drive, and resilver, then it'll now use the new smallest disk as the baseline, and use that extra space on the other drives.

        (Someone please correct me if I'm wrong.)
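
        As a back-of-the-envelope sketch of that rule, assuming raidz1 (the parity level is my assumption) and ignoring metadata and padding overhead:

```python
def raidz_usable_tb(disk_sizes_tb, parity=1):
    """Approximate usable TB: every member is truncated to the smallest disk."""
    per_disk = min(disk_sizes_tb)
    return per_disk * (len(disk_sizes_tb) - parity)

print(raidz_usable_tb([8, 10, 10, 10]))        # 24, the 8 TB disk sets the baseline
print(raidz_usable_tb([12, 10, 10, 10]))       # 30, after swapping the 8 TB disk for a 12 TB one
```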

        • mustache_kimono4 days ago
          > As far as I understand, ZFS doesn't work at all with disks of differing sizes (in the same array).

          This might be misleading, however, it may only be my understanding of word "array".

          You can use 2x10TB mirrors as vdev0, and 6x12TB in RAIDZ2 as vdev1 in the same pool/array. You can also stack as many unevenly sized disks as you want in a pool. The actual problem is when you want a different drive topology within a pool or vdev, or you want to mismatch, say, 3 oddly sized drives to create some synthetic redundancy level (2x4TB and 1x8TB to achieve two copies on two disks) like btrfs does/tries to do.

        • tw044 days ago
          This is the case with any parity based raid, they just hide it or lie to you in various ways. If you have two 6TB drives and two 12TB drives in a single raid-6 array, it is physically impossible to have two drive parity once you exceed 12TB of written capacity. BTRFS and bcachefs can’t magically create more space where none exists on your 6TB drives. They resort to dropping to mirror protection for the excess capacity which you could also do manually with ZFS by giving it partitions instead of the whole drive.
      • chasil4 days ago
        IIRC, you could always replace drives in a raidset with larger devices. When the last drive is replaced, then the new space is recognized.

        This new operation seems somewhat more sophisticated.

      • zelcon4 days ago
        You need to buy the same exact drive with the same capacity and speed. Your raidz vdev will be as small and as slow as your smallest and slowest drive.

        btrfs and the new bcachefs can do RAID with mixed drives, but I can’t trust either of them with my data yet.

        • hda1114 days ago
          It doesn't have to be the same exact drive. Mixing drives from different manufacturers (with the same capacity) is often used to prevent correlated failure. ZFS is not using the whole disk, so different disks can be mixed, because disks often have slightly varying capacity.
        • tw044 days ago
          You can run raid-z across partitions to utilize the full drive just like synology does with their “hybrid raid” - you just shouldn’t.
        • Mashimo4 days ago
          > You need to buy the same exact drive

          AFAIK you can add larger and faster drives, you will just not get any benefits from it.

          • bpye4 days ago
            You can get read speed benefits with faster drives, but your writes will be limited by your slowest.
        • unixhero4 days ago
          Just have backups. I used btrfs and zfs for different purposes. Never had any lost data or downtime with btrfs since 2016. I only use raid 0 and raid 1 and compression. Btrfs does not have a hungry RAM requirement.
          • tw044 days ago
            Neither does zfs, that’s a widely repeated red herring from people trying to do dedup in the very early days, and people who misunderstood how it used ram to do caching.
          • zelcon3 days ago
            Tbh the idea of keeping backups defeats the purpose of using RAIDZ (especially RAIDZ3). I don’t want to buy an LTO drive, so if I backup, it’s either buying more HDDs or S3 Glacier ($$$). I like RAIDZ so I don’t have to buy so many drives. I guess it protects you if your house burns down, but how many people do offsite backups for their personal files? And dormant, unpowered HDDs die a lot faster than live, powered HDDs.
            • unixhero3 days ago
              Yes, seriously handling your data is expensive. I am talking about buying new hard drives.
  • FrostKiwi4 days ago
    FINALLY!

    You can do borderline insane single-vdev setups like RAID-Z3 with 4 disks (3 Disks worth of redundancy) of the most expensive and highest density hard drives money can buy right now, for an initial effective space usage of 25% and then keep buying and expanding Disk by Disk, with the space demand growing, up to something like 12ish disks. Disk prices dropping as time goes on and a spread out failure chance with disks being added at different times.

    • uniqueuid4 days ago
      Yes but see my sibling comment.

      When you expand your array, your existing data will not be stored any more efficiently.

      To get the new parity/data ratios, you would have to force copies of the data and delete the old, inefficient versions, e.g. with something like this [1]

      My personal take is that it's a much better idea to buy individual complete raid-z configurations and add new ones / replace old ones (disk by disk!) as you go.

      [1] https://github.com/markusressel/zfs-inplace-rebalancing
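
      The idea behind that kind of script, as I understand it, is roughly "copy each file, verify the copy, replace the original". A minimal sketch of that idea follows; it is not the linked script's actual code, the `.rebalance` suffix is made up, and it does not preserve ownership or hard links.

```python
import hashlib
import os
import shutil

def _sha256(path, chunk=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

def rewrite_in_place(path, chunk=1 << 20):
    """Copy `path` to a temporary sibling, verify it, then swap it in."""
    tmp = f"{path}.rebalance"                  # hypothetical temporary name
    with open(path, "rb") as src, open(tmp, "wb") as dst:
        while block := src.read(chunk):        # plain read/write so new blocks get written
            dst.write(block)
        dst.flush()
        os.fsync(dst.fileno())
    if _sha256(tmp) != _sha256(path):
        os.remove(tmp)
        raise OSError(f"copy of {path} does not match the original")
    shutil.copystat(path, tmp)                 # keep mode and timestamps
    os.replace(tmp, path)                      # atomic rename over the original
```

      Any snapshot that still references the old blocks keeps them allocated, so the space only comes back once those snapshots are gone.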

  • shepherdjerred3 days ago
    How does ZFS compare to btrfs? I'm currently using btrfs for my home server, but I've had some strange troubles with it. I'm thinking about switching to ZFS, but I don't want to end up in the same situation.
    • ryao3 days ago
      I first tried btrfs 15 years ago with Linux 2.6.33-rc4 if I recall. It developed an unlinkable file within 3 days, so I stopped using it. Later, I found ZFS. It had a few less significant problems, but I was a CS student at the time and I thought I could fix them since they seemed minor in comparison to the issue I had with btrfs, so over the next 18 months, I solved all of the problems that it had that bothered me and sent the patches to be included in the then ZFSOnLinux repository. My effort helped make it production ready on Linux. I have used ZFS ever since and it has worked well for me.

      If btrfs had been in better shape, I would have been a btrfs contributor. Unfortunately for btrfs, it not only was in bad shape back then, but other btrfs issues continued to bite me every time I tried it over the years for anything serious (e.g. frequent ENOSPC errors when there is still space). ZFS on the other hand just works. Myself and many others did a great deal of work to ensure it works well.

      The main reason for the difference is that ZFS had a very solid foundation, which was achieved by having some fantastic regression testing facilities. It has a userland version that randomly exercises the code to find bugs before they occur in production and a test suite that is run on every proposed change to help shake out bugs.

      ZFS also has more people reviewing proposed changes than other filesystems. The Btrfs developers will often state that there is a significant man power difference between the two file systems. I vaguely recall them claiming the difference was a factor of 6.

      Anyway, few people who use ZFS regret it, so I think you will find you like it too.

    • zie3 days ago
      ZFS has been in production use for almost 20 years now. BTRFS is not fully fit for production, according to BTRFS: https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...

      Some simple use-cases are arguably production ready with BTRFS, YMMV.

    • parshimers3 days ago
      btrfs has similar aims to ZFS, but is far less mature. i used it for my root partitions due to it not needing DKMS, but had many troubles. i used it in a fairly simple way, just a mirror. one day, one of the drives in the array started to have issues, and btrfs fell on its face. it remounted everything read-only if i remember correctly, and would not run in degraded mode by default. even mdraid would do better than this without checksumming and so forth. ZFS likewise says that the array is faulted, but of course allows it to be used. the fact the default behavior was not RAID, because it's literally missing the R part for reading the data back, made me lose any faith in it. i moved to ZFS and haven't had issues since. there is much more of a community and lots of good tooling around it.
    • nnadams3 days ago
      I used Btrfs for a few years but switched away a couple years ago. I also had one or two incidents with Btrfs where some weirdness happened, but I was able to recover everything in the end. Overall I liked the flexibility of Btrfs, but mostly I found it too slow.

      I use ZFS on Arch Linux and overall have had no problems with it so far. There's more customization and methods to optimize performance. My one suggestion is to do a lot of research and testing with ZFS. There is a bit of a learning curve, but it's been worth the switch for me.

  • jakedata4 days ago
    Happy to see the ARC bypass for NVMe performance. ZFS really fails to exploit NVMe's potential. Online expansion might be interesting. I tried to use ZFS for some very busy databases and ended up getting bitten badly by the fragmentation bug. The only way to restore performance appears to be copying the data off the volume, nuking it and then copying it back. Now -perhaps- if I expand the zpool then I might be able to reduce fragmentation by copying the tablespace on the same volume.
  • wkat42424 days ago
    Note: This is online expansion. Expansion was always possible but you did need to take the array down to do it. You could also move to bigger drives but you also had to do that one at a time (and only gain the new capacity once all drives were upgraded of course)

    As far as I know shrinking a pool is still not possible though. So if you have a pool with 5 drives and add a 6th, you can't go back to 5 drives even if there is very little data in it.

  • averageRoyalty4 days ago
    Worth noting that TrueNAS already supports this[0] (I'm assuming using 2.3.0rc3?). Not sure about the stability, but very exciting.

    [0] https://www.truenas.com/blog/electric-eel-openzfs-23/

  • endorphine4 days ago
    Can someone describe why they would use ZFS (or similar) for home usage?
    • mrighele4 days ago
      Good reasons for me:

      Checksums: this is even more important in home usage as the hardware is usually of lower quality. Faulty controllers, crappy cables, hard disks stored in a higher than advised temperature... many reasons for bogus data to be saved, and zfs handles that well and automatically (if you have redundancy)

      Snapshots: very useful to make backups and quickly go back to an older version of a file when mistakes are made

      Ease of mind: compared to the alternatives, I find that zfs is easier to use and makes it harder to make a mistake that could bring data loss (e.g. remove by mistake the wrong drive when replacing a faulty one, pool becomes unusable, "oops!", put the disk back, pool goes back to work as nothing happened). Maybe it is different now with mdadm, but when I used it years ago I was always worried about making a destructive mistake.

      • EvanAnderson3 days ago
        > Snapshots: very useful to make backups and quickly go back to an older version of a file when mistakes are made

        Piling on here: Sending snapshots to remote machines (or removable drives) is very easy. That makes snapshots viable as a backup mechanism (because they can exist off-site and offline).

    • ryao4 days ago
      To give an answer that nobody else has given, ZFS is great for storing Steam games. Set recordsize=1M and compression=zstd and you can often store about 33% more games in the same space.

      A friend uses ZFS to store his Steam games on a couple of hard drives. He gave ZFS a SSD to use as L2ARC. ZFS automatically caches the games he likes to run on the SSD so that they load quickly. If he changes which games he likes to run, ZFS will automatically adapt to cache those on the SSD instead.

      • chillfox4 days ago
        The compression and ARC will make games load much faster than they would on NTFS even without having a separate drive for the ARC.
      • bmicraft3 days ago
        As I understand, L2ARC doesn't work across reboots which unfortunately makes it almost useless for systems that get rebooted regularly, like desktops.
        • olavgg3 days ago
          L2ARC has had persistence support for a few years now.
          • bmicraft2 days ago
            Wow thanks for pointing that out, apparently it's been around for four years, since the first 2.0 release, without me noticing.
    • chromakode4 days ago
      I replicate my entire filesystem to a local NAS every 10 minutes using zrepl. This has already saved my bacon once when a WD_BLACK SN850 suddenly died on me [1]. It's also recovered code from some classic git blunders. It shouldn't be possible any more to lose data to user error or single device failure. We have the technology.

      [1]: https://chromakode.com/post/zfs-recovery-with-zrepl/

    • vedranm4 days ago
      Several reasons, but major ones (for me) are reliability (checksums and self-healing) and portability (no other modern filesystem can be read and written on Linux, FreeBSD, Windows, and macOS).

      Snapshots ("boot environments") are also supported by Btrfs (my Linux installations use that so I don't have to worry about having the 3rd party kernel module to read my rootfs). Performance isn't that great either and, assuming Linux, XFS is a better choice if that is your main concern.

    • Mashimo4 days ago
      It's relatively easy, and yet powerful. Before that I had MDADM + LVM + dm-crypt + ext4, which also worked but all the layers got me into a headache.

      Automated snapshots are super easy and fast. Also easy to access if you deleted a file, you don't have to restore the whole snapshot, you can just cp from the hidden .zfs/ folder.
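
      A small sketch of that single-file restore. The `.zfs/snapshot/<name>/` layout is the standard one; the mountpoint and snapshot names in the usage example are made up.

```python
from pathlib import Path
import shutil

def snapshot_copy_of(mountpoint, snapshot, relative_path):
    """Path of a file as it existed in `snapshot` of the dataset mounted at `mountpoint`."""
    return Path(mountpoint) / ".zfs" / "snapshot" / snapshot / relative_path

def restore_file(mountpoint, snapshot, relative_path):
    """Copy one file out of a snapshot back into the live dataset."""
    src = snapshot_copy_of(mountpoint, snapshot, relative_path)
    dst = Path(mountpoint) / relative_path
    shutil.copy2(src, dst)                     # copies data plus mode and timestamps
    return dst
```

      For example, `restore_file("/tank/home", "daily-1", "notes.txt")` would copy `/tank/home/.zfs/snapshot/daily-1/notes.txt` over the live `notes.txt`.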

      I've been running it on 6x 8TB disks for a couple of years now. I run it in a raidz2, which means up to 2 disks can die. Would I use it on a single disk on a Desktop? Probably not.

      • redundantly4 days ago
        > Would I use it on a single disk on a Desktop? Probably not.

        I do. Snapshots and replication and checksumming are awesome.

    • PaulKeeble4 days ago
      I have a home built NAS that uses ZFS for the storage array and the checksumming has been really quite useful in detecting and correcting bit rot. In the past I used MDADM and EXT over the top and that worked but it didn't defend against bit rot. I have considered BTRFS since it would get me the same checksumming without the rest of ZFS but it's not considered reliable for systems with parity yet (although I think it is likely more than reliable enough now).

      I do occasionally use snapshots and the compression feature is handy on quite a lot of my data set but I don't use the user and group limitations or remote send and receive etc. ZFS does a lot more than I need but it also works really well and I wouldn't move away from a checksumming filesystem now.

    • lutorm4 days ago
      Apart from just peace of mind from bitrot, I use it for the snapshotting capability which makes it super easy to do backups. You can snapshot and send the snapshots to other storage with e.g zfs-autobackup and it's trivial and you can't screw it up. If the snapshots exist on the other drive, you know you have a backup.
    • mshroyer4 days ago
      I use it on a NAS for:

      - Confidence in my long-term storage of some data I care about, as zpool scrub protects against bit rot

      - Cheap snapshots that provide both easy checkpoints for work saved to my network share, and resilience against ransomware attacks against my other computers' backups to my NAS

      - Easy and efficient (zfs send) replication to external hard drives for storage pool backup

      - Built-in and ergonomic encryption

      And it's really pretty easy to use. I started with FreeNAS (now TrueNAS), but eventually switched to just running FreeBSD + ZFS + Samba on my file server because it's not that complicated.

    • klauserc4 days ago
      I use it on my work laptop. Reasons:

      - a single solution that covers the entire storage domain (I don't have to learn multiple layers, like logical volume manager vs. ext4 vs. physical partitions)

      - cheap/free snapshots. I have been glad to have been able to revert individual files or entire file systems to an earlier state. E.g., create a snapshot before doing a major distro update.

      - easy to configure/well documented

      Like others have said, at this point I would need a good reason NOT to use ZFS on a system.

    • NamTaf4 days ago
      I used it on my home NAS (4x3TB drives, holding all of my family's backups, etc.) for the data security / checksumming features. IMO it's performant, robust and well-designed in ways that give me reassurance regarding data integrity and help prevent me shooting myself in the foot.
    • tbrownaw4 days ago
      > describe why they would use ZFS (or similar) for home usage

      Mostly because it's there, but also the snapshots have a `diff` feature that's occasionally useful.

    • nesarkvechnep4 days ago
      I'm trying to find a reason not to use ZFS at home.
      • dizhn4 days ago
        Requirement for enterprise quality disks, huge RAM (1 gig per TB), ECC, at least x5 disks of redundancy. None of these are things, but people will try to educate you anyway. So use it but keep it to yourself. :)
        • craftkiller4 days ago
          No need to keep it to yourself. As you've mentioned, all of these requirements are misinformation so you can ignore people who repeat them (or even better, tell them to stop spreading misinformation).

          For those not in the know:

          You don't need to use enterprise quality disks. There is nothing in the ZFS design that requires enterprise quality disks any more than any other file system. In fact, ZFS has saved my data through multiple consumer-grade HDD failures over the years thanks to raidz.

          The 1 gig per TB figure is ONLY for when using the ZFS dedup feature, which is widely regarded as a bad idea except in VERY specific use cases. 99.9% of ZFS users should not and will not use dedup and therefore they do not need ridiculous piles of RAM.

          There is nothing in the design of ZFS any more dangerous to run without ECC than any other filesystem. ECC is a good idea regardless of filesystem but it's certainly not a requirement.

          And you don't need x5 disks of redundancy. It runs great and has benefits even on single-disk systems like laptops. Naturally, having parity drives is better in case a drive fails but on single disk systems you still benefit from the checksumming, snapshotting, boot environments, transparent compression, incremental zfs send/recv, and cross-platform native encryption.

          • JZerf3 days ago
            One reason why it might be a good idea to use higher quality drives when using ZFS is because it seems like in some scenarios ZFS can result in more writes being done to the drive than when other file systems are used. This can be a problem for some QLC and TLC drives that have low endurance.

            I'm in the process of setting up a server at home and was testing a few different file systems. I was doing a test where I had a program continuously synchronously writing just a single byte every second (like might happen for some programs that are writing logs fairly continuously). For most of my tests I was just using the default settings for each file system. When using ext4 this resulted in 28 KB/s of actual writes being done to the drive which seems reasonable due to 4 KB blocks needing to be written, journaling, writing metadata, etc... BTRFS generated 68 KB/s of actual writes which still isn't too bad. When using ZFS about the best I could get it to do after trying various settings for volblocksize, ashift, logbias, atime, and compression settings still resulted in 312 KB/s of actual writes being done to the drive which I was not pleased with. At the rate ZFS was writing data, over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive which is about a quarter of what my SSD is rated for.
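
            For anyone who wants to check the endurance arithmetic, here is a rough sketch. It assumes the quoted rates are decimal KB/s sustained around the clock, and it infers a rating of about 400 TBW from "a quarter of what my SSD is rated for"; that rating is an assumption, not a number from a datasheet.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Device-level write rates reported in the test above, in KB/s (decimal KB assumed).
RATES_KB_S = {"ext4": 28, "btrfs": 68, "zfs": 312}

# Assumption: 100 TB being "about a quarter" of the rating implies roughly 400 TBW.
RATED_TBW = 400


def terabytes_written(rate_kb_s: float, years: float) -> float:
    """Total data written, in decimal TB, at a constant rate over the given years."""
    return rate_kb_s * 1e3 * SECONDS_PER_YEAR * years / 1e12


def years_to_rated_endurance(rate_kb_s: float, rated_tbw: float) -> float:
    """Years of continuous writing at this rate before the rated TBW is reached."""
    return rated_tbw / terabytes_written(rate_kb_s, 1)


for fs, rate in RATES_KB_S.items():
    tb = terabytes_written(rate, 10)
    years = years_to_rated_endurance(rate, RATED_TBW)
    print(f"{fs}: {tb:.1f} TB in 10 years, {years:.0f} years to reach {RATED_TBW} TBW")
```

            The ZFS figure lands just under 100 TB per decade, which matches the estimate above; the single-byte test alone would take decades to wear out such a drive, so the concern is really about how the overhead scales with the rest of the workload.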

            • craftkiller2 days ago
              One knob you could change that should radically alter that is zfs_txg_timeout which is how many seconds ZFS will accumulate writes before flushing them out to disk. The default is 5 seconds, but I usually increase mine to 20. When writing a lot of data, it'll get flushed to disk more often, so this timer is only for when you're writing small amounts of data like the test you just described.
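
              To see what the knob is currently set to before changing it, something like this could work. It is only a sketch: the path is where OpenZFS on Linux exposes module parameters, `read_txg_timeout` is a hypothetical helper name, and on FreeBSD the equivalent setting is the sysctl vfs.zfs.txg.timeout, which this code does not read.

```python
from pathlib import Path
from typing import Optional

# Linux-only location of the OpenZFS module parameter (assumes the zfs module is loaded).
TXG_TIMEOUT_PATH = Path("/sys/module/zfs/parameters/zfs_txg_timeout")


def read_txg_timeout(path: Path = TXG_TIMEOUT_PATH) -> Optional[int]:
    """Return the transaction group timeout in seconds, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        # ZFS not loaded, not Linux, insufficient permissions, or unexpected content.
        return None


timeout = read_txg_timeout()
if timeout is None:
    print("zfs_txg_timeout not readable on this system")
else:
    print(f"zfs_txg_timeout = {timeout} seconds")
```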

              > like might happen for some programs that are writing logs fairly continuously

              On Linux, I think journald would be aggregating your logs from multiple services so at least you wouldn't be incurring that cost on a per-program basis. On FreeBSD with syslog we're doomed to separate log files.

              > over a 10 year span that same program running continuously would result in about 100 TB of writes being done to the drive which is about a quarter of what my SSD is rated for

              I sure hope I've upgraded SSDs by the year 2065.

              • JZerf2 days ago
                > One knob you could change that should radically alter that is zfs_txg_timeout which is how many seconds ZFS will accumulate writes before flushing them out to disk.

                I don't believe that zfs_txg_timeout setting would make much of a difference for the test I described where I was doing synchronous writes.

                > On Linux, I think journald would be aggregating your logs from multiple services so at least you wouldn't be incurring that cost on a per-program basis.

                The server I'm setting up will be hosting several VMs running a mix of OSes and distros and running many types of services and apps. Some of the logging could be aggregated but there will be multiple types of I/O (various types of databases, app updates, file server, etc...) and I wanted to get an idea of how much file system overhead there might be in a worst case kind of scenario.

                > I sure hope I've upgraded SSDs by the year 2065.

                Since I'll be running a lot of stuff on the server, I'll probably have quite a bit more writing going on than the test I described so if I used ZFS I believe the SSD could reach its rated endurance in just several years.

              • dizhn2 days ago
                > I sure hope I've upgraded SSDs by the year 2065.

                My mind jumped at that too when I first read parent's comment. But presumably he's writing other files to disk too. Not just that one file. :)

                • JZerf2 days ago
                  > But presumably he's writing other files to disk too. Not just that one file.

                  Yes, there will be much more going on than the simple test I was doing. The server will be hosting several VMs running a mix of OSes and distros and running many types of services and apps.

          • bbatha3 days ago
            > The 1 gig per TB figure is ONLY for when using the ZFS dedup feature, which is widely regarded as a bad idea except in VERY specific use cases. 99.9% of ZFS users should not and will not use dedup and therefore they do not need ridiculous piles of RAM.

            Even with dedup you don't really need 1GB of RAM per TB unless you have a very high write volume. YMMV, but my experience is that it's closer to 1GB per 10TB.

        • tpetry4 days ago
          The interesting part about the enterprise quality disk misinformation is how wrong it is. The core idea of ZFS was to detect issues when drives or their drivers are faulty. And at that time this happened more often with cheap non-enterprise disks.
    • zbentley4 days ago
      I use ZFS for boot and storage volumes on my main workstation, which is primarily that--a workstation, not a server or NAS. Some benefits:

      - Excellent filesystem level backup facility. I can transfer snapshots to a spare drive, or send/receive to a remote (at present a spare computer, but rsync.net looks better every year I have to fix up the spare).

      - Unlike other fs-level backup solutions, the flexibility of zvols means I can easily expand or shrink the scope of what's backed up.

      - It's incredibly easy to test (and restore) backups. Pointing my to-be-backed-up volume, or my backup volume, to a previous backup snapshot is instant, and provides a complete view of the filesystem at that point in time. No "which files do you want to restore" hassles or any of that, and then I can re-point back to latest and keep stacking backups. Only Time Machine has even approached that level of simplicity in my experience, and I have tried a lot of backup tools. In general, backup tools/workflows that uphold "the test process is the restoration process, so we made the restoration process as easy and reversible as possible" are the best ones.

      - Dedup occasionally comes in useful (if e.g. I'm messing around with copies of really large AI training datasets or many terabytes of media file organization work). It's RAM-expensive, yes, but what's often not mentioned is that you can turn it on and off for a volume--if you rewrite data. So if I'm looking ahead to a week of high-volume file wrangling, I can turn dedup on where I need it, start a snapshot-and-immediately-restore of my data (or if it's not that many files, just cp them back and forth), and by the next day or so it'll be ready. Turning it off when I'm done is even simpler. I imagine that the copy cost and unpredictable memory usage mean that this kind of "toggled" approach to dedup isn't that useful for folks driving servers with ZFS, but it's outstanding on a workstation.

      - Using ZFSBootMenu outside of my OS means I can be extremely cavalier with my boot volume. Not sure if an experimental kernel upgrade is going to wreck my graphics driver? Take a snapshot and try it! Not sure if a curl | bash invocation from the internet is going to rm -rf /? Take a snapshot and try it! If my boot volume gets ruined, I can roll it back to a snapshot in the bootloader from outside of the OS. For extra paranoia I have a ZFSBootMenu EFI partition on a USB drive if I ever wreck the bootloader as well, but the odds are that if I ever break the system that bad the boot volume is damaged at the block level and can't restore local snapshots. In that case, I'd plug in the USB drive and restore a snapshot from the adjacent data volume, or my backup volume ... all without installing an OS or leaving the bootloader. The benefits of this to mental health are huge; I can tend towards a more "college me" approach to trying random shit from StackOverflow for tweaking my system without having to worry about "adult professional me" being concerned that I don't know what running some random garbage will do to my system. Being able to experiment first, and then learn what's really going on once I find what works, is very relieving and makes tinkering a much less fraught endeavor.

      - Being able to per-dataset enable/disable ARC and ZIL means that I can selectively make some actions really fast. My Steam games, for example, are in a high-ARC-bias dataset that starts prewarming (with throttled IO) in the background on boot. Game load times are extremely fast--sometimes at better than single-ext4-SSD levels--and I'm storing all my game installs on spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)!

      • E39M5S623 days ago
        It's great to hear that you're using ZFSBootMenu the way I envisioned it! There's such a sense of relief and freedom having snapshots of your whole OS taken every 15 minutes.

        One thing that you might not be aware of is that you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab. Keep in mind though that you can only have one checkpoint at a time, they keep growing and growing, and a rollback is for EVERYTHING on the pool.
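
        For reference, the same workflow from the command line looks roughly like the sketch below. It only builds and prints the commands rather than running them; `tank` is a placeholder pool name and `checkpoint_plan` is a hypothetical helper. The command-line rewind needs the pool exported first, which is awkward for a pool you are booted from, hence doing it from ZFSBootMenu.

```python
import shlex


def checkpoint_plan(pool: str) -> dict:
    """Return the zpool commands for each stage of a checkpoint-protected change."""
    return {
        # Take the checkpoint before the risky operation (one checkpoint per pool).
        "before": [["zpool", "checkpoint", pool]],
        # Change went well: discard the checkpoint so it stops holding space.
        "keep_changes": [["zpool", "checkpoint", "-d", pool]],
        # Change went badly: rewind the WHOLE pool to the checkpoint.
        "rewind": [
            ["zpool", "export", pool],
            ["zpool", "import", "--rewind-to-checkpoint", pool],
        ],
    }


for stage, commands in checkpoint_plan("tank").items():
    for argv in commands:
        print(f"{stage}: {shlex.join(argv)}")
```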

        • zbentley3 days ago
          Oh, are you zdykstra? If so, thanks for creating an invaluable tool!

          > you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab

          Good to know! Snapshots meet most of my needs at present (since my boot volume is a single fast drive, snapshots ~~ checkpoints in this case), but I could see this coming in useful for future scenarios where I need to do complex or risky things with data volumes or SAN layout changes.

  • senectus14 days ago
    Would love to use ZFS, but unfortunately Fedora just can't keep up with it...
    • kawsper4 days ago
      Not sure if it helps you at all, but I have a simple Ruby script that I use to build kernels on Fedora with a specified ZFS version.

      https://github.com/kaspergrubbe/fedora-kernel-compilation/bl...

      It builds on top of the exploded fedora kernel tree, adds zfs and spits out a .rpm that you can install with rpm -ivh.

      It doesn't play well with dkms because it tries to interfere, so I disable it on my system.

      • _factor4 days ago
        I could never get it working on rpm-ostree distros.
    • klauserc4 days ago
      I've been running Fedora on top of the excellent ZFSBootMenu[1] for about a year. You need to pay attention to the kernel versions supported by OpenZFS and might have to wait for support for a couple of weeks. The setup works fine otherwise.

      [1] https://docs.zfsbootmenu.org

    • vedranm4 days ago
      If you delay upgrading the kernel on occasion, it is more or less fine.
  • zelcon4 days ago
    Been running it since rc2. It’s insane how long this took to finally ship.
  • abrookewood4 days ago
    Can someone provide details on this bit please? "Direct IO: Allows bypassing the ARC for reads/writes, improving performance in scenarios like NVMe devices where caching may hinder efficiency".

    ARC is based in RAM, so how could it reduce performance when used with NVMe devices? They are fast, but they aren't RAM-fast ...

    • nolist_policy4 days ago
      Because with an (ARC) cache you have to copy from the app to the cache and then DMA to disk. With Direct IO you can DMA directly from the app's RAM to the disk.
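
      From the application side, a cache-bypassing write might look like the sketch below. It assumes Direct IO is requested in the usual way, by opening the file with O_DIRECT and using block-aligned buffers; the 4096-byte alignment and the helper names are assumptions, and the code falls back to an ordinary buffered write where the filesystem rejects the flag.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the pool and device


def padded_length(n: int, block: int = BLOCK) -> int:
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


def write_direct(path: str, payload: bytes) -> int:
    """Write payload, zero-padded to a block multiple, using O_DIRECT when possible."""
    size = padded_length(len(payload)) or BLOCK
    buf = mmap.mmap(-1, size)  # anonymous mappings are page-aligned and zero-filled
    buf[: len(payload)] = payload
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # the flag is absent on some platforms
    try:
        fd = os.open(path, flags | direct, 0o644)
    except OSError:
        fd = os.open(path, flags, 0o644)  # O_DIRECT rejected: use a buffered write
    try:
        return os.write(fd, buf)  # with O_DIRECT the device can read this buffer directly
    finally:
        os.close(fd)
        buf.close()
```

      The padding is the price of the alignment rules: the file ends up a multiple of the block size, so this pattern mostly suits applications such as databases that already manage their own block-sized pages.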
    • philjohn4 days ago
      Yes - interested in this too. Is this for both ARC and L2ARC, or just L2ARC?
  • happosai4 days ago
    The annual reminder that if Oracle wanted to contribute positively to the Linux ecosystem, they would update the CDDL license ZFS uses to GPL compatible.
    • ryao4 days ago
      This is the annual reply that Oracle cannot change the OpenZFS license because OpenZFS contributors removed the “or any later version” part of the license from their contributions.

      By the way, comments such as yours seem to assume that Oracle is somehow involved with OpenZFS. Oracle has no connection with OpenZFS outside of owning copyright on the original OpenSolaris sources and a few tiny commits their employees contributed before Oracle purchased Sun. Oracle has its own internal ZFS fork and they have zero interest in bringing it to Linux. They want people to either go on their cloud or buy this:

      https://www.oracle.com/storage/nas/

      • jeroenhd4 days ago
        Is there a reason the OpenZFS contributors don't want to dual-license their code? I'm not too familiar with the CDDL but I'm not sure what advantage it brings to an open source project compared to something like GPL? Having to deal with DKMS is one of the reasons why I'm sticking with BTRFS for doing ZFS-like stuff.
        • ryao4 days ago
          The OpenZFS code is based on the original OpenSolaris code, and the license used is the CDDL because that is what OpenSolaris used. Dual licensing it would require the current OpenSolaris copyright holder to agree. That is unlikely without writing a very big check. Further speculation is not a productive thing to do, but since I know a number of people assume that the OpenSolaris copyright holder is the only one preventing this, let me preemptively say that it is not so simple. Different groups have different preferred licenses. Some groups cannot stand certain licenses. Other groups might detest the idea of dual licensing in general since it causes community fragmentation whenever contributors decide to publish changes only under 1 of the 2 licenses.

          The CDDL was designed to ensure that if Sun Microsystems were acquired by a company hostile to OSS, people could still use Sun’s open source software. In particular, the CDDL has an explicit software patent grant. Some consider that to have been invaluable in preempting lawsuits from a certain company that would rather have ZFS be closed source software.

    • MauritsVB4 days ago
      Oracle changing the license would not make a huge difference to OpenZFS.

      Oracle only owns the copyright to the original Sun Microsystems code. It doesn’t apply to all ZFS implementations (probably not OracleZFS, perhaps not IllumosZFS) but in the specific case of OpenZFS the majority of the code is no longer Sun code.

      Don’t forget that SunZFS was open sourced in 2005 before Oracle bought Sun Microsystems in 2009. Oracle have created their own closed source version of ZFS but outside some Oracle shops nobody uses it (some people say Oracle stopped working on OracleZFS altogether some time ago).

      Considering the forks (first from Sun to the various open source implementations and later the fork from open source into Oracle's closed source version) were such a long time ago, there is not that much original code left. A lot of storage tech, or even entire storage concepts, did not exist when Sun open sourced ZFS. Various ZFS implementations developed their own support for TRIM, or Sequential Resilvering, or Zstd compression, or Persistent L2ARC, or Native ZFS Encryption, or Fusion Pools, or Allocation Classes, or dRAID, or RAIDZ expansion long after 2005. That is why the majority of the code in OpenZFS 2 is from long after the fork from Sun code twenty years ago.

      Modern OpenZFS contains new code contributions from Nexenta Systems, Delphix, Intel, iXsystems, Datto, Klara Systems and a whole bunch of other companies that have voluntarily offered their code when most of the non-Oracle ZFS implementations merged to become OpenZFS 2.0.

      If you'd want to relicense OpenZFS you could get Oracle to agree for the bit under Sun copyright but for the majority of the code you'd have to get a dozen or so companies to agree to relicensing their contributions (probably not that hard) and many hundreds of individual contributors over two decades (a big task and probably not worth it).

    • abrookewood4 days ago
      The only thing Oracle wants to "contribute positively to" is Larry's next yacht.
    • somat3 days ago
      Honestly the cddl being incompatible with the gpl is one of the weirder statements to come out of the fsf. It comes up every time the cddl is mentioned but no one really knows why they are incompatible, it is basically "the fsf says they are incompatible" and when really pressed, they dithered until 2016 then came up with some hand waving that the incompatibility is some minutia as to what scope each license applies to.

      The whole thing smells of some FSF agenda to me. If you ship a cddl file in your gpl project it is still a gpl licensed project and the cddl file is still a cddl licensed file.

  • bitmagier4 days ago
    Marvelous!