[1]: https://arstechnica.com/gadgets/2016/06/zfs-the-other-new-ap...
[2]: https://ahl.dtrace.org/2016/06/19/apfs-part1/
[3]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
> HFS improved upon the Macintosh File System by adding—wait for it—hierarchy! No longer would files accumulate in a single pile; you could organize them in folders.
MFS did allow you to organize your files into folders, but on disk they were represented as a single list of files with unique names: you could have 'resume.txt' in a folder called 'Jan's Docs', but you couldn't also have 'resume.txt' in a folder called 'Jake's Docs', because every file on the disk needed a unique filename.
Not so much an issue in the days of 400KB floppy drives, but once people started getting 20 MB hard drives that was going to be an unacceptable limitation.
The other major benefit of HFS was that it stored the catalog in a B-tree, which allowed directory information to be stored effectively hierarchically, meaning you could find a directory's contents very quickly. With MFS, every file living in a single list meant that any time you wanted a directory's contents you had to read through that list of every file on the disk to see which ones belonged to that directory, so every listing of any directory was O(n) in the total number of files on the disk.
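(A toy sketch, not MFS's or HFS's actual on-disk format: with one flat catalog, every listing is a scan of the whole disk's file list, while keying entries by parent directory, which is roughly what HFS's catalog B-tree buys you, makes a listing proportional to the directory's own size.)

```python
from collections import defaultdict

# Flat, MFS-style catalog: one global list of (folder, filename) pairs.
flat_catalog = [("Jan's Docs", "resume.txt"), ("Jake's Docs", "notes.txt")]

def list_dir_flat(folder):
    # O(n) in the number of files on the whole disk
    return [name for parent, name in flat_catalog if parent == folder]

# HFS-style: entries keyed by parent directory (a dict standing in for the B-tree).
by_parent = defaultdict(list)
for parent, name in flat_catalog:
    by_parent[parent].append(name)

def list_dir_keyed(folder):
    # cost depends only on the directory being listed
    return by_parent[folder]

print(list_dir_flat("Jan's Docs"), list_dir_keyed("Jan's Docs"))
```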
However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Just trying to set it up on a single USB drive, or setting it up to mirror a pair. The net effect was that it CRUSHED the performance on my machine. It became unusable. We're talking "move the mouse, watch the pointer crawl behind" unusable. "Let's type at 300 baud" unusable. Interactive performance was shot.
After I remove it, all is right again.
Since Apple was already integrating custom SSD controllers onto their A-series SoCs, presumably the purchase was about Anobit's patents.
> Anobit appears to be applying a lot of signal processing techniques in addition to ECC to address the issue of NAND reliability and data retention... promising significant improvements in NAND longevity and reliability. At the high end Anobit promises 50,000 p/e cycles out of consumer grade MLC NAND
https://www.anandtech.com/show/5258/apple-acquires-anobit-br...
Apple has said in the past that they are addressing improved data stability at the hardware level, presumably using those acquired patents.
> Explicitly not checksumming user data is a little more interesting. The APFS engineers I talked to cited strong ECC protection within Apple storage devices... The devices have a bit error rate that's low enough to expect no errors over the device's lifetime.
https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
That's the fault of macOS. I also experienced 100% CPU and load off the charts, and it was kernel_task jammed up by USB. Once I used a Thunderbolt enclosure it started to be sane. The experience was the same across multiple non-Apple filesystems, as I was trying a bunch to see which one was best at cross-OS compatibility.
Also, separately, ZFS says "don't run ZFS on USB". I didn't have problems with it, but I knew I was rolling the dice
Anyway only bringing it up to reinforce that it is probably a macOS problem.
ZFS on Linux.
It depends on the format. An uncompressed format like BMP would limit the damage to a single pixel, while a JPEG could propagate it to potentially the entire image. There is an example of a bit flip damaging a picture here:
https://arstechnica.com/information-technology/2014/01/bitro...
That single bit flip ruined about half of the image.
As for video, that depends on how far apart the I-frames are. Any damage from a bit flip would likely be isolated to the section of video between the flip and the next I-frame. How bad it could be depends on how the encoding works.
> On the one hand, potentially, "very" robust.
Only in uncompressed files.
> But on the other, I would think that there are some very special bits that if toggled can potentially "ruin" the entire file. But I don't know.
The way that image compression works means that a single bit flip prior to decompression can affect a great many pixels, as shown at Ars Technica.
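Not the Ars example itself, but a minimal sketch of the same effect using zlib as a stand-in for an image codec: flip one bit in raw data and exactly one byte changes; flip one bit in the compressed stream and decompression either fails outright or diverges from that point on.

```python
import zlib

def flip_bit(buf: bytes, bit_index: int) -> bytes:
    b = bytearray(buf)
    b[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(b)

raw = bytes(range(256)) * 64            # 16 KiB of "pixel" data

# Uncompressed: a single flipped bit corrupts exactly one byte.
print(sum(a != b for a, b in zip(raw, flip_bit(raw, 40_000))))   # -> 1

# Compressed: the same kind of flip can take out far more than one byte.
comp = zlib.compress(raw, 9)
try:
    out = zlib.decompress(flip_bit(comp, 200))
    print(sum(a != b for a, b in zip(raw, out)))                 # usually many
except zlib.error as e:
    print("decompression failed entirely:", e)
```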
> However, I can say, every time I've tried ZFS on my iMac, it was simply a disaster.
Did you file an issue? I am not sure what the current status of the macOS driver’s production readiness is, but it will be difficult to see it improve if people do not report issues that they have.
But I agree with you about the most important feature that Apple failed on.
I lament Apple's failure to secure ZFS as much as the next Mac user, but I understood that ship to have sailed when it did, around a decade ago.
But as a desktop user, I almost threw up in my mouth when APFS came out with no checksums for user data. I'd assumed if they were going to pass on ZFS they would reimplement the most critical feature. But they didn't. I know I've lost data because of it, and if you have a few TB of data, I suspect you have, too.
Not lost permanently, because I had manually-checksummed backups. But lost in the sense that the copies of the data made on macOS were corrupted and different, and the OS/filesystem simply didn't notice. Oops!
Once you've used ZFS, that's unforgivable. I mean, ZFS has had some bugs, over the years (well mainly just last year lol), and a small number of users have lost data due to that. But to silently lose data by design? Fuck all the way off. It's 2025, not 1985.
So while APFS has some cool modern features (or at least two), it's an embarrassment. (And all my non-ephemeral storage is now on Linux.)
I think some power users would appreciate an equivalent to zfs send/recv.
I run ZFS on my main server at home (Proxmox: a Linux hypervisor based on Debian and Proxmox ships with ZFS) but...
No matter the FS, for "big" files that aren't supposed to change, I append a (partial) cryptographic checksum to the filename. For example:
20240238-familyTripBari.mp4 becomes 20240238-familyTripBari-b3-8d77e2419a36.mp4, where "-b3-" indicates the type of cryptographic hash ("b3" for BLAKE3 in my case, because it's very fast) and 8d77e2419a36 is the first x hex digits (12 here) of the cryptographic hash.
I play the video file (or whatever file it is) after I've added the checksum, so I know it's good.
I do that for movies, pictures, rips of my audio CDs (although these ones are matched with a "perfect rips" online database too), etc. Basically with everything that isn't supposed to change and that I want to keep.
I then have a shell script (which I run on several machines) that picks a random sample of the files carrying such a checksum in their filename (I choose the percentage) and verifies that each still matches. I don't verify 100% of the files all the time; typically I'll verify, say, 3% of my files, randomly, daily.
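A rough sketch of the idea (my actual tool is a shell script; here Python's stdlib blake2b stands in for BLAKE3, with a "-b2-" tag instead of "-b3-" and a 12-hex-digit truncation, all assumptions for illustration):

```python
import hashlib, random, re
from pathlib import Path

TAG = re.compile(r"-b2-([0-9a-f]{12})$")   # tag appended to the file's stem

def digest(path: Path, n_hex: int = 12) -> str:
    h = hashlib.blake2b()                  # stand-in for BLAKE3
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:n_hex]

def tag_file(path: Path) -> Path:
    """Rename foo.mp4 -> foo-b2-<12 hex digits>.mp4 so the checksum travels with the file."""
    new = path.with_name(f"{path.stem}-b2-{digest(path)}{path.suffix}")
    path.rename(new)
    return new

def verify_sample(root: Path, fraction: float = 0.03) -> None:
    """Re-hash a random ~3% of tagged files and report any mismatch."""
    tagged = [p for p in root.rglob("*") if p.is_file() and TAG.search(p.stem)]
    k = max(1, int(len(tagged) * fraction)) if tagged else 0
    for p in random.sample(tagged, k):
        if digest(p) != TAG.search(p.stem).group(1):
            print(f"CHECKSUM MISMATCH: {p}")
```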
Does it help? Well sure, yup. For whatever reason one file was corrupt on one of my systems: it's not too clear why, since the file had the correct size, but somehow a bit had flipped. During some sync, probably. And my script caught it.
The nice thing is I can copy such files on actual backups: DVDs or BluRays or cloud or whatever. The checksum is part of the filename, so I know if my file changed or not no matter the OS / backup medium / cloud or local storage / etc.
If you have "bit flip anxiety", it helps ; )
If you already have some data on ext4 disk(s) and don't want to deal with the issues of using ZFS/BTRFS then it's a no-brainer. Dynamically resizing the "array" is super simple and it works really well with MergerFS.
This approach is also good when you have multiple sources to restore from. It makes it easier to determine what is the new "source of truth."
There's something to be said, too, for backing up onto a different FS. You don't want to be stung by an FS bug, and if you are, it's good to know about it.
Unlike e.g. KDFs, checksums are built to be performant, so that verifying one is a relatively fast operation. The Blake family is about 8 cycles per byte[1], I guess a modern CPU could do [napkin math] some 500-1000 MB per second? Perhaps I'm off by an order of magnitude or two, but if the file in question is precious enough, maybe that's worth a shot?
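A rough way to sanity-check that napkin math on your own machine (blake2b from Python's standard library as a stand-in for the BLAKE family; absolute numbers vary a lot by CPU):

```python
import hashlib, time

buf = bytes(256 * 1024 * 1024)          # 256 MiB of zeros
start = time.perf_counter()
hashlib.blake2b(buf).hexdigest()
elapsed = time.perf_counter() - start
print(f"~{len(buf) / elapsed / 1e6:.0f} MB/s")
```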
Apple was already working to integrate ZFS when Oracle bought Sun.
From TFA:
> ZFS was featured in the keynotes, it was on the developer disc handed out to attendees, and it was even mentioned on the Mac OS X Server website. Apple had been working on its port since 2006 and now it was functional enough to be put on full display.
However, once Oracle bought Sun, the deal was off.
Again from TFA:
> The Apple-ZFS deal was brought for Larry Ellison's approval, the first-born child of the conquered land brought to be blessed by the new king. "I'll tell you about doing business with my best friend Steve Jobs," he apparently said, "I don't do business with my best friend Steve Jobs."
And that was the end.
At the time that NetApp filed its lawsuit I blogged about how ZFS was a straightforward evolution of BSD 4.4's log structured filesystem. I didn't know that to be the case historically, that is, I didn't know if Bonwick was inspired by LFS, but I showed how almost in every way ZFS was a simple evolution of LFS. I showed my blog to Jeff to see how he felt about it, and he didn't say much but he did acknowledge it. The point of that blog was to show that there was prior art and that NetApp's lawsuit was worthless. I pointed it out to Sun's general counsel, too.
Also, according to NetApp, "Sun started it".
https://www.networkcomputing.com/data-center-networking/neta...
[0] https://www.theregister.com/2010/09/09/oracle_netapp_zfs_dis...
Your own link states that Sun approached NetApp about patents 18 months prior to the lawsuit being filed (to be clear, that was StorageTek, before Sun acquired them):
>The suit was filed in September 2007, in Texas, three years ago, but the spat between the two started 18 months before that, according to NetApp, when Sun's lawyers contacted NetApp saying its products violated Sun patents, and requesting licensing agreements and royalties for the technologies concerned.
And there was a copy of the original email from the lawyer which I sadly did not save a copy of, as referenced here:
https://ntptest.typepad.com/dave/2007/09/sun-patent-team.htm...
As for the presentation, I can't find it at the moment but will keep looking, because I do remember it. That being said, a blog post from Val at the time specifically mentions NetApp, WAFL, and how the team thought it was cool and decided to build their own:
https://web.archive.org/web/20051231160415/http://blogs.sun....
And the original paper on ZFS that appears to have been scrubbed from the internet mentions WAFL repeatedly (and you were a co-author so I'm not sure why you're saying you didn't reference NetApp or WAFL):
https://ntptest.typepad.com/dave/2007/09/netapp-sues-sun.htm...
https://www.academia.edu/20291242/Zfs_overview
>The file system that has come closest to our design principles, other than ZFS itself, is WAFL [8], the file system used internally by Network Appliance's NFS server appliances.
That was unnecessary, but that does not betray even the slightest risk of violating NetApp's patents. It just brings attention.
Also, it's not true! The BSD 4.4 log-structured filesystem is such a close analog to ZFS that I think it's clear that it "has come closest to our design principles". I guess Bonwick et al. were not really aware of LFS. Sad.
LFS had:
- "write anywhere"
- "inode file"
- copy on write
LFS did not have:
- checksumming
- snapshots and cloning
- volume management
And the free space management story on LFS was incomplete. So ZFS can be seen as adding to LFS these things:
- checksumming
- birth transaction IDs
- snapshots, cloning, and later dedup
- proper free space management
- volume management, vdevs, raidz
I'm not familiar enough with WAFL to say how much overlap there is, but I know that LFS long predates WAFL and ZFS. LFS was prior art! Plus there was lots of literature on copy-on-write B-trees and such in the 80s, so there was lots of prior art in that space. Even content-addressed storage (CAS) (which ZFS isn't quite) had prior art.
What I DO know is that if the non-infringement were as open and shut as you and Bryan are suggesting, Apple probably wouldn't have scrapped years of effort and likely millions in R&D for no reason. It's not like they couldn't afford some lawyers to defend a frivolous lawsuit...
We don't know exactly what happened with Apple and Sun, but there were lots of indicia that Apple wanted indemnification and Sun was unwilling to go there. Why Apple really insisted on that, I don't know -- I think they should have been able to do the prior art search and know that NetApp probably wouldn't win their lawsuits, but hey, lawsuits are a somewhat random function and I guess Apple didn't want NetApp holding them by the short ones. DTrace they could remove, but removing ZFS once they were reliant on it would be much much harder.
And given what a litigious jackass Larry Ellison / Oracle is, I can't fault Apple for being nervous.
I think the truth is somewhere in the middle.
Another rumour was that Schwartz spilling the beans pissed Jobs off, which I wouldn't really put past him. Though I don't think it would have been enough to kill this.
I think all these little things added up and the end result was just "better not then".
I imagine the situation would have been different if Apple's ZFS integration had completed and shipped before Sun's demise.
They didn't rip out DTrace, after all.
The business case for providing a robust desktop filesystem simply doesn’t exist anymore.
20 years ago, (regular) people stored their data on computers and those needed to be dependable. Phones existed, but not to the extent they do today.
Fast forward 20 years, and many people don’t even own a computer (in the traditional sense, many have consoles). People now have their entire life on their phones, backed up and/or stored in the cloud.
SSDs also became “large enough” that HDDs are mostly a thing of the past in consumer computers.
Instead, today you have high-reliability hardware and software in the cloud, which arguably is much more resilient than anything you could reasonably cook up at home. Besides the hardware (power, internet, fire suppression, physical security, etc.), you're also typically looking at multi-geography redundancy across multiple data centers using Reed-Solomon erasure coding, but that's nothing the ordinary user needs to know about.
Most cloud services also offer some kind of snapshot functionality as malware protection (e.g. OneDrive offers unlimited snapshots on a rolling 30-day basis).
Truth is that most people are way better off just storing their data in the cloud and making a backup at home, though many people seem to ignore the latter, and Apple makes it exceptionally hard to automate.
You would have early warning with ZFS. You have data loss with your plan.
/s
Because any corruption at any point will get synced as a change, or worse can cause failure.
And for the thin-provisioned snapshotted subvolume usecase, btrfs is currently eating ZFS's lunch due to far better Linux integration. Think snapshots at every update, and having a/b boot to get back to a known-working config after an update. So widespread adoption through the distro route is out of the question.
Is this a technical argument? Or is this just more licensing nonsense?
> Think snapshots at every update, and having a/b boot to get back to a known-working config after an update.
I use ZFS and I have snapshots on every update? I have snapshots hourly, daily, weekly and monthly. I have triggered snapshots too, and ad hoc dynamic snapshots too. I wrote about it here: https://kimono-koans.github.io/opinionated-guide/
Solaris had this about a decade ago with beadm:
* https://docs.oracle.com/cd/E53394_01/html/E54749/gpxnl.html
FreeBSD has had it for several years as well. If Linux lacks it that is a failure of Linux (distros) not ZFS.
Also, ZFS has a bad name within the Linux community due to some licensing stuff. I find that most BSD users don't really care about such legalese and most people I know that run FreeBSD are running ZFS on root. Which works amazingly well I might add.
Especially with something like sanoid added to it, it basically does the same as Time Machine on the Mac, a feature that users love. Albeit stored on the same drive (but with syncoid, or just manually rolled zfs send/recv scripts, you can do that to another location too).
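For the manually-rolled variant, a minimal sketch of snapshot-and-replicate built on plain `zfs snapshot` / `zfs send` / `zfs receive` (dataset and host names here are made up; sanoid/syncoid wrap the same primitives and add retention policies and other niceties):

```python
import subprocess
from datetime import datetime

def snapshot(dataset: str) -> str:
    """Create pool/dataset@YYYYmmdd-HHMMSS and return the snapshot name."""
    snap = f"{dataset}@{datetime.now():%Y%m%d-%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def replicate(snap: str, prev: str | None, remote: str, target: str) -> None:
    """Pipe a (possibly incremental) zfs send into zfs receive on the backup host."""
    send = ["zfs", "send"] + (["-i", prev] if prev else []) + [snap]
    recv = ["ssh", remote, "zfs", "receive", "-F", target]
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    subprocess.run(recv, stdin=sender.stdout, check=True)
    sender.stdout.close()
    sender.wait()

# Example with hypothetical names: full send first, incrementals afterwards.
# s1 = snapshot("tank/home")
# replicate(s1, None, "backuphost", "backup/home")
```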
I don't think it's that they don't care, it's that the CDDL and BSD-ish licenses are generally believed to just not have the conflict that CDDL and GPL might. (IANAL, make your own conclusions about whether either of those are true)
I do have a feeling that Linux users in general care more about the GPL which is quite specific of course. Though I wonder if anyone chooses Linux for that reason.
But really personally I don't care whether companies give anything back, if anything I would love less corporate involvement in the OS I use. It was one of my main reasons for picking BSD. The others were a less fragmented ecosystem and less push to change things constantly.
This is out of an abundance of caution. Canonical bundle ZFS in the Ubuntu kernel and no one sued them (yet).
But really, this is a concern for distros. Not for end users. Yet many of the Linux users I speak to are somehow worried about this. Most can't even describe the provisions of the GPL so I don't really know what that's about. Just something they picked up, I guess.
None of this is a worry about being sued as an end user. But all of those are worries that your life will be harder with ZFS, and a lot harder as soon as the first lawsuits hit anyone, because all the current (small) efforts to keep it working will cease immediately.
But these days they want you to subscribe to their cloud storage so the versioning is done there, which makes sense from their commercial point of view.
I think snapshots on ZFS are better than Time Machine though. Time Machine is a bit of a clunky mess of hard links that can really go to shit on a minor corruption, leaving you with an unrestorable backup and just some vague error messages.
I worked a lot with macs and I've had my share of bad backups when trying to fix people's problems. I've not seen ZFS fail like that. It's really solid and tends to indicate issues before they lead to bigger problems.
I can't readily tell how much of the dumbness is from the filesystem and how much from the kernel, but the end result is that until it gets away from the 1980s version of file locking there's no prayer. Imagine having to explain to your boss that your .docx wasn't backed up because you left Word open over the weekend. A just catastrophically idiotic design.
I have many criticisms of NTFS like it being really bad at handling large volumes of small files. But this is something it can do well.
The lock prevents other people from copying the file or opening it even in read only, yes. But backup software can back it up just fine.
What a weird take. BSD's license is compatible with ZFS, that's why. "Don't really care?" Really? Come on.
Personally I don't care about or obey any software licenses, as a user.
But this is kinda the vibe I get from other BSD users if a license discussion comes up. Maybe it's my bubble, that's possible.
Under other licensing, developers wield an extraordinary amount of power over the users. Yes, The user could opt not to run that code, but realistically that isn't an option in the modern day. Developers can and will abuse their access to your machine to serve their ends regardless of whether it adds value to you or not. For example, how much data collection is in nearly all modern software?
Perhaps you would argue that what I've said above only applies to a very tiny minority of users who have the technical skills to actually utilize the code, and that for everyone else it's just a religious argument. I don't fully disagree with that. There is another clear benefit that even those untechnical users receive from the GPL, and that is the essentially forced contribution back from companies who want to build on top of it. I don't think there's any better example than the Linux kernel, which has gotten lots of contributions from companies that are otherwise very proprietary in nature and would never have open sourced things. This has benefited everyone and has acted as a rising tide lifting all boats. Without the requirements in the GPL, this most certainly would not have happened.
My response to that argument, however, is that those users still get a good amount of protection because the code is out there.
Simply put: the GPL has some clauses enforcing certain obligations (to prevent some rights from being taken away from you, the end user, in its wording, and I agree), and these and other clauses make it legally incompatible with the inclusion of ZFS (CDDL-licensed) in the Linux kernel (GPL). You can build it yourself (so indeed, as a user, you get to not care or obey), but not distribute the result (that is your distribution maintainer's problem).
Canonical's lawyers think this is not a problem if the ZFS code is distributed as a module, instead of compiled into the kernel itself, and since 2016 Ubuntu shipped with ZFS support.
The BSD license is considered perfectly compatible with the inclusion of CDDL licensed code and therefore many BSD distros ship with ZFS (and Dtrace) out of the box without legal worries. Indeed Oracle hasn't come knocking.
TL;DR: it's not a vibe. Some licenses are compatible with each other, some aren't. It also depends on how different licenses come into play into a "finished product" (e.g. kernel module vs monolithic build)
Afaik, the FreeBSD position is that both ZFS and UFS are fully supported and neither is secondary to the other; the installer asks whether you want ZFS, UFS, Manual (with a menu-based tool), or Shell and you do whatever, in that order, so maybe a slight preference towards ZFS.
The problem is that it is still owned by Oracle. And Solaris ZFS is incompatible with OpenZFS. Not that people really use Solaris anymore.
It is really unfortunate. Linux has adopted file systems from other operating systems before. It is just that nobody trusts Oracle.
LOL!!
I really hope they weren't friends, that really shatters my internal narrative (mainly because I can't actually picture either of them having actual friends).
This can lead to problems under sudden memory pressure. Because the ARC does not immediately release memory when the system needs it, userland pages might get swapped out instead. This behavior is more noticeable on personal computers, where memory usage patterns are highly dynamic (applications are constantly being started, used, and closed). On servers, where workloads are more static and predictable, the impact is usually less severe.
I do wonder if this is also the case on Solaris or illumos, where there is no intermediate SPL between ZFS and the kernel. If so, I don't think that a hypothetical native integration of ZFS on macOS (or even Linux) would adopt the ARC in its current form.
Not always fast enough.
The rollout of APFS a decade later validated this concern. There’s just no way that flawless transition happens so rapidly without a filesystem fit to order for Apple’s needs from Day 0.
What you describe hits my ear as more NIH syndrome than technical reality.
Apple’s transition to APFS was managed like you’d manage any kind of mass scale filesystem migration. I can’t imagine they’d have done anything differently if they’d have adopted ZFS.
Which isn’t to say they wouldn’t have modified ZFS.
But with proper driver support and testing it wouldn’t have made much difference whether they wrote their own file system or adopted an existing one. They have done a fantastic job of compartmentalizing and rationalizing their OS and user data partitions and structures. It’s not like every iPhone model has a production run that has different filesystem needs that they’d have to sort out.
There was an interesting talk given at WWDC a few years ago on this. The roll out of APFS came after they’d already tested the filesystem conversion for randomized groups of devices and then eventually every single device that upgraded to one of the point releases prior to iOS 10.3. The way they did this was to basically run the conversion in memory as a logic test against real data. At the end they’d have the super block for the new APFS volume, and on a successful exit they simply discarded it instead of writing it to persistent storage. If it errored it would send a trace back to Apple.
Huge amounts of testing and consistency in OS and user data partitioning and directory structures is a huge part of why that migration worked so flawlessly.
I don't know for certain if they could have done it with ZFS, but I can imagine it would at least have been doable with some Apple extensions that would only have to exist during test/upgrade time.
[0] Part of why the APFS upgrade was so flawless was that Apple had done a test upgrade in a prior iOS update. They'd run the updater, log any errors, and then revert the upgrade and ship the error log back to Apple for analysis.
There are probably good reasons for Apple to reinvent ZFS as APFS a decade later, but none of them technical.
I also wouldn't call the rollout of APFS flawless, per se. It's still a terrible fit for (external) hard drives and their own products don't auto convert to APFS in some cases. There was also plenty of breakage when case-sensitivity flipped on people and software, but as far as I can tell Apple just never bothered to address that.
For ZFS, there have been a lot of improvements over the years, but if they had forked it, adapted it, and then left it alone, their fork would have continued to work without outside control. They could pull in things from outside if they want, when they want; some parts more easily than others.
Now, old does not necessarily mean bad, but in this case….
https://iosref.com/ram-processor
People have run operating systems using ZFS on less.
Minor things like the indirect blocks being missing for a regular file only affect that file. Major things like all 3 copies of the MOS (the equivalent to a superblock) being gone for all uberblock entries would require recovery from backup.
If all copies of any other filesystem’s superblock were gone too, that filesystem would be equally irrecoverable and would require restoring from backup.
>> Apple can currently just take the ZFS CDDL code and incorporate it (like they did with DTrace), but it may be that they wanted a "private license" from Sun (with appropriate technical support and indemnification), and the two entities couldn't come to mutually agreeable terms.
> I cannot disclose details, but that is the essence of it.
* https://archive.is/http://mail.opensolaris.org/pipermail/zfs...
Apple took DTrace, licensed via CDDL—just like ZFS—and put it into the kernel without issue. Of course a file system is much more central to an operating system, so they wanted much more of a CYA for that.
That was the sticking point. In the context of the NetApp lawsuit Apple wanted indemnification should Sun/Oracle lose the suit.
This is the correct person: https://github.com/don-brady
Also can confirm Don is one of the kindest, nicest principal engineer level people I’ve worked with in my career. Always had time to mentor and assist.
ZFS: Apple’s New Filesystem That Wasn’t - https://news.ycombinator.com/item?id=11909606 - June 2016 (128 comments)
Is it fair to say ZFS made most sense on Solaris using Solaris Containers on SPARK?
[1]: https://www.theregister.com/2005/11/16/sun_thumper/
[2]: https://ubuntu.com/blog/zfs-is-the-fs-for-containers-in-ubun...
Although it does not change the answer to the original question, I have long been under the impression that part of the design of ZFS had been influenced by the Niagara processor. The heavily threaded ZIO pipeline had been so forward thinking that it is difficult to imagine anyone devising it unless they were thinking of the future that the Niagara processor represented.
Am I correct to think that or did knowledge of the upcoming Niagara processor not shape design decisions at all?
By the way, why did Thumper use an AMD Opteron over the UltraSPARC T1 (Niagara)? That decision seems contrary to the idea of putting all of the wood behind one arrow.
As for Thumper using Opteron over Niagara: that was due to many reasons, both technological (Niagara was interesting but not world-beating) and organizational (Thumper was a result of the acquisition of Kealia, which was independently developing on AMD).
And there is also the Stratis project Red Hat is involved in: https://stratis-storage.github.io/
Still no checksumming though...
Sun salespeople tried to sell us the idea of "ZFS filesystems are very cheap, you can create many of them, you don't need quotas" (which ZFS didn't have at the time), which we tried out. It was abysmally slow. It was even slow with just one filesystem on it. We scrapped the whole idea, just put Linux on them, and suddenly fileserver performance doubled, which is something we weren't used to with older Solaris/SPARC/UFS or VxFS systems.
We never tried another generation of those, and soon after Sun was bought by Oracle anyways.
You mean SPARC. And no, ZFS stands alone. But yes, containers were a lot faster to create using ZFS.
Note: sound drops out for a couple minutes at 1:30 mark but comes back.
A very long time ago, someone named cyberjock was a prolific and opinionated proponent of ZFS, who wrote many things about ZFS during a time when the hobbyist community was tiny and not very familiar with how to use it and how it worked. Unfortunately, some of their most misguided and/or outdated thoughts still haunt modern consciousness like an egregore.
What you are probably thinking of is the proposed doomsday scenario where bad ram could theoretically kill a ZFS pool during a scrub.
This article does a good job of explaining how that might happen, and why being concerned about it is tilting at windmills: https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...
I have never once heard of this happening in real life.
Hell, I’ve never even had bad ram. I have had bad sata/sas cables, and a bad disk though. ZFS faithfully informed me there was a problem, which no other file system would have done. I’ve seen other people that start getting corruption when sata/sas controllers go bad or overheat, which again is detected by ZFS.
What actually destroys pools is user error, followed very distantly by plain old fashioned ZFS bugs that someone with an unlucky edge case ran into.
To what degree can you separate this claim from "I've never noticed RAM failures"?
I got into overclocking both regular and ECC DDR4 RAM for a while when AMD's 1st-gen Ryzen stuff came out, thanks to ASRock's X399 motherboard unofficially supporting ECC, allowing both its function and the reporting of errors (produced when overclocking).
Based on my own testing and issues seen from others, regular memory has quite a bit of leeway before it becomes unstable, and memory that’s generating errors tends to constantly crash the system, or do so under certain workloads.
Of course, without ECC you can't prove every single operation has been fault-free, but at some point you call it close enough.
I am of the opinion that ECC memory is the best memory to overclock, precisely because you can prove stability simply by using the system.
All that said, as things become smaller with tighter specifications to squeeze out faster performance, I do grow more leery of intermittent single errors that occur on the order of weeks or months in newer generations of hardware. I was once able to overclock my memory to the edge of what I thought was stability, as it passed all tests for days, but every month or two there'd be a few corrected errors showing up in my logs. Typically, any sort of instability is caught by manual tests within minutes or the hour.
ZFS does not need or benefit from ECC memory any more than any other FS. The bitflip corrupted the data, regardless of ZFS. Any other FS is just oblivious, ZFS will at least tell you your data is corrupt but happily keep operating.
> ZFS' RAM-hungry nature
ZFS is not really RAM-hungry, unless one uses deduplication (which is not enabled by default, nor generally recommended). It can often seem RAM hungry on Linux because the ARC is not counted as “cache” like the page cache is.
---
ZFS docs say as much as well: https://openzfs.github.io/openzfs-docs/Project%20and%20Commu...
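On Linux you can see this for yourself: OpenZFS exposes the ARC's current size and cap in /proc/spl/kstat/zfs/arcstats, and that memory shows up as "used" rather than "buff/cache" in tools like free. A minimal sketch (assuming the usual three-column kstat layout):

```python
from pathlib import Path

def arcstats() -> dict[str, int]:
    stats = {}
    # Skip the two kstat header lines, then parse "name type data" rows.
    for line in Path("/proc/spl/kstat/zfs/arcstats").read_text().splitlines()[2:]:
        name, _type, value = line.split()
        stats[name] = int(value)
    return stats

s = arcstats()
print(f"ARC size: {s['size'] / 2**30:.1f} GiB of {s['c_max'] / 2**30:.1f} GiB max")
```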
Neither here nor there, but DTrace was ported to iPhone--it was shown to me in hushed tones in the back of an auditorium once...
[1]: https://arstechnica.com/gadgets/2016/06/a-zfs-developers-ana...
[2]: https://ahl.dtrace.org/2016/06/19/apfs-part5/#checksums
That is a notorious myth.
https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...
I don't think it is. I've never heard of that happening, or seen any evidence ZFS is more likely to break than any random filesystem. I've only seen people spreading paranoid rumors based on a couple pages saying ECC memory is important to fully get the benefits of ZFS.
Some of the things they say aren't credible, even if they're said often.
You don't need an enormous amount of ram to run zfs unless you have dedupe enabled. A lot of people thought they wanted dedupe enabled though. (2024's fast dedupe may help, but probably the right answer for most people is not to use dedupe)
It's the same thing with the "need" for ECC. If your ram is bad, you're going to end up with bad data in your filesystem. With ZFS, you're likely to find out your filesystem is corrupt (although, if the data is corrupted before the checksum is calculated, then the checksum doesn't help); with a non-checksumming filesystem, you may get lucky and not have meta data get corrupted and the OS keeps going, just some of your files are wrong. Having ECC would be better, but there's tradeoffs so it never made sense for me to use it at home; zfs still works and is protecting me from disk contents changing, even if what was written could be wrong.
I have a 64TB ZFS pool at home (12x8TB drives in an 11w1s RAID-Z3) on a personal media server. The machine has been up for months. It's using 3 GiB of RAM (including the ARC) out of the 32 I put in it.
If you have no mirrors and no raidz and no ditto blocks then errors cause problems, yes. Early on they would cause panics.
But this isn't ZFS "corrupting itself", rather, it's ZFS saving itself and you from corruption, and the price you pay for that is that you need to add redundancy (mirrors, raidz, or ditto blocks). It's not a bad deal. Some prefer not to know.
What's a bit flip?
Usually attributed to "cosmic rays", but really can happen for any number of less exciting sounding reasons.
Basically, there is zero double checking in your computer for almost everything except stuff that goes across the network. Memory and disks are not checked for correctness, basically ever, on any machine anywhere. Many servers (but certainly not all) are the rare exception when it comes to memory safety. They usually have ECC (Error Correction Code) memory, basically a checksum on the memory to ensure that if memory is corrupted, it's noticed and fixed.
Essentially every filesystem everywhere does zero data integrity checking:
MacOS APFS: Nope
Windows NTFS: Nope
Linux EXT4: Nope
BSD's UFS: Nope
Your mobile phone: Nope
ZFS is the rare exception: a filesystem that actually double-checks that the data you save to it is the data you get back from it. Every other filesystem is just a big ball of unknown data. You probably get back what you put in, but there are zero promises or guarantees.

I'm not sure that's really accurate -- all modern hard drives and SSDs use error-correcting codes, as far as I know.
That's different from implementing additional integrity checking at the filesystem level. But it's definitely there to begin with.
But there is ABSOLUTELY NO checksum for the bits stored on an SSD. So bit rot in the cells of the SSD goes undetected.
It has been years since I was familiar enough with the insides of SSDs to tell you exactly what they are doing now, but even ~10-15 years ago it was normal for each raw 2k block to actually be ~2176+ bytes and use at least 128 bytes for LDPC codes. Since then the block sizes have gone up (which reduces the number of bytes you need to achieve equivalent protection) and the lithography has shrunk (which increases the raw error rate).
Where exactly the error correction is implemented (individual dies, SSD controller, etc) and how it is reported can vary depending on the application, but I can say with assurance that there is no chance your OS sees uncorrected bits from your flash dies.
While true, there are zero promises that what you meant to save and what gets saved are the same thing. All the drive mostly promises is that if it safely wrote XYZ to the disk and you come back later, you should expect to get XYZ back.
There are lots of weasel words there on purpose. There is generally zero guarantee in reality and drives lie all the time about data being safely written to disk, even if it wasn't actually safely written to disk yet. This means on power failure/interruption the outcome of being able to read XYZ back is 100% unknown. Drive Manufacturers make zero promises here.
On most consumer compute, there are no promises or guarantees that what you wrote on day 1 will be there on day 2+. It mostly works, and the chances are better than even that your data will be mostly safe on day 2+, but there are zero promises or guarantees. We know how to guarantee it, we just don't bother (usually).
You can buy laptops and desktops with ECC RAM and use ZFS(or other checksumming FS), but basically nobody does. I'm not aware of any mobile phones that offer either option.
I'm not really sure what point you're trying to make. It's using ECC, so they should be the same bytes.
There isn't infinite reliability, but nothing has infinite reliability. File checksums don't provide infinite reliability either, because the checksum itself can be corrupted.
You keep talking about promises and guarantees, but there aren't any. All there is are statistical rates of reliability. Even ECC RAM or file checksums don't offer perfect guarantees.
For daily consumer use, the level of ECC built into disks is generally plenty sufficient. It's chosen to be so.
We have 10k+ consumer devices at work and corruption is not exactly common, but it's not rare either. A few cases a year are usually identified at the helpdesk level. It seems to be going down over time, since hardware is getting more reliable, we have a strong replacement program and most people don't store stuff locally anymore. Our shared network drives all live on machines with ECC & ZFS.
We had a cloud provider recently move some VM's to new hardware for us, the ones with ZFS filesystems noticed corruption, the ones with ext4/NTFS/etc filesystems didn't notice any corruption. We made the provider move them all again, the second time around ZFS came up clean. Without ZFS we would have never known, as none of the EXT4/NTFS FS's complained at all. Who knows if all the ext4/NTFS machines were corruption free, it's anyone's guess.
You can see some stats using `smartctl`.
My point was, on most consumer compute, there are no promises or guarantees that what you see on day 1 will be there on day 2. It mostly works, and the chances are better than even that your data will be mostly safe on day 2, but there are zero promises or guarantees, even though we know how to do it. Some systems do: those with ECC memory and ZFS, for example. Other filesystems also support checksumming, BTRFS being the most common counter-example to ZFS, even though parts of BTRFS are still completely broken (see their status page for details).
This is so not true.
All the high speed busses (QPI, UPI, DMI, PCIe, etc.) have "bit flip" protection in multiple layers: differential pair signaling, 8b/10b (or higher) encoding, and packet CRCs.
Hard drives (the old spinning rust kind) store data along with a CRC.
SSD/NVMe drives use strong ECC because raw flash memory flips so many bits that it is unusable without it.
If most filesystems don't do integrity checks it's probably because there's not much need to.
> If most filesystems don't do integrity checks it's probably because there's not much need to.
I would disagree that disks alone are good enough for daily consumer use. I see corruption often enough to be annoying with consumer-grade hardware without ECC & ZFS. Small images are where people usually notice: they tend to be heavily compressed, and their small size means minor changes can be more noticeable. In larger files, corruption tends to not get noticed as much, in my experience.
ZFS has been in production workloads since 2005, 20 years now. It's proven to be very safe.
BTRFS has known fundamental issues past one disk. It is, however, improving. I will say BTRFS is fine for a single drive. Even the developers, last I checked (a few years ago), don't really recommend it past a single drive, though hopefully that's changing over time.
I'm not familiar enough with bcachefs to comment.
But the real win (at least for me): every device I have (laptops, desktops, server, even PLCs, which now use FreeBSD under the covers plus ZFS) all back up using ZFS snapshots and replication.
I do not ever worry about finding an old file I accidentally deleted, or restoring a backup to a new machine and "did it really include everything" or anything else.
The machine storing backups is itself replicated to another machine in my detached garage.
If i wanted even more security, i could trivially further replicate it to offsite storage in the same manner.
All of this takes ~0 time to set up, and require 0 maintenance to keep working.
Meanwhile, Apple has gone backwards: Time Machine can't even make actual full system backups anymore.