Btrfs Allocator Hints(lwn.net)

57 pointsby forza_user7 months ago5 comments

forza_user7 months ago
I was surprised of the new attempt for performance profiles/device roles/hints when we already have a very good patch set maintained by kakra.
- https://github.com/kakra/linux/pull/36
- https://wiki.tnonline.net/w/Btrfs/Allocator_Hints
What do you think?
- dontdoxxme7 months ago
  > One of the reasons why these patches are not included in the kernel is that the free space calculations do not work properly.
  It seems these patches possibly fix that.
sandreas7 months ago
Well, first of all: I'm not trying to bash BTRFS at all, it probably is just not meant for me. However, I'm trying to gain information it is really considered stable (like rock solid) or it might have been a hardware Problem on my system.
I used cryptsetup with BTRFS because I encrypt all of my stuff. One day, the system froze and after reboot the partition was unrecoverably gone (the whole story[1]). Not a real problem because I had a recent backup, but somehow I lost trust in BTRFS that day. Anyone experienced something like that?
Since then I switched to ZFS (on the same hardware) and never had problems - while it was a real pain to setup until I finished my script [2], which still is kind of a collection of dirty hacks :-)
1: https://forum.cgsecurity.org/phpBB3/viewtopic.php?t=13013
2: https://github.com/sandreas/zarch
- ghostly_s7 months ago
  Yes, my story with btrfs is quite similar- used it for a couple years, suddenly threw some undocumented error and refused to mount, asked about it on the dev irc channel and was told apparently it was a known issue with no solution, have fun rebuilding from backups. No suggestion that anyone was interested in documenting this issue, let alone fixing it.
  These same people are the only ones in the world suggesting btrfs is "basically" stable. I'll never touch this project again with a ten foot pole, afaic it's run by children. I'll trust adults with my data.
  - sandreas7 months ago
    Ok, thank you. At least I'm not alone with this. However, I'm not too much into it and would not go as far to say it's not a recommendable project, but boy was I mad it just died without any way to recover ANYTHING :-)
    taskforcegemini7 months ago
    probably depends on where the issue is located, but is this not normally the case with encrypted drives?
    sandreas7 months ago
    No. Encrypted drives should be recoverable as long as you have the valid decryption values.
    I think it had nothing to do with the encryption layer... the FS layer was the problem.
  - amy2147 months ago
    I ran it on opensuse and it would 100% lock out a core on some sort of crontab tree structure rebalancing (?) .. I mean.. hello? online algorithm? Dynamic rebalancing? scheduled FS restructuring, really? ReiserFS dancing trees from 20 years ago? If thats how they think, "meh, the user just has to deal", no wonder this is how they handle bugs.
- ChocolateGod7 months ago
  I've used it as my desktops main filesystem for many years and not had any problems. I have regular snapshots with snapper. I run the latest kernel, so ZFS is not an option.
  That said, I avoid it like the plague on servers, to get acceptable performance (or avoid fragmentation) with VMs or databases you need to disable COW which disables many of it's features, so it's better just to roll with XFS (and get pseudo-snapshots anyway).
  - homebrewer7 months ago
    In the unlikely case you're running SQLite, it's possible to get okay performance on btrfs too:
    https://wiki.tnonline.net/w/Blog/SQLite_Performance_on_Btrfs
- pa7ch7 months ago
  I worked on a linux distro some years ago that had to pull btrfs long after people had started saying thats its truly solid because customers had so many issues. Its probably improved since but its hard to know. Im surprised fedora workstation defaults to it now. I'm hoping bcachefs finds its way in the next few years as being the rock solid fs it aims to be.
  - BlimpSpike7 months ago
    I hadn't heard of bcachefs, but I looked it up and apparently Linus just removed it from the kernel source tree last month for non-technical reasons.
    https://en.wikipedia.org/wiki/Bcachefs#History
    koverstreet7 months ago
    He hasn't, yet, but it's anyone's guess what's he's going to do.
    Regardless, development isn't going to stop, we may just have to switch to shipping as a DKMS module. And considering the issues we've had with getting bugfixes out that might have been the better way all along.
  - sandreas7 months ago
    Yeah what really made me wonder is that I thought I had incomplete and wrong manpages in the recovery sections... examples did not work as described, but I can't remember what it was, I was too mad and ditched it completely :-)
- gavinsyancey7 months ago
  My btrfs filesystem has been slowly eating my data for a while; large files will find their first 128k replaced with all nulls. Rewriting it will sometimes fix it temporarily, but it'll revert back to all nulls after some time. That said, this might be my fault for using raid6 for data and trying to replace a failing disk a while ago.
  - homebrewer7 months ago
    raid 5/6 is completely broken and there's not much interest in fixing it — nobody who's willing to pay for its development (which includes Facebook, SUSE, Oracle, and WD) uses raid 5/6; you shouldn't have been running it in the first place. I understand it's basically blaming the victim, but doing at least some research on the filesystem before starting to use it is a good idea in any case.
    https://btrfs.readthedocs.io/en/latest/Status.html
    edit: just checked, it says the same thing in man pages — not for production use, testing/development only.
    mdedetrich7 months ago
    iirc, btrfs has fixed the issues with raid 5/6 but it requires a breaking change to the on disk format which means you have to create an entirely new partition and copy the data over (you cannot update an existing partition to it). This new on disk format also needs its own testing.
    Your point raid 5/6 not being tested heavily by actual users is entirely on point, those enterprise heavy users are only running RAID 10 like configurations.
    If you want RAID 5/6, just use ZFS as they have solved all of these issues. I don't know if its due to sheer luck or maybe the fact is that Sun at its time was actually running RAID 5/6 in production (hard drives were not as cheap back then as they are now)?
- fpoling7 months ago
  Have you used 4K sectors with cryptsetup? Many distributions still defaults to 512 bytes if SSD reports 512 bytes as its logical size and with 512 sectors there are heavier load on the system.
  I was reluctant to use BTRFS on my Linux laptop but for the last 3 years I have been using it with 4K cryptsetup with no issues.
  - sandreas7 months ago
    I used the default archinstall... did not check the sector size, but good to hear it works for you. Maybe I'll check again with my next setup.
    homebrewer7 months ago
    FWIW, basically all of arch linux' infrastructure has been running on top of btrfs for several years, and last time I asked them, they didn't have any more problems with it than with any other filesystem.
    https://gitlab.archlinux.org/archlinux/infrastructure
- riku_iki7 months ago
  > One day, the system froze and after reboot the partition was unrecoverably gone (the whole story[1]).
  it looks like you didn't use raid, so any FS could fail in case of disk corruption.
  - sandreas7 months ago
    Thank you for your opinion. Well... it did not just fail. Cryptsetup mounted everything fine, but the BTRFS tools did not find a valid filesystem on it.
    While it could have been a bit flip that destroyed the whole encryption layer, BTRFS debugging revealed that there was some traces of BTRFS headers after mounting cryptsetup and some of the data on the decrypted partition was there...
    This probably means the encryption layer was fine. The BTRFS part just could not be repaired or restored. The only explanation I have for this that something resulted in a dirty write, which destroyed the whole partition table, the backup partition table and since I used subvolumes and could not restore anything, most of the data.
    Well, maybe it was my fault but since I'm using the exact same system with the same hardware right now (same NVMe SSD), I really doubt that.
    riku_iki7 months ago
    > Well, maybe it was my fault but since I'm using the exact same system with the same hardware right now (same NVMe SSD), I really doubt that.
    anecdotes could be exchanged in both directions: I run heavy data processing with max possible throughput on top of btrfs raid for 10 years already, and never had any data loss. I am absolutely certain if you expect data integrity while relying on single disk: it is your fault.
    theamk7 months ago
    The reliability is about variety of workloads, not amount of data or throughput. It's easy to write a filesystem which works well in the ideal case, it's the bad or unusual traffic patters which cause problem. For all that I know maybe that btrfs complete failure was because of kernel crash caused by bad USB hardware. Or there was a cosmic ray hitting memory chip.
    But you know who's fault is it? It's btrfs's one. Other filesystems don't lose entire volumes that easily.
    Over time, I've abused ext4 (and ext3) in all sorts of ways: override random sector, mount twice (via loop so kernel's double-mount detector did not work), use bad SATA hardware which introduced bit errors.. There was some data loss, and sometimes I had to manually sort though tens of thousands of files in "lost+found" but I did not lose the entire filesystem.
    I only saw the "entire partition loss" only happened to me when we tried btrfs. It was a part of ceph cluster so no actual data was lost.. but as you may guess we did not use btrfs ever again.
    riku_iki7 months ago
    > but as you may guess we did not use btrfs ever again.
    there are scenarious where btrfs is currently can't be replaced: high performance + data compression.
    theamk7 months ago
    Sure, I can believe this. Does not change the fact that some people encounter compete data loss with it.
    Sadly, there are people (and distributions) which recommend btrfs for general-purpose root filesystem, even for the cases where reliability matters much more than performance. I think that part is a mistake,
    riku_iki7 months ago
    I would recommend btrfs as general purpose root filesystem. Any FS will have people encountering data loss. I can believe btrfs has N times higher chance of data loss because its packed with features and need to maintain various complicated indexes which are easier to corrupt, but I also believe that one should be ready that his disk will fail any minute regardless of FS, and do backup/replication accordingly.
    sandreas7 months ago
    While I did that and lost near to nothing, I still think that this should not be the default approach of developing a filesystem... it should be ready to restore as much as possible in case of hardware failure or data corruption.
    riku_iki7 months ago
    there is standard approach: you setup raid, and FS will restore as much as possible and likely everything. Adding extra complexity to cover some edge cases maybe is too overkill.
    mdedetrich7 months ago
    OpenZFS does a better job here, at least if you can deal with an out of tree filesystem.
    riku_iki7 months ago
    actually, my personal benchmarks and multiple accounts in internet say it is much slower than btrfs under the load.
    mdedetrich7 months ago
    For smaller disk setups possibly but with large enough scale ZFS ends up beating out btrfs.
    riku_iki7 months ago
    I test on 2TB datasets. Do you have any specific pointers which would support your claim?
    mdedetrich7 months ago
    I have 100+TB datasets and with a large enough SSD/RAM for L1/L2 arc ZFS edges out.
    Hell even the compression algorithm that ZFS has uses/has access to (LZ4) is faster than what btrfs uses and with enough IO that matters.
    riku_iki7 months ago
    > I have 100+TB datasets and with a large enough SSD/RAM for L1/L2 arc ZFS edges out.
    and your claim is that you tested it against btrfs on the same workload? Maybe you could post some specific numbers from running command from this thread? https://www.reddit.com/r/zfs/comments/1i3yjpt/very_poor_perf...
    > Hell even the compression algorithm that ZFS has uses/has access to (LZ4) is faster than what btrfs uses and with enough IO that matters.
    lz4 compression rate was 2x vs 7x for zstd on my data (bunch of numbers), so I didn't see point of uzing lz4 compression at all because benefits are not large enough.
  - ahofmann7 months ago
    What the hell are you talking about? Any filesystem on any OS I've seen the last 3 decades had some kind of recovery path after any crash. Some of them lose more data, some of them less. But being unable to mount, is a bug that makes a filesystem untrustworthy and useless.
    And how would RAID help in that situation?
    riku_iki7 months ago
    > But being unable to mount, is a bug that makes a filesystem untrustworthy and useless.
    we are in disagreement on this. If partition table entry corrupted, you can't mount without some low level surgery.
    > And how would RAID help in that situation?
    depending on raid, your data will be duplicated on another disk, and will survive in case of one/few disks corruption.
    ahofmann7 months ago
    The partition table gets mostly written only once in the lifetime of a filesystem/disk. So it almost never corrupts during an os crash.
    There are a lot of RAIDs and configurations. Some of them may do what you describe, but most don't.
    riku_iki7 months ago
    Ok, I know this, not sure what is your point.
bjoli7 months ago
I wonder if I can use a smaller SSD for this and make it avoid HDD wakeups due to some process reading metadata. That alone would make me love this feature.
- the84727 months ago
  I think you'd rather want a cache device (or some more complicated storage tiering) for that so that both metadata and frequently accessed files get moved to that dynamically based on access patterns. Afaik btrfs doesn't support that. LVM, bcache, device mapper, bcachefs and zfs support that (though zfs would require separate caches for reading and synchronous write). And idk which of these let you control the writeback interval.
  - viraptor7 months ago
    Bcache allows lots of writeback configuration, including intervals https://www.kernel.org/doc/html/latest/admin-guide/bcache.ht...
- bionade247 months ago
  Most likely yes, but the also envisioned periodically repacking oft multiple small data extents into one big that gets written to the HDD would wake up the HDD. And if you'd make the SSD "metadata only", browser cache and logging will keep the HDD spinning.
  This feature is for performance, not the case you described.
  - bjoli7 months ago
    The disk is unused except for once a day things get backed up to it (and other places, of course). Nothing will get written to it except for when it is getting written to for the backup.
    I will definitely try this.
- ajross7 months ago
  Just buy more RAM and you get that for free. Really I guess that's my sense of patches like this in general: while sure, filesystem research has a long and storied history and it's a very hard problem in general that attracts some of the smartest people in the field to do genius-tier work...
  Does it really matter in the modern world where a vanilla two-socket rack unit has a terabyte of DRAM? Everything at scale happens in RAM these days. Everything. Replicating across datacenters gets you all the reliability you need, with none of the fussing about storage latency and block device I/O strategy.
  - bayindirh7 months ago
    Actually, it doesn't work like that.
    Sun's ZFS7420 had a terabyte of RAM per controller, and these work in tandem, and after a certain pressure, the thing can't keep up even though it also uses specialized SSDs to reduce HDD array access during requests, and these were blazingly fast boxes for their time.
    When you drive a couple thousand physical nodes with a some-petabytes sized volumes, no amount of RAM can save you. This is why Lustre divides metadata servers and volumes from file ones. You can keep very small files in metadata area (a-la Apple's 0-sized, data-in-resource-fork implementation), but for bigger data, you need to have good filesystems. There are no workarounds from this.
    If you want to go faster, take a look at Weka and GPUDirect. Again, when you are pumping tons of data to your GPUs to keep them training/inferring, no amount of RAM can keep that data (or sustain the throughput) during that chaotic access for you.
    When we talked about performance, we used to say GB/sec. Now a single SSD provides that IOPS and throughput provided by storage clusters. Instead, we talk about TB/sec in some cases. You can casually connect terabit Ethernet (or Infiniband if you prefer that) to a server with a couple of cables.
    ajross7 months ago
    > When you drive a couple thousand physical nodes with a some-petabytes sized volumes
    You aren't doing that with ZFS or btrfs, though. Datacenter-scale storage solutions (c.f. Lustre, which you mention) have long since abandoned traditional filesystem techniques like the one in the linked article. And they rely almost exclusively on RAM behavior for their performance characteristics, not the underlying storage (which usually ends up being something analogous to a pickled transaction log, it's not the format you're expected to manage per-operation)
    bayindirh7 months ago
    > You aren't doing that with ZFS or btrfs, though.
    ZFS can, and is actually designed to, handle that kind of workloads, though. At full configuration, ZFS7420 is a 84U configuration. Every disk box has its own set of "log" SSDs and 10 additional HDDs. Plus it was one of the rare systems which supported Infiniband access natively, and was able to saturate all of its Infiniband links under immense load.
    Lustre's performance is not RAM bound when driving that kind of loads, this is why MDT arrays are smaller and generally full-flash while OSTs can be selected from a mix of technologies. As I said, when driving that number of clients from a relatively small number of servers, it's not possible to keep all the metadata and query it from the RAM. Yes, Lustre recommends high RAM and core count for servers driving OSTs, but it's for file content throughput when many clients are requesting files, and we're discussing file metadata access primarily.
    ajross7 months ago
    Again I think we're talking past each other. I'm saying "traditional filesystem-based storage management is not performance-limited at scale where everything is in RAM, so I don't see value to optimizations like that". You seem to be taking as a prior that at scale everything doesn't fit in RAM, so traditional filesystem-based storage management is still needed.
    But... everything does fit in RAM at scale. I mean, Cloudflare basically runs a billion dollar business who's product is essentially "We store the internet in RAM in every city". The whole tech world is aflutter right now over a technology base that amounts to "We put the whole of human experience into GPU RAM so we can train our new overlords". It's RAM. Everything is RAM.
    I'm not saying there is "no" home for excessively tuned genius-tier filesystem-over-persistent-storage code. I'm just saying that it's not a very big home, that the market has mostly passed the technology over, and that frankly patches like the linked article seem like a waste of effort to me vs. going to Amazon and buying more RAM.
    piperswe7 months ago
    Cloudflare's cache is a tiered cache with RAM and SSDs, not just RAM.
    source: https://blog.cloudflare.com/why-we-started-putting-unpopular...
    > Our storage layer, which serves millions of cache hits per second globally, is powered by high IOPS NVMe SSDs.
    j16sdiz7 months ago
    These patches came from oracle. Pretty sure they have a client somewhere needs this.
    bayindirh7 months ago
    No, it doesn't. You think in a very static manner. Yes, you can fit websites in RAM, but you can't fit the databases powering them. Yes, you can fit some part of the videos or images you're working on or serving on RAM, but you can't store whole catalogs in RAM.
    Moreover, you again give examples from the end product. Finished sites, compacted JS files, compressed videos, compiled models...
    There's much more than that. The model is in RAM, but you need to rake tons of data over that GPU. Sometimes terabytes of data. You have raw images to process, raw video to color-grade, unfiltered scientific data to sift through. These files are huge.
    A well processed JPG from my camera is around 5MB, but RAW version I process is 25MB per frame, and it's a 24MP image, puny for today's standards. Your run of the mill 2K video takes a couple of GBs after final render at movie length. RAWs take 10s of terabytes, at minimum. Unfiltered scientific data again comes in terabytes to petabytes range depending on your project and instruments you work on, and multiple such groups pull their own big datasets to process real-time.
    In my world, nothing fits in RAM except the runtime data, and that's your application plus some intermediate data structures. The rest is read from small to gigantic files and written in files of unknown sizes, by multiple groups, simultaneously. These systems experience the real meaning of "saturation", and they would really swear at us at some cases.
    Sorry, but you can't solve this problem by buying more RAM, because these workloads can't be carried to clouds. They need to be local, transparent and fast. IOW, you need disk systems which feel like RAM. Again, look what Weka (https://www.weka.io/) does. It's one of the most visible companies which make systems behave like a huge RAM, but with multiple machines and tons of cutting edge SSDs, because what they process doesn't fit in RAM.
    Lastly, oh, there's a law which I forget its name every time, which tells you if you cache 10 most used files, you can serve up to 90% of your requests from that cache, if your request pattern is static. In cases I cite, there's no "popular" file. Everybody wants their own popular files which makes access "truly random".
    koverstreet7 months ago
    Besides ZFS (and I've heard of exabyte sized ZFS filesystems), bcachefs is absolutely capable of petabyte sized filesystems.
  - guenthert7 months ago
    Some time ago (back when we were using spinning rust) I was wondering whether one could bypass the latency of disk access when replicating to multiple hosts. I mean, how likely is it, that two hosts crash at the same time? Well, it turns out that there are some causes which take out multiple hosts simultaneously (a way too common occurrence seems to be diesel generators which fail to start after power failure). I think the good fellas at Amazon, Meta and Google even have stories to tell about a whole data center failing. So you need replication across data centers, but then network latency bites ya. Current NVMe storage devices are then faster (and for some access patterns nearly as fast as RAM).
    And that's just at the largest scale. I'm pretty sure banks still insist that the data is written to (multiple) disks (aka "stable storage") before completing a transaction.
  - homebrewer7 months ago
    > Does it really matter in the modern world
    Considering that multiple ZFS developers get paid to make ZFS work well on petabyte-sized disk arrays with SSD caching, and one of them often reports on progress in this area in his podcasts (2.5admins.com and bsdnow if you're interested) .. then yes?
  - bjoli7 months ago
    I have 128gb, of which about 100 goes unused most of the time. It seems to have zero effect on this.
hyperman17 months ago
I feel a bit lost here. In the good old days, I ran ext2/ext3/ext4 and forgot about it, or Reiserfs if I felt fancy (and which was great until it wasn't).
Now, there is a cambrian explosion going on. Ext4, xfs, btrfs,bcachefs, zfs. They each have their pros and cons, and it takes a while before you find out you run into an expensive limit. E.g. Ext3/4 is good, until it ran out of inodes. ZFS is good, but has only 1 password for full disk encryption and I want to store a second one with IT. According to the jungle drums, btrfs eats your data once in a while. Bcachefs stupidly tries to get itself rejected from Linux, not good for long term stability. I'm on XFS now, but let's see how that ends.
- rcxdude7 months ago
  That doesn't really match my recollection of timeline. I remember xfs being mentioned in the same sources contemporary with reiserfs (it predates ext3, even!). ZFS is about a decade newer, but not by much, and was probably the main reason most people would pay any real attention to their filesystem at that point, since it meaningfully added features not available in anything else at that point. BTRFS was basically a 'let's build the same thing, but in linux', but seems to have kinda stalled in terms of reliability (or at least in terms of reputation), and bcachefs is very much the new kid on the block, but seems to have a little bit more of a focus on getting to the reliability of ZFS, but it certainly is still not something to trust even as much as BTRFS. So it doesn't really feel like a cambrian explosion, more a new filesystem every ~5 years or so at a reasonably steady pace.
  (pretty much the 3 filesystems I think about ATM are ext4 as a standard boot drive, zfs for large, long-lived data storage, and FAT/exFAT for interoperability with windows. It'd have to be a pretty niche use-case for me to consider another option. BcacheFS sounds really interesting but only to experiment with right now)
- mdedetrich7 months ago
  FDE with ZFS is kind of fighting the way things are meant to be done with ZFS. ZFS allows encryption on a per dataset/zvol basis which is the officially recommended way to do encryption (see https://arstechnica.com/gadgets/2021/06/a-quick-start-guide-...)
- swinglock7 months ago
  And XFS will at unexpected shutdowns sometimes leave you with files that previously contained data now being 0 bytes.
  I only really trust ZFS on Linux, but it's such a bother it can't be upstreamed and isn't fully integrated with the native Linux caching, as the native file systems are. Ext is fine too but it's missing features like checksumming and compression, and has limitations as you mentioned.
7 months ago
undefined