What's (still) super impressive is the 48 drives. Looking around, the common "storage" nodes in a rack these days seem to be 24x 24TB CMR HDDs + 2x 7.68TB NVMe SSDs (and a 960GB boot disk). I don't know if anyone really uses 48-drive systems commonly (outside of edge cases like Backblaze and friends)
https://www.seagate.com/au/en/innovation/multi-actuator-hard...
Edit: found it (or at least one) "MACH.2 is the world’s first multi-actuator hard drive technology, containing two independent actuators that transfer data concurrently."
World's first my ass. Seagate should know better, since it was them who acquired Conner Peripherals some thirty years ago. Conner's "Chinook" drives had two independent arms. https://en.wikipedia.org/wiki/Conner_Peripherals#/media/File...
That means, if you access files of exactly that size, you'd "only" halve your IOPS.
HDDs are quite fine for data chunks in the megabytes.
Exactly. SSD fanboys, show me a similarly priced 30 TB SSD and we can discuss. A bit like internal combustion vs e-car: the new tech is in principle simpler and cheaper, in practice simpler and pricier, with the promise of "one day". But I suppose LCDs were once in a similar place, so it may be a matter of time.
Make sure to check the "annual powered-on hours" entry in the spec sheet though, sometimes it can be significantly less than ~8766 hours.
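If you want to compare that rating against how long a given drive has actually been running, SMART keeps a power-on-hours counter. A quick check with smartmontools (a sketch; /dev/sda is a placeholder for your drive):
# Dump SMART attributes and pull out the power-on-hours counter
sudo smartctl -A /dev/sda | grep -i power_on_hours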
You need 2 files: the mount file and the automount file. Keep this or something similar as a skeleton somewhere and copy it over as needed
# /etc/systemd/system/full-path-drive-name.mount
[Unit]
Description=Some description of drive to mount
Documentation=man:systemd.mount(5)
[Mount]
# Find with `lsblk -f`
What=/dev/disk/by-uuid/1abc234d-5efg-hi6j-k7lm-no8p9qrs0ruv
# Must match the unit file name with '/' escaped to '-' (see `systemd-escape`), e.g. /full/path/drive/name -> full-path-drive-name.mount
Where=/full/path/drive/name
# https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/storage_administration_guide/sect-using_the_mount_command-mounting-options#sect-Using_the_mount_Command-Mounting-Options
Options=defaults,noatime
# Fails if mounting takes longer than this (change as appropriate)
TimeoutSec=1m
[Install]
# Defines when the mount gets pulled in during boot. See `man systemd.special`
WantedBy=multi-user.target
# /etc/systemd/system/full-path-drive-name.automount
[Unit]
Description=Automount system to complement systemd mount file
Documentation=man:systemd.automount(5)
Conflicts=umount.target
Before=umount.target
[Automount]
Where=/full/path/drive/name
# If not accessed for 15 minutes the filesystem is unmounted, letting the drive spin down (change as appropriate)
TimeoutIdleSec=15min
[Install]
WantedBy=local-fs.target
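To put these to use: both unit files have to be named after the escaped mount path, and then you typically enable just the automount so the drive only gets mounted on first access. Roughly, with the placeholder path above:
# Work out the required unit file name from the mount path
systemd-escape -p --suffix=mount /full/path/drive/name    # -> full-path-drive-name.mount
# After dropping both files into /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now full-path-drive-name.automount
# First access triggers the actual mount
ls /full/path/drive/name && systemctl status full-path-drive-name.mount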
Now that's still a lot faster than 100MB/s. But I have a lot of recertified drives, and while some of them hit the advertised numbers, others have settled at around 100MB/s. You could argue something is wrong with them, but they're in a RAID and I don't need them to be fast. That's what the SSD cache is for.
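For what it's worth, a quick and dirty way to sanity-check that kind of sequential figure (rough buffered-read timing only; /dev/sdX is a placeholder):
# Time sequential reads straight off the device; run it a few times and average
sudo hdparm -t /dev/sdX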
It happens whenever there is a progress indicator. I get obsessed with monitoring and verifying.
If anything the opposite has occurred. HDD scaling has largely flattened. Going from 1986 -> 2014, HDD size increased by 10x every 5.3 years [1]. If that scaling had kept going we should have 100TB+ drives by now. I say this not as a complaint, but there have been direct implications for ZFS.
All this data is stuck behind an interface whose speed is (realistically, once a file system & kernel are involved) hard limited to 200MiB/s-300MiB/s. Recovery times skyrocket, as you simply cannot rebuild parity or copy data fast enough. The whole reason stuff like dRAID [2] was created is so larger pools can recover in less than a day, by doing sequential rebuilds onto hot-spare capacity that is spread across all the drives (roughly 1/N per drive) ahead of time.
---
1. Not the most reliable source, but it is a friday afternoon https://old.reddit.com/r/DataHoarder/comments/spoek4/hdd_cap...
2. https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAI... for concept, for motivations & implementation details see -> https://www.youtube.com/watch?v=xPU3rIHyCTs
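To put rough numbers on why recovery time is the problem (back-of-envelope, assuming a 24TB drive and ~250MiB/s sustained):
# Hours just to read or write one 24 TB drive end to end at 250 MiB/s
echo $(( 24000000000000 / (250 * 1048576) / 3600 )) hours    # ~25 hours
And that's the ideal sequential case; a traditional healing resilver is largely random I/O, which is exactly where dRAID's sequential rebuild wins.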
It kinda sucks that things have flatlined a bit, but it's still cool that a lot of this has become way cheaper. I think NVMe at these prices and sizes really makes caching a reasonable thing to do for consumer-grade storage.
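If you're already running ZFS (as elsewhere in the thread), bolting a cheap NVMe onto a pool of big HDDs as a read cache is about one command. A sketch, assuming a pool named tank and an otherwise unused /dev/nvme0n1:
# Add the NVMe as L2ARC; losing it later is harmless, and it can be removed with `zpool remove`
sudo zpool add tank cache /dev/nvme0n1
# Confirm it shows up and watch it warm
zpool status tank
zpool iostat -v tank 5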
The biggest cost driver for flash chips is not the speed they can be read from and written to in bursts, but how resilient they are (how many times they can be written over) and their sustained speed (both depend on the tech in use: TLC, SLC, MLC, 3D NAND, wear-levelling logic...). Even for SATA speeds, you need the very best flash for sustained throughput.
Still, SATA SSDs make sense since they can use the full SATA bandwidth and have low latency compared to HDDs.
So the (lack of) price difference is not really surprising.
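The burst-vs-sustained gap is easy to see for yourself with fio; a rough sketch (filename, size and queue depth are placeholders, and the file needs to be big enough to blow past any SLC cache):
# Sustained sequential write; the reported bandwidth reflects the post-cache rate, not the burst rate
fio --name=sustained-write --filename=/mnt/ssd/fio.test --size=100G \
    --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=8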
Please don't perpetuate the weird misconception that "SSD" refers specifically to SATA SSDs and that NVMe SSDs aren't SSDs.
At my first job we paid 6 figures for a 256GB RAM machine. Now I see homelabbers grabbing old servers with that much memory for their hobby projects.
This offer is already gone but for sure there'll be a new one, VPS prices tend to trend downwards.
I'm sure I'm compromising on reliability, but they are going to be part of my hobby K8s cluster, so a bit of downtime will make for a nice stress test ;)
It sounds like a very good deal, unless they mean 8GB SSD storage, in which case it's a normal sort of deal.
Last time I read about it here on HN, ZFS still seemed to have edge-case bugs. Has it matured now? Why don't distros such as Debian just ship ZFS as the default instead of ext4?
I'll tell you why I use it: Built-in compression, and data checksums (so it can catch and potentially correct corruption). Both are extra useful on storage arrays.
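Both are basically one-liners to turn on and verify. A sketch, assuming a dataset named tank/data:
# Enable transparent compression and see what it's saving
sudo zfs set compression=lz4 tank/data
zfs get compression,compressratio tank/data
# Walk every block and verify checksums (repairs from redundancy where it can)
sudo zpool scrub tank
zpool status -v tank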
> Last time I read about it here on HN, ZFS still seemed to have edge-case bugs. Has it matured now?
The only significant bugs I've heard of in a long time are with encryption. Still not ideal, but not a show-stopper.
> Why don't distros such as Debian just ship ZFS as the default instead of ext4?
(The following is an oversimplification; I'm not a lawyer, so take it with a grain of salt.) There is, or at least may be, a licensing conflict between ZFS under the CDDL and Linux under the GPLv2. Both are good open source licenses, but it is at best unclear whether it's legal to distribute binaries that mix code under the two. That makes it really messy to distribute a Linux distro using ZFS. (The generally accepted solution is to compile on-device, since the problem only arises with binaries.)
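That compile-on-device route is what Debian's packaging does: ZFS lives in contrib as a DKMS source package and the module gets built locally against your kernel. Roughly (package names as on current Debian; check your release, and contrib has to be enabled):
sudo apt install linux-headers-amd64 zfs-dkms zfsutils-linux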
Due to licensing, it can't be included in the Linux kernel directly, so it doesn't see the same level of scrutiny as the rest of the kernel, but it is arguably used in production more successfully than btrfs (comparable feature set, in-kernel, but not maintained as well anymore).