Jennifer Aniston and Friends Cost Us 377GB and Broke Ext4 Hardlinks (blog.discourse.org)

36 points by speckx 6 hours ago | 7 comments

replooda 5 hours ago
In short: Deduplication efforts frustrated by hardlink limits per inode — and a solution compatible with different file systems.
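
A minimal Python sketch of that idea, not the article's actual code: hash each file, hardlink repeats to the first copy seen, and when the filesystem refuses another link with EMLINK, fall back to a real copy and link later duplicates to that one. The names `file_digest`, `link_or_copy`, and `dedupe_into` are hypothetical, and the copy-on-EMLINK fallback is my assumption about what a filesystem-agnostic fix looks like.

```python
import errno
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def link_or_copy(primary, dest):
    """Hardlink dest to primary and return the path to use as primary next time.

    If the inode has hit the filesystem's link limit (EMLINK), make a real
    copy instead and return it, so later duplicates link to the new copy.
    """
    try:
        os.link(primary, dest)
        return primary
    except OSError as e:
        if e.errno != errno.EMLINK:
            raise
        shutil.copy2(primary, dest)
        return dest


def dedupe_into(sources, dest_dir):
    """Place every source file in dest_dir, hardlinking content already seen."""
    primary_by_hash = {}
    for i, src in enumerate(sources):
        # Prefix with the index so equal basenames do not collide.
        dest = os.path.join(dest_dir, f"{i}-{os.path.basename(src)}")
        digest = file_digest(src)
        primary = primary_by_hash.get(digest)
        if primary is None:
            shutil.copy2(src, dest)
            primary_by_hash[digest] = dest
        else:
            primary_by_hash[digest] = link_or_copy(primary, dest)
    return primary_by_hash
```

With this shape, a file that appears more often than the link limit allows ends up stored a handful of times rather than once, which is the "fewer duplicates, not zero" trade-off discussed further down the thread.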
- UltraSane 4 hours ago
  The real problem is they aren't deduplicating at the filesystem level like sane people do.
  - otterley 3 hours ago
    From the article:
    > [W]e shipped an optimization. Detect duplicate files by their content hash, use hardlinks instead of downloading each copy.
    UltraSane 3 hours ago
    I meant TRANSPARENT filesystem-level dedupe. They are doing it at the application level. Filesystem-level dedupe makes it impossible to store the same file more than once and doesn't consume hardlinks for the references. It is really awesome.
    mmh0000 3 hours ago
    Filesystem/file level dedupe is for suckers. =D
    If the greatest filesystem in the world were a living being, it would be our God. That filesystem, of course, is ZFS.
    Handles this correctly:
    https://www.truenas.com/docs/references/zfsdeduplication/
    UltraSane 3 hours ago
    I was talking about block level dedupe.
    mmh0000 3 hours ago
    I thought you might be.
    I just wanted to mention ZFS.
    Have I mentioned how great ZFS is yet?
    otterley 21 minutes ago
    ZFS is great! However, it's too complicated for most Linux server use cases (especially with just one block device attached); it's not the default (root filesystem); and it's not supported for at least one major enterprise Linux distro family.
uticus 5 hours ago
And I thought this was a reference to a Win95 problem https://www.slashgear.com/1414245/jennifer-aniston-matthew-p...
- mingus88 4 hours ago
  Yeah, block-level dedupe has been an industry standard for decades. Tracking file hashes? Why?
  And I see above that this is a self-hosted platform and I still don’t get it. I was running terabytes of ZFS with dedupe=on on cheap Supermicro gear in 2012.
  - zulux 3 hours ago
    File hashes are great to get two systems to work together to dedupe themselves. I have a Windows backup that sends hashes to a backup server, so we don't back up crud we already have.
trixn86 4 hours ago
The Problem. The fix. The Limit.
Is it just me or is everybody else just as fed up with always the same AI tropes?
I've reached a point where I just close the tab the moment I read a headline "The problem". At least use tropes.fyi please
- colejohnson66 2 hours ago
  Doesn’t read like AI to me
- snickerbockers an hour ago
  Let that sink in.
dj_rock 5 hours ago
We were on a break...of your filesystem!
otterley 3 hours ago
Another reason to use XFS -- it doesn't have per-inode hard link limits.
(Some say ZFS as well, but it's not nearly as easy to use, and its license is still not GPL-friendly.)
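
For anyone who wants to see what their own filesystem reports, here is a small sketch using the POSIX `pathconf` query that Python exposes as `os.pathconf`. The helper name `hardlink_limit` is hypothetical, and the number returned is only what the kernel advertises for that path, so treat it as a hint rather than a guarantee.

```python
import os


def hardlink_limit(path="."):
    """Return the per-file hard link limit reported for path, or None.

    Uses the POSIX pathconf() variable _PC_LINK_MAX, which Python exposes
    under the name "PC_LINK_MAX" on Unix-like systems.
    """
    try:
        return os.pathconf(path, "PC_LINK_MAX")
    except (OSError, ValueError):
        # The query is not supported for this path or on this platform.
        return None


if __name__ == "__main__":
    print(hardlink_limit("."))
```

Running it against mount points on different filesystems makes the differences between them easy to compare.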
bravetraveler 5 hours ago
As is always the case, short vs long term... but I think I'd put effort into migrating to a filesystem that is aware of duplication instead of trying to recreate one with links [while retaining duplicates, just fewer].
Effectiveness is debatable; this approach still has duplication. An insignificant amount, I'll admit. The filesystem handling this at the block level is probably less problematic/prone to rework and more efficient.
edit: Eh, ignore me. I see this is preparing for [whatever filesystem hosts chose] thanks to 'ameliaquining' below. Originally thought this was all Discourse-proper, processing data they had.
- ameliaquining 4 hours ago
  Discourse is self-hostable; they can't require their users to use a filesystem that supports deduplication. (Or, well, they could, but it would greatly complicate installation and maintenance and whatnot, and also there would need to be some kind of story for existing installations.)
  - bravetraveler 4 hours ago
    Fair, I am/was confused by the hosting model and presentation. This is a nice User-preparation/consideration, I guess. I still maintain a backup filesystem unaware of duplication at the block level is a mistake.
    I completely overlooked the shipping-of-tarballs. Links make sense, here. I had 'unpacked' and relatively-local data in mind. Absolutely would not go as far as to suggest their scheme pick up 'zfs {send,receive}'/equivalent, lol.
- mikehotel 4 hours ago
  [dead]
UltraSane 4 hours ago
This makes them look rather incompetent. Storing the exact same file 246,173 times is just stupid. Dedupe at the filesystem level and make your life easier.