85 points by signa11 8 hours ago | 14 comments
  • saghm 6 hours ago
    Early on the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are briefly handwaving about the decoding times probably being similar.

    My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of the diminished popularity of bzip in favor of the non-gzip alternatives that it admits are the more popular choices in recent years.

  • idoubtit 4 hours ago
    My experience does not match theirs when compressing text and code:

    > bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

    I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.

    Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.
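    A quick way to reproduce this kind of comparison with only the Python stdlib (zstd isn't in the stdlib, so xz/lzma stands in for the modern-LZ side; the sample below is made up, and real ratios depend heavily on the input):

```python
import bz2
import lzma

# Repetitive, code-like sample; a real SQL dump will behave differently.
sample = b"SELECT id, name FROM users WHERE id = 42;\n" * 5000
bz_size = len(bz2.compress(sample, 9))             # bzip2 -9 equivalent
xz_size = len(lzma.compress(sample, preset=9))     # xz -9 equivalent
print(len(sample), bz_size, xz_size)
```

On inputs with lots of long-range redundancy, the LZMA side tends to pull ahead, which matches the 1 GB SQL result above.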

    • cogman10 3 hours ago
      bzip is old and slow.

      It was long surpassed by lzma and zstd.

      But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.

      • duskwuff 3 hours ago
        Also potentially relevant: in the 00s, the performance gap between gzip and bzip2 wasn't quite as wide - gzip has benefited far more from modern CPU optimizations - and slow networks / small disks made a higher compression ratio more valuable.
      • yyyk 3 hours ago
        Even then, there were better options in the Windows world (RAR/ACE/etc.). Also, bzip2 was considered slow even when it was new.
        • thesz 2 hours ago
          RAR/ACE/etc. used continuous compression - all files were concatenated and compressed as if they were one single large file, much like what is done with .tar.bz2. Bzip on Windows did not do that; there was no equivalent of .tar.bz2 on Windows.

          You can bzip2 -9 the files in some source code directory and tar the resulting .bz2 files; that is more or less equivalent to creating a ZIP archive with a BWT compression method. Then compare that with tar-ing the same source directory and running bzip2 -9 on the resulting .tar.

          The continuous mode in RAR was really something back then, precisely because RAR had a long LZ77 window and compressed files as a continuous stream.
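          The solid-vs-per-file effect is easy to demonstrate with a stdlib sketch (bz2 standing in for any non-solid archiver; the file contents are made up):

```python
import bz2

# 50 near-identical "source files": redundancy exists across files,
# not within any single one.
files = [b"int main(void) { return 0; } /* file %d */\n" % i for i in range(50)]

# Per-file compression pays the per-stream overhead 50 times and
# cannot exploit cross-file redundancy.
separate = sum(len(bz2.compress(f, 9)) for f in files)

# "Solid" compression sees all files as one stream.
solid = len(bz2.compress(b"".join(files), 9))
print(separate, solid)
```

The solid size comes out far smaller, which is exactly why .tar.bz2 (tar first, compress second) behaves like RAR's continuous mode.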

          • yyyk an hour ago
            >continuous compression

            'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.

    • 8n4vidtmkvmk 2 hours ago
      And here i got best compression out of xz for SQL.
  • eichin an hour ago
    Interesting detail on the algorithm, but it seems to completely miss that if you care about non-streaming performance, there are parallel versions of xz and gzip (pxzip encodes compatible metadata about the break-up points, so plain xz can still decompress the output while pxzip can use as many cores as you let it have). Great for disk-image OS installers, which is the reason I was benchmarking it in the first place. But this was about 5 years back; I don't know if those changes have been upstreamed...
  • fl0ki 6 hours ago
    This seems as good a thread as any to mention that the gzhttp package in klauspost/compress for Go now supports zstd on both server handlers and client transports. Strangely this was added in a patch version instead of a minor version despite both expanding the API surface and changing default behavior.

    https://github.com/klauspost/compress/releases/tag/v1.18.4

    • klauspost 5 hours ago
      About the versioning, glad you spotted it anyway. There isn't as much use of the gzhttp package compared to the other ones, so the bar is a bit higher for that one.

      Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.

      • fl0ki 5 hours ago
        Yeah, I make it a habit to read the changelogs of every update to every direct dependency. I was anticipating this change for years, thanks for doing it!
      • terrelln 4 hours ago
        > Also making good progress on getting a slimmer version of zstd into the stdlib

        Awesome! Please let me know if there is anything I can do to help

  • thesz 2 hours ago
    BWT is a prediction by partial match (PPM) in disguise.

    Consider "bananarama":

      "abananaram"
      "amabananar"
      "ananaramab"
      "anaramaban"
      "aramabanan"
      "bananarama"
      "mabananara"
      "nanaramaba"
      "naramabana"
      "ramabanana"
    
    The last symbol on each line gets context from the first symbols of the same line; that is a consequence of the rotation.

    But, due to sorting, the contexts are not contiguous for the predicted (last) character, and long-range dependencies are broken. That is why MTF, which implicitly transforms raw symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.

    [1] https://en.wikipedia.org/wiki/Zipf%27s_law
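    The rotation-sort and MTF steps above can be sketched in a few lines (plain, non-invertible BWT for illustration; a real codec also stores the original row index or a sentinel so the transform can be undone):

```python
def bwt(s):
    # Sort all rotations of s and take the last column.
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def mtf(data):
    # Move-to-front: recently seen symbols get small indices, turning the
    # local clustering in BWT output into a skewed, Zipf-like distribution.
    table = sorted(set(data))
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

    On "bananarama", bwt returns "mrbnnaaaaa" (the last column of the sorted table above), which mtf turns into [2, 4, 3, 4, 0, 4, 0, 0, 0, 0]: mostly small numbers, exactly the skew an entropy coder likes.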

    Given that, the author may find PPM-based compressors to perform better compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than BWT, by almost a third.

  • hexxagone 6 hours ago
    Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").
  • joecool1029 6 hours ago
    Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.
    • hexxagone 6 hours ago
      In the LZ high compression regime where LZ can compete in terms of ratio, BWT compressors are faster to compress and slower to decompress than LZ codecs. BWT compressors are also more amenable to parallelization (check bsc and kanzi for modern implementations besides bzip3).
    • silisili 6 hours ago
      I'd argue it's more workload dependent, and everything is a tradeoff.

      In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.

      If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

      zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.

      • joecool1029 6 hours ago
        > If I want higher compatibility and fast speeds, I'd probably just reach for gzip.

        It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option; that’s the only place it wins: it’s everywhere.

        EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/

    • NooneAtAll3 4 hours ago
      why would one even care about compression speed on a Minecraft ComputerCraft machine?

      size and decompression are the main limitations

  • pella 6 hours ago
    imho: the future is a specialized compressor optimized for your specific format. ( https://openzl.org/ , ... )
    • cgag 4 hours ago
      This seems very cool. Was going to suggest submitting it, but I see there was a fairly popular thread 5 months ago for anyone interested: https://news.ycombinator.com/item?id=45492803
    • srean 5 hours ago
      That is an interesting link.

      Does Gmail use a special codec for storing emails?

      • duskwuff 2 hours ago
        The biggest savings for a service like GMail are going to be based around deduplication - e.g. if you can recognize that a newsletter went out to a thousand subscribers and store those all as deltas from a "canonical" copy - congratulations, that's >1000:1 compression, better than you could achieve with any general-purpose compression. Similarly, if you can recognize that an email is an Amazon shipping confirmation or a Facebook message notification or some other commonly repeated "form letter", you can achieve huge savings by factoring out all the common elements in them, like images or stylesheets.
        • dataflow an hour ago
          I kind of doubt they would do this to be honest. Every near-copy of a message is going to have small differences in at least the envelope (not sure if encoding differences are also possible depending on the server), and possibly going to be under different guarantees or jurisdictions. And it would just take one mistake to screw things up and leak data from one person to another. All for saving a few gigabytes over an account's lifetime. Doesn't really seem worth it, does it?
  • elophanto_agent 7 hours ago
    bzip2 is the compression algorithm equivalent of that one coworker who does incredible work but nobody ever talks about. meanwhile gzip gets all the credit because it's "good enough"
    • kergonath 6 hours ago
      Bzip2 is slow. That’s the main issue. Gzip is good enough and much faster. Also, the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files is not a deal breaker, but it is annoying.
      • nine_k 6 hours ago
        Gzip is woefully old. Its only redeeming value is that it's already built into some old tools. Otherwise, use zstd, which is better and faster, both at compression and decompression. There's no reason to use gzip in anything new, except for backwards compatibility with something old.
        • duskwuff 3 hours ago
          One other redeeming quality that gzip/deflate does have is its low memory requirement (~32 KB per stream). If you're running on an embedded device, or if you're serving a ton of compressed streams at the same time, this can be a meaningful benefit.
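          With Python's zlib bindings, for instance, the window and internal-state bounds are explicit parameters (wbits=15 is the 32 KiB maximum window; both knobs can be lowered further on constrained devices, at some cost in ratio):

```python
import zlib

# wbits bounds the LZ77 window (2**wbits bytes, max 32 KiB);
# memLevel bounds the compressor's internal state.
co = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=15, memLevel=8)
payload = b"sensor-reading: 23.5C\n" * 200
packed = co.compress(payload) + co.flush()
print(len(payload), len(packed))
```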
        • kergonath 6 hours ago
          > Otherwise, use zstd, which is better and faster

          Yes, I do. Zstd is my preferred solution nowadays. But gzip is not going anywhere as a fallback because there is a surprisingly high number of computers without a working libzstd.

      • duskwuff 4 hours ago
        bzip2 is particularly slow because the transform it depends on (the BWT) is "intrinsically slow" - it depends on cache-unfriendly operations with long dependency chains, preventing the CPU from extracting any parallelism:

        https://cbloomrants.blogspot.com/2021/03/faster-inverse-bwt....

      • sedatk 5 hours ago
        > the fact that you cannot get a valid bzip2 file by cat-ing 2 compressed files

        TIL. Now that's why gzip has a file header! But, tar.gz compresses even better, that's probably why it hasn't caught on.

        • pocksuppet 5 hours ago
          tar packs multiple files into one. If you concatenate two gzipped files and unzip them, you just get a concatenated file.
          • sedatk 5 hours ago
            Ah okay, I thought gzip would support decompressing multiple files that way.
            • kergonath 3 hours ago
              How it works is, if you have two files foo.gz and bar.gz, and cat foo.gz bar.gz > foobar.gz, then foobar.gz is a valid gzip file and uncompresses to a single file with the contents of foo and bar.

              It’s handy because it is very easy to just append stuff at the end of a compressed file without having to uncompress-append-recompress. It is a bit niche but I have a couple of use cases where it makes everything simpler.
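              The same property is visible from Python's stdlib, which accepts multi-member gzip streams:

```python
import gzip

# Compress two payloads independently...
part1 = gzip.compress(b"hello ")
part2 = gzip.compress(b"world")

# ...byte-concatenate the two gzip files; the result is itself valid gzip
# and decompresses to the concatenation of the payloads.
combined = part1 + part2
print(gzip.decompress(combined))
```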

            • bmacho an hour ago
              tar supports that type of concatenation, so you can concatenate tar.gz files and unpack them all into separate files
      • saidnooneever 6 hours ago
        the catting issue might be more a problem with the bzip program's implementation than with the algorithm (it could expect an array of compressed files). That would only be impossible if the program couldn't determine the length of the data from the file header, which again is technically not a property of the compression algorithm but of the file format it is carried in.

        that being said, speed is important for compression, so for systems like webservers it's an easy sell, of course. A very strong point (and a smarter implementation in programs) for gzip.

        • nine_k 6 hours ago
          Bzip2 is great for files that are compressed once, get decompressed many times, and the size is important. A good example is a software release.
          • pocksuppet 5 hours ago
            So is xz, or zstd, and the files are smaller. bzip2 disappeared from software releases when xz was widely available. gzip often remains, as the most compatible option, the FAT32 of compression algorithms.
        • joecool1029 6 hours ago
          > the catting issue might be more an implementation of bzip program problem than algorithm (it could expect an array of compressed files). that would only be impossible if the program cannot reason about the length of data from file header, which again is technically not something about compression algo but rather file format its carried through.

          Long comment to just say: ‘I have no idea about what I’m writing about’

          These compression algorithms do not have anything to do with filesystem structure. Anyway the reason you can’t cat together parts of bzip2 but you can with zstd (and gzip) is because zstd does everything in frames and everything in those frames can be decompressed separately (so you can seek and decompress parts). Bzip2 doesn’t do that.

          So like, another place bzip2 sucks ass is working with large archives because you need to seek the entire archive before you can decompress it and it makes situations without parity data way more likely to cause dataloss of the whole archive. Really, don’t use it unless you have a super specific use case and know the tradeoffs, for the average person it was great when we would spend the time compressing to save the time sending over dialup.
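          The frame idea can be sketched generically (zstd isn't in the Python stdlib, so each "frame" below is an independent zlib stream plus an offset index; this illustrates the seekability argument, not zstd's actual format):

```python
import zlib

def pack(data, frame_size=4096):
    # Compress fixed-size blocks independently and record where each
    # compressed frame starts, so any frame can be read in isolation.
    frames, offsets, pos = [], [], 0
    for i in range(0, len(data), frame_size):
        frame = zlib.compress(data[i:i + frame_size])
        offsets.append(pos)
        frames.append(frame)
        pos += len(frame)
    return b"".join(frames), offsets

def read_frame(blob, offsets, n):
    # Decompress one frame without touching the rest of the archive.
    end = offsets[n + 1] if n + 1 < len(offsets) else len(blob)
    return zlib.decompress(blob[offsets[n]:end])
```

          A single-stream format like bzip2 has no such index, which is why you must decompress from the start.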

          • prerok an hour ago
            So... it's actually a reasonable objection to bzip2? I mean, you explained why it does not work with bzip2.

            I think their argument is sound, and it makes bzip2 less useful in certain situations. I was once saved in resolving a problem when I figured out that concatenating gzipped files just works out of the box. If it hadn't, it would have meant a bit more code, lots of additional testing, etc.

      • stefan_ 6 hours ago
        bzip and gzip are both horrible, terribly slow. Wherever I see "gz" or "bz" I immediately rip that nonsense out for zstd. There is such a thing as a right choice, and zstd is it every time.
        • kergonath 2 hours ago
          > Wherever I see "gz" or "bz"

          That should not happen too often, considering that IIRC bzip lasted only a couple of months before being replaced by bzip2.

        • laurencerowe 5 hours ago
          lz4 can still be the right choice when decompression speed matters. It's almost twice as fast at decompression with similar compression ratios to zstd's fast setting.

          https://github.com/facebook/zstd?tab=readme-ov-file#benchmar...

        • anthk 3 hours ago
          pigz is damn fast at compressing. Also, a VAX with NetBSD can run gzip. So there you go: try these new fancy formats on a VAX, I dare you.

          And, yes, I prefer LZMA over the obsolete bzip2 any day, but gzip is like the ZIP of free formats, modulo packaging, which is the job of tar.

    • singpolyma3 2 hours ago
      Neither has been good enough for years.
  • vintermann 2 hours ago
    If you're implementing it for ComputerCraft anyway, there's no reason to stick to the standard. It's well known that bzip2 has a couple of extra steps which don't improve compression ratio at all.

    I suggest implementing Scott's Bijective Burrows-Wheeler variant on bits rather than bytes, and do bijective run-length encoding of the resulting string. It's not exactly on the "pareto frontier", but it's fun!

  • Grom_PE 5 hours ago
    PPMd (of 7-Zip) would beat BZip2 for compressing plain text data.
  • cobbzilla 3 hours ago
    My first exposure to bzip: The first Linux kernels I ever compiled & built myself (iirc ~v2.0.x), I packed as .tar.bz2 images. Ah the memories.

    Yes, there are better compression options today.
