My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of bzip's diminished popularity relative to the non-gzip alternatives that it admits have been the more popular choices in recent years.
https://insanity.industries/post/pareto-optimal-compression/
> bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.
I've just checked again with a 1GB SQL file. `bzip2 -9` shrinks it to 83MB. `zstd -19 --long` to 52MB.
Others have compressed the Linux kernel and found that bzip2's output is about 15% larger than zstd's.
It was long surpassed by lzma and zstd.
But back in roughly the 00s, it was the best standard for compression, because the competition was DEFLATE/gzip.
You can bzip2 -9 the files in some source code directory and then tar the resulting .bz2 files. This is more or less equivalent to creating a ZIP archive with a BWT compression method. Then compare the result with tar-ing the same source directory and running bzip2 -9 on the resulting .tar.
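That comparison can be sketched with Python's stdlib `bz2` and `tarfile` modules. This is a hedged, in-memory illustration with made-up file contents, not the exact procedure the comment describes, but it shows why compress-then-tar loses: each file is compressed in isolation, so redundancy shared between files is never exploited.

```python
import bz2
import io
import tarfile

# Two hypothetical source files with lots of shared content
files = {"a.txt": b"hello world\n" * 500, "b.txt": b"hello world\n" * 500}

# Approach 1: bzip2 -9 each file, then tar the .bz2 blobs
per_file = io.BytesIO()
with tarfile.open(fileobj=per_file, mode="w") as t:
    for name, data in files.items():
        blob = bz2.compress(data, 9)
        info = tarfile.TarInfo(name + ".bz2")
        info.size = len(blob)
        t.addfile(info, io.BytesIO(blob))

# Approach 2: tar first, then bzip2 -9 the whole .tar stream
raw_tar = io.BytesIO()
with tarfile.open(fileobj=raw_tar, mode="w") as t:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        t.addfile(info, io.BytesIO(data))
whole = bz2.compress(raw_tar.getvalue(), 9)

print(len(per_file.getvalue()), len(whole))  # tar-then-compress is smaller
```

Note the per-file archive also pays tar's per-member header and padding overhead, which compression applied afterwards would have squeezed out.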
The continuous mode in RAR was something back then, exactly because RAR had long LZ77 window and compressed files as continuous stream.
'Solid compression' (as WinRAR calls it) is still optional with RAR. I recall the default is 'off'. At the time, that mode was still pretty good compared to bzip2.
https://man.archlinux.org/man/pbzip2.1.en
And zstd has been multi-threaded from the beginning.
Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.
Awesome! Please let me know if there is anything I can do to help
Consider "bananarama":
"abananaram"
"amabananar"
"ananaramab"
"anaramaban"
"aramabanan"
"bananarama"
"mabananara"
"nanaramaba"
"naramabana"
"ramabanana"
Because each line is a rotation, the last symbol of each line is followed (in the original string) by the first symbols of that same line, so its context is right there in the row. But due to the sorting, the contexts are not contiguous for the (last) character being predicted, and long-range dependencies are broken. That breakage is exactly why MTF, which implicitly transforms the raw symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.
[1] https://en.wikipedia.org/wiki/Zipf%27s_law
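The rotation-sort and the MTF step can be sketched in a few lines of Python. This is a naive O(n² log n) illustration of the transforms as described above, not how real BWT compressors are built:

```python
def bwt(s):
    # Naive BWT: sort all rotations, take the last column
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def mtf(s, alphabet):
    # Move-to-front: emit each symbol's current table position,
    # then move that symbol to the front of the table
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

b = bwt("bananarama")
print(b)                       # 'mrbnnaaaaa' -- note the run of a's
print(mtf(b, sorted(set(b))))  # mostly zeros: a Zipf-like, skewed stream
```

The runs in the BWT output become runs of zeros after MTF, which is what makes the entropy-coding stage effective.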
Given that, the author may find PPM*-based compressors more performant compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: some "durilka-bububu" compressor that uses PPM fares better than BWT, by almost a third.
In my own testing of compressing internal generic json blobs, I found brotli a clear winner when comparing space and time.
If I want higher compatibility and fast speeds, I'd probably just reach for gzip.
zstd is good for many use cases, too, perhaps even most...but I think just telling everyone to always use it isn't necessarily the best advice.
It’s slower and compresses less than zstd. gzip should only be reached for as a compatibility option; that’s the only place it wins: it’s everywhere.
EDIT: If you must use it, use the modern implementation, https://www.zlib.net/pigz/
Compressed size and decompression speed are its main limitations.
Does Gmail use a special codec for storing emails?
Yes, I do. Zstd is my preferred solution nowadays. But gzip is not going anywhere as a fallback because there is a surprisingly high number of computers without a working libzstd.
https://cbloomrants.blogspot.com/2021/03/faster-inverse-bwt....
TIL. So that's why gzip has a file header! But tar.gz compresses even better; that's probably why this hasn't caught on.
It’s handy because it is very easy to just append stuff at the end of a compressed file without having to uncompress-append-recompress. It is a bit niche but I have a couple of use cases where it makes everything simpler.
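That append-without-recompress trick works because a gzip file is a sequence of independent members, and concatenated members decompress to the concatenated data. A minimal sketch using Python's stdlib `gzip` module:

```python
import gzip

# Two chunks compressed independently, e.g. at different times
a = gzip.compress(b"hello\n")
b = gzip.compress(b"world\n")

# Appending is plain byte concatenation; no decompress/recompress cycle
combined = a + b
print(gzip.decompress(combined))  # b'hello\nworld\n'
```

The same effect at the shell level is just `cat a.gz b.gz > ab.gz`.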
That being said, speed is important for compression, so for systems like web servers etc. it's an easy sell, of course. A very strong point (and a smarter implementation in programs) for gzip.
Long comment to just say: ‘I have no idea about what I’m writing about’
These compression algorithms do not have anything to do with filesystem structure. Anyway, the reason you can’t cat together parts of bzip2 but you can with zstd (and gzip) is that zstd does everything in frames, and each frame can be decompressed separately (so you can seek and decompress parts). bzip2 doesn’t do that.
So, another place bzip2 sucks is working with large archives: you need to read through the entire archive before you can decompress the part you want, and without parity data a single corrupted spot is far more likely to cause data loss of the whole archive. Really, don’t use it unless you have a super specific use case and know the tradeoffs. For the average person it was great back when we would spend the time compressing to save the time sending over dialup.
I think their argument is sound, and it makes bzip2 less useful in certain situations. It once saved me in resolving a problem when I figured out that concatenating gzipped files just works out of the box. If not, it would have meant a bit more code, lots of additional testing, etc.
That should not happen too often, considering that IIRC bzip lasted only a couple of months before being replaced by bzip2.
https://github.com/facebook/zstd?tab=readme-ov-file#benchmar...
And, yes, I prefer LZMA over the obsolete bzip2 any day, but gzip is like the ZIP of free formats, modulo packaging, which is the job of tar.
I suggest implementing Scott's Bijective Burrows-Wheeler variant on bits rather than bytes, and do bijective run-length encoding of the resulting string. It's not exactly on the "pareto frontier", but it's fun!
Yes, there are better compression options today.