> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.
Edit: Do you have any specific advice for training a fasta compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?
I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.
Naively and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.
We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.
Happy to discuss further
I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.
Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.
This takes a very different approach, and wouldn't require a full WASM runtime. It does have the SDDL compiler and runtime, though I assume that's a lighter dependency.
[1]: https://news.ycombinator.com/item?id=45437759 F3: Open-source data file format for the future [pdf] (125 comments)
Thinking about that, you may have been confused about why I said it's reasonable to avoid WebAssembly for this. I meant that full Turing-complete execution might not be necessary if avoiding it makes it easier to ensure correctness; OpenZL graphs, for example, are not even close to a Turing-complete language.
However, we are talking about an arbitrary decompressor here. The decompressor WASM is sandboxed from the outside world and it can't wreak havoc on your system, true, but nothing stops it from producing a malicious uncompressed file from a known good compressed file.
If the compressed file is malicious, it doesn't matter whether that's because it originated from a malicious uncompressed file, or because it originated from a benign uncompressed file and the transformation into a compressed file introduced the malicious parts via the bundled custom decompressor.
When the data container is understood, the deduplication is far more efficient because now it is targeted.
Licensed as BSD-3-Clause, solid C++ implementation, well documented.
Looking forward to seeing new developments as more file formats are contributed.
Like you mention, the expandability is quite something. In a few years we might see a very capable compressor.
Honestly looks incredible. Could be amazing to provide a general framework for compressing custom formats.
Unclear if this has enough "structure" for OpenZL.
Additionally, it works well on numeric data in native format. But JSON stores it in ASCII. We can transform ASCII integers into int64 data losslessly, but it is very hard to transform ASCII floats into doubles losslessly and reliably.
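To make that concrete, here is a minimal sketch in plain C++ (not OpenZL code) of why ASCII integers round-trip through int64 while ASCII floats generally do not round-trip through double:

```cpp
// Sketch only: why ASCII integers can be transformed to int64 losslessly,
// while ASCII floats resist a lossless transform to double.
#include <cstdint>
#include <cstdio>
#include <string>

// "12345" maps to exactly one int64, and formatting it back reproduces the
// original bytes (a real transform would still need to escape oddities such
// as leading zeros or an explicit '+').
static std::string roundTripInt(const std::string& ascii) {
    int64_t v = std::stoll(ascii);
    return std::to_string(v);
}

// Many ASCII spellings ("1.10", "1.1", "1.1e0", ...) map to the same double,
// so formatting the double cannot recover the original bytes without storing
// extra per-value formatting state.
static std::string roundTripFloat(const std::string& ascii) {
    double v = std::stod(ascii);
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%g", v);
    return std::string(buf);
}

int main() {
    std::printf("%s -> %s\n", "12345", roundTripInt("12345").c_str()); // identical
    std::printf("%s -> %s\n", "1.10", roundTripFloat("1.10").c_str()); // "1.1": bytes lost
    return 0;
}
```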
However, given the work to parse the data (and/or massage it to a more friendly format), I would expect that OpenZL would work very well. Highly repetitive, numeric data with a lot of structure is where OpenZL excels.
This tends to confuse generic compressors, even though the sub-byte data itself usually clusters around the smaller lengths for most data and thus can be quite repetitive (plus it's super efficient to encode/decode). Could this be described such that OpenZL can capitalize on it?
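For illustration only (a hypothetical layout, not something from this thread): one generic way to expose that regularity is to unpack the sub-byte fields to byte alignment before compression, e.g. assuming 4-bit lengths packed two per byte:

```cpp
// Sketch: expand 4-bit length fields into one byte per value. The byte-aligned
// stream of small, repetitive values is easy for a structured compressor to
// model; a generic byte-oriented compressor only sees the combined byte values
// (0x12 vs 0x21), so it misses repetition in the individual fields.
#include <cstdint>
#include <cstdio>
#include <vector>

static std::vector<uint8_t> unpackNibbles(const std::vector<uint8_t>& packed) {
    std::vector<uint8_t> out;
    out.reserve(packed.size() * 2);
    for (uint8_t b : packed) {
        out.push_back(b >> 4);   // high nibble
        out.push_back(b & 0xF);  // low nibble
    }
    return out;
}

int main() {
    std::vector<uint8_t> packed = {0x12, 0x21, 0x13, 0x11};
    for (uint8_t v : unpackNibbles(packed))
        std::printf("%u ", static_cast<unsigned>(v)); // 1 2 2 1 1 3 1 1
    std::printf("\n");
    return 0;
}
```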
We developed OpenZL initially for our own consumption at Meta. More recently we've been putting a lot of effort into making this a usable tool for people who, you know, didn't develop OpenZL. Your feedback is welcome!
```
src/openzl/codecs/dispatch_string/encode_dispatch_string_binding.c:74: EI_dispatch_string: splitting 48000001 strings into 14 outputs
OpenZL Library Exception:
  OpenZL error code: 55
  OpenZL error string: Input does not respect conditions for this node
  OpenZL error context:
    Code: Input does not respect conditions for this node
    Message: Check `eltWidth != 2' failed
      where: lhs = (unsigned long) 4
             rhs = (unsigned long) 2
    Graph ID: 5
    Stack Trace:
      #0 doEntropyConversion (src/openzl/codecs/entropy/encode_entropy_binding.c:788): Check `eltWidth != 2' failed
         where: lhs = (unsigned long) 4
                rhs = (unsigned long) 2
      #1 EI_entropyDynamicGraph (src/openzl/codecs/entropy/encode_entropy_binding.c:860): Forwarding error:
      #2 CCTX_runGraph_internal (src/openzl/compress/cctx.c:770): Forwarding error:
      #3 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #4 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
      #5 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #6 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
      #7 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #8 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
      #9 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #10 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
      #11 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #12 CCTX_runSuccessors (src/openzl/compress/cctx.c:707): Forwarding error:
      #13 CCTX_runSuccessor_internal (src/openzl/compress/cctx.c:1149): Forwarding error:
      #14 CCTX_startCompression (src/openzl/compress/cctx.c:1276): Forwarding error:
      #15 CCTX_compressInputs_withGraphSet_stage2 (src/openzl/compress/compress2.c:116): Forwarding error:
```
On the other hand, the default CSV profile didn't seem that great either: the CSV file was 349 MB and OpenZL compressed it down to 119 MB, while a ZIP of the same CSV is 105 MB.
Any plans to make it so one format can reference another format? Sometimes data of one type occurs within another format, especially with archive files, media container files, and disk images.
So, for example, suppose someone adds a JSON format to OpenZL. Then someone else adds a tar format. While parsing a tar file, if it contains foo.json, there could be some way of saying to OpenZL, "The next 1234 bytes are in the JSON format." (Maybe OpenZL's frames would allow making context shifts like this?)
A related thing that would also be nice is non-contiguous data. Some formats include another format but break up the inner data into blocks. For example, a network capture of a TCP stream would include TCP/IP headers, but the payloads of all the packets together constitute another stream of data in a certain format. (This might get memory intensive, though, since there's multiplexing, so you may need to maintain many streams/contexts.)
Specifically, the dictionary + delta-encoded + huffman'd index lists method mentioned in TFA is commonly used for compressing weights. Weights tend to be sparse but clustered, meaning most offsets are small numbers with the occasional jump, which is great for Huffman coding.
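A minimal sketch of that delta step on a sorted, clustered index list (illustrative only; the dictionary and Huffman stages are left out):

```cpp
// Sketch: delta-encode a sorted index list of nonzero weight positions.
// Clustered indices produce mostly tiny deltas with an occasional large jump,
// a skewed distribution that a Huffman stage compresses well.
#include <cstdint>
#include <cstdio>
#include <vector>

static std::vector<uint32_t> deltaEncode(const std::vector<uint32_t>& indices) {
    std::vector<uint32_t> deltas;
    deltas.reserve(indices.size());
    uint32_t prev = 0;
    for (uint32_t v : indices) {
        deltas.push_back(v - prev); // small within a cluster, large at a jump
        prev = v;
    }
    return deltas;
}

int main() {
    std::vector<uint32_t> indices = {100, 101, 103, 104, 9000, 9001, 9004};
    for (uint32_t d : deltaEncode(indices))
        std::printf("%u ", d); // 100 1 2 1 8896 1 3
    std::printf("\n");
    return 0;
}
```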
So OpenZL is significantly better than zstd, but worse than flac.
We actually worked on a demo WAV compressor a while back. We are currently missing codecs to run the types of predictors that FLAC runs. We expect to add this kind of functionality in the future, in a generic way that isn't specific to audio, and can be used across a variety of domains.
But generally we wouldn't expect to beat FLAC. Rather, the goal is to offer specialized compressors for many types of data that previously weren't important enough to spawn a whole field of specialized compressors, by significantly lowering the bar for entry.
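To sketch the kind of generic predictor codec meant above (plain C++, not an OpenZL codec): a FLAC-style fixed order-2 predictor applied to an arbitrary integer series, leaving small, skewed residuals for a downstream entropy stage:

```cpp
// Sketch: fixed order-2 linear prediction (pred = 2*x[i-1] - x[i-2]),
// as used by FLAC's fixed predictors, but applicable to any smooth
// integer series. Only the residuals need to be entropy coded.
#include <cstdint>
#include <cstdio>
#include <vector>

static std::vector<int32_t> order2Residuals(const std::vector<int32_t>& x) {
    std::vector<int32_t> r(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        int32_t pred = 0;
        if (i >= 2)      pred = 2 * x[i - 1] - x[i - 2]; // linear extrapolation
        else if (i == 1) pred = x[0];                    // warm-up sample
        r[i] = x[i] - pred;
    }
    return r;
}

int main() {
    std::vector<int32_t> samples = {0, 10, 21, 33, 44, 54, 63};
    for (int32_t v : order2Residuals(samples))
        std::printf("%d ", v); // 0 10 1 1 -1 -1 -1
    std::printf("\n");
    return 0;
}
```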
test.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 44100 Hz
https://gist.github.com/pmarks-net/64c17aff45e7741f07eeb5dd0...
I am pumped to see this. Thanks for sharing.
Invalid argument(s):
No compressor profile or serialized compressor specified.
Same thing for the `train` command. Edit: @terrelln Got it, thank you!
https://openzl.org/getting-started/quick-start/
However, OpenZL is different in that you need to tell the compressor how to compress your data. The CLI tool has a few builtin "profiles" which you can specify with the `--profile` argument. E.g. csv, parquet, or le-u64. They can be listed with `./zli list-profiles`.
You can always use the `serial` profile, but because you haven't told OpenZL anything about your data, it will just use Zstandard under the hood. Training can learn a compressor, but it won't be able to learn a format like `.tar` today.
If you have raw numeric data you want to throw at it, or Parquets or large CSV files, that's where I would expect OpenZL to perform really well.
It could be a problem that is well-suited to machine learning, as there is a clear objective function: did compression succeed, and if so, what was the compressed size?
Down the line, we expect to improve this representation to shrink it further, which is important for small data, and to allow moving this representation, or parts of it, into a dictionary for tiny data.
[0] https://github.com/facebook/openzl/blob/d1f05d0aa7b8d80627e5...
The output is something like {precomp header}{gzip parameters}{original uncompressed data} which you can then feed to a stronger compressor.
A major use case is if you have a lot of individually gzipped archives with similar internal content, you can precomp them and then use long-range solid compression over all your archives together for massive space savings.
Or even a single gzipped archive with similar pieces of content that are more than 32KB apart.
The difference with OpenZL IIUC seems to be that it has some language that can flexibly describe a family of transformations, which can be serialized and included with the compressed data for the decoder to use. So instead of choosing between a fixed set of transformations built into the decoder ahead of time, as in PNG, you can apply arbitrary transformations (as long as they can be represented in their format).
On highly structured data where OpenZL is able to understand the format, it blows Zstandard and Xz out of the water. However, not all data fits this bill.
Are the compression speed charts all like-for-like in terms of what is hardware-accelerated vs. not?
Code: https://github.com/facebook/openzl
Documentation: https://openzl.org/
White Paper: https://arxiv.org/abs/2510.03203
You mentioned something about grid structured data being in the plans - can you give more details?
Have you done experiments with compressing BCn GPU texture formats? They have a peculiar branched structure, with multiple sub formats packed tightly in bitfields of 64- or 128-bit blocks; due to the requirement of fixed ratio and random access by the GPU they still leave some potential compression on the table.
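For context, a rough sketch of the simplest such block (BC1: 64 bits with two RGB565 endpoints and sixteen 2-bit selectors, where the endpoint ordering also selects the sub-mode; details simplified):

```cpp
// Rough sketch of a BC1 (DXT1) block: 8 bytes, fixed ratio, random-access friendly.
#include <cstdint>
#include <cstdio>

struct BC1Block {
    uint16_t color0;    // RGB565 endpoint 0
    uint16_t color1;    // RGB565 endpoint 1
    uint32_t selectors; // 16 texels x 2-bit selector (bit layout simplified here)
};

int main() {
    BC1Block b{0xF800, 0x001F, 0x55555555u}; // red/blue endpoints, every texel uses selector 1
    // The mode "branch": color0 > color1 means 4-color mode, otherwise 3 colors + punch-through alpha.
    bool fourColorMode = b.color0 > b.color1;
    std::printf("4-color mode: %d, texel 0 selector: %u\n",
                fourColorMode, b.selectors & 0x3u);
    return 0;
}
```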