“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.
All this for brotli… on a read-many format like pdf zstd’s decompression speed is a much better fit.
Brotli w/o a custom dictionary is a weird choice to begin with.
That said, I personally prefer zstd as well, it's been a great general use lib.
zstd is Pareto better than brotli - compresses better and faster
brotli 1.0.7 args: -q 11 -w 24
zstd v1.5.0 args: --ultra -22 --long=31
| Original | zstd | brotli
RandomBook.pdf | 15M | 4.6M | 4.5M
Invoice.pdf | 19.3K | 16.3K | 16.1K
I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.Brotli seemed to have a very slight edge over zstd, even on the larger pdf, which I did not expect.
I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044
Results by compression type across 55 PDFs:
+------+------+-----+------+--------+
| none | zstd | xz | gzip | brotli |
+------|------|-----|------|--------|
| 47M | 45M | 39M | 38M | 37M |
+------+------+-----+------+--------+ pdftk in.pdf output out.pdf decompressthat data in pdf files are noisy and zstd should perform better on noisy files?
Something like this:
https://developer.chrome.com/blog/shared-dictionary-compress...
In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
So it might land in the spec once it has proven if offers enough value
The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.
It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.
On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.
I too would strongly prefer that they use zstd.
The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.
Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
Because I'm doing the work to patch in support across different viewers to help adoption grow. And once the big opensource ones ship it pdfjs, poppler, pdfium, adoption can quickly rise.
But reading the article I realized PDFs have become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slow.
Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.
I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.
I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.
- when jumping from page to page, you won’t have to decompress the entire file
"Brotli is a compression algorithm developed by Google."
They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.
Sheer incompetence.
I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.
I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
+------+------+-----+------+--------+
| none | zstd | xz | gzip | brotli |
+------|------|-----|------|--------|
| 47M | 45M | 39M | 38M | 37M |
+------+------+-----+------+--------+
Here's a table with all the files: +------+------+------+------+--------+
| raw | zstd | xz | gzip | brotli |
+------+------+------+------+--------+
| 12K | 12K | 12K | 12K | 12K |
| 20K | 20K | 20K | 20K | 20K | x5
| 24K | 20K | 20K | 20K | 20K | x5
| 28K | 24K | 24K | 24K | 24K |
| 28K | 24K | 24K | 24K | 24K |
| 32K | 20K | 20K | 20K | 20K | x3
| 32K | 24K | 24K | 24K | 24K |
| 40K | 32K | 32K | 32K | 32K |
| 44K | 40K | 40K | 40K | 40K |
| 44K | 40K | 40K | 40K | 40K |
| 48K | 36K | 36K | 36K | 36K |
| 48K | 48K | 48K | 48K | 48K |
| 76K | 128K | 72K | 72K | 72K |
| 84K | 140K | 84K | 80K | 80K | x7
| 88K | 136K | 76K | 76K | 76K |
| 124K | 152K | 88K | 92K | 92K |
| 124K | 152K | 92K | 96K | 92K |
| 140K | 160K | 100K | 100K | 100K |
| 152K | 188K | 128K | 128K | 132K |
| 188K | 192K | 184K | 184K | 184K |
| 264K | 256K | 240K | 244K | 240K |
| 320K | 256K | 228K | 232K | 228K |
| 440K | 448K | 408K | 408K | 408K |
| 448K | 448K | 432K | 432K | 432K |
| 516K | 384K | 376K | 384K | 376K |
| 992K | 320K | 260K | 296K | 280K |
| 1.0M | 2.0M | 1.0M | 1.0M | 1.0M |
| 1.1M | 192K | 192K | 228K | 200K |
| 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.2M | 1.1M | 1.0M | 1.0M | 1.0M |
| 1.3M | 2.0M | 1.1M | 1.1M | 1.1M |
| 1.7M | 2.0M | 1.7M | 1.7M | 1.7M |
| 1.9M | 960K | 896K | 952K | 916K |
| 2.9M | 2.0M | 1.3M | 1.4M | 1.4M |
| 3.2M | 4.0M | 3.1M | 3.1M | 3.0M |
| 3.7M | 4.0M | 3.5M | 3.5M | 3.5M |
| 6.4M | 4.0M | 4.1M | 3.7M | 3.5M |
| 6.4M | 6.0M | 6.1M | 5.8M | 5.7M |
| 9.7M | 10M | 10M | 9.5M | 9.4M |
+------+------+------+------+--------+
Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.
I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p
Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?
If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.
I can reproduce it just fine ... but only when compressing all PDFs simultaneously.
To utilize all cores, I ran:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
(and similar for the other formats).I ran this again and it produced the same 2M file from the source 1.1M file. However when I run without paralellization:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?
--ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.
Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip to be honest, both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.
But what I really don't get is all the calls to use zstd instead of brotli and treating the choise to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
~/tmp/pdfbench $ hyperfine --warmup 2 \
'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
'for x in br/*; do brotli -d >/dev/null <"$x"; done'
Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.6 ms ± 1.3 ms [User: 83.6 ms, System: 72.4 ms]
Range (min … max): 162.0 ms … 166.9 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 143.0 ms ± 1.0 ms [User: 87.6 ms, System: 43.6 ms]
Range (min … max): 141.4 ms … 145.6 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 981.7 ms ± 1.6 ms [User: 891.5 ms, System: 93.0 ms]
Range (min … max): 978.7 ms … 984.3 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 254.5 ms ± 2.5 ms [User: 172.9 ms, System: 67.4 ms]
Range (min … max): 252.3 ms … 260.5 ms 11 runs
Summary
for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done
As expected, xz is super slow. Gzip is fastest, zstd being somewhat slower, brotli slower again but still much faster than xz. +-------+-------+--------+-------+
| gzip | zstd | brotli | xz |
+-------+-------+--------+-------+
| 143ms | 165ms | 255ms | 982ms |
+-------+-------+--------+-------+
I honestly expected zstd to win here.Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:
+-------+-------+--------+-------+
| gzip | zstd | brotli | xz |
+-------+-------+--------+-------+
| 142ms | 165ms | 269ms | 994ms |
+-------+-------+--------+-------+
Raw hyperfine output: ~/tmp/pdfbench $ du -h zst2 gz xz br
37M zst2
38M gz
38M xz
37M br
~/tmp/pdfbench $ hyperfine ...
Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.5 ms ± 2.3 ms [User: 83.5 ms, System: 72.3 ms]
Range (min … max): 162.3 ms … 172.3 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 142.2 ms ± 0.9 ms [User: 87.4 ms, System: 43.1 ms]
Range (min … max): 140.8 ms … 143.9 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 993.9 ms ± 9.2 ms [User: 896.7 ms, System: 99.1 ms]
Range (min … max): 981.4 ms … 1007.2 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 269.1 ms ± 8.8 ms [User: 176.6 ms, System: 75.8 ms]
Range (min … max): 261.8 ms … 287.6 ms 10 runshttps://news.ycombinator.com/item?id=46035817
I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?
> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.
Maybe read things a bit more carefully before going all out on the snide comments?
>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."
If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.
The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.
ISO is pay to play so :shrug:
So your comment is a falsehood
https://pdfa.org/brotli-compression-coming-to-pdf/
> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.
> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.
MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.
It is my default PDF and EPUB reader, except that in very rare cases I encounter PDF files which MuPDF cannot understand, when I use other PDF readers (e.g. Okular).