ISO PDF spec is getting Brotli – ~20% smaller documents with no quality loss (pdfa.org)

169 points by whizzx 16 days ago | 15 comments

ericpauley16 days ago
Some real cognitive dissonance in this article…
“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.
All this for brotli… on a read-many format like PDF, zstd’s decompression speed is a much better fit.
- xxs16 days ago
  Yup, zstd is better. Overall, use zstd for pretty much anything that can benefit from general-purpose compression. It's a beyond-excellent library, tool, and algorithm (or set of algorithms).
  Brotli w/o a custom dictionary is a weird choice to begin with.
  - adzm16 days ago
    Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.
    That said, I personally prefer zstd as well, it's been a great general use lib.
    dist-epoch16 days ago
    You need to crank up zstd compression level.
    zstd is Pareto better than brotli - compresses better and faster
    atiedebee16 days ago
    I thought the same, so I ran brotli and zstd on some PDFs I had laying around.
    brotli 1.0.7 args: -q 11 -w 24
    zstd v1.5.0 args: --ultra -22 --long=31
                   | Original | zstd  | brotli
    RandomBook.pdf | 15M      | 4.6M  | 4.5M
    Invoice.pdf    | 19.3K    | 16.3K | 16.1K
    I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.
    Brotli seemed to have a very slight edge over zstd, even on the larger pdf, which I did not expect.
    mort9616 days ago
    EDIT: Something weird is going on here. When compressing with zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158
    I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044
    Results by compression type across 55 PDFs:
    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
    mort9616 days ago
    Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them which reports the size on disk, which is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.
    Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
    +---------+---------+--------+--------+--------+
    | none    | zstd    | xz     | gzip   | brotli |
    +---------|---------|--------|--------|--------|
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+
    These numbers are much more impressive. Still, Brotli has a slight edge.
    tracker115 days ago
    Worth considering the compress/decompress overhead, which is also lower in brotli than zstd from my understanding.
    Also, worth testing zopfli since its decompression is gzip compatible.
    mrspuratic16 days ago
    > I couldn't quickly find a way to decompress them
    pdftk in.pdf output out.pdf uncompress
    Thoreandan16 days ago
    Does your source .pdf material have FlateDecode'd chunks or did you fully uncompress it?
    atiedebee15 days ago
    I wasn't sure. I just went in with the (probably faulty) assumption that if it compresses to less than 90% of the original size that it had enough "non-randomness" to compare compression performance.
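One rough way to answer that question is to count how often a PDF's raw bytes name the /FlateDecode filter. The sketch below is a heuristic under stated assumptions, not a PDF parser: it only scans the file as bytes, so it misses filters declared inside compressed object streams, and the function name `flate_summary` is made up for illustration.

```python
import re

# Heuristic only: scans raw bytes for the /FlateDecode filter name.
# It is not a PDF parser and will miss filters that are themselves
# hidden inside compressed object streams.
FLATE_NAME = re.compile(rb"/FlateDecode\b")
# Matches the "stream" keyword that opens a stream, but not "endstream".
STREAM_KEYWORD = re.compile(rb"\bstream\r?\n")


def flate_summary(pdf_bytes: bytes) -> dict:
    """Return rough counts of streams and /FlateDecode mentions."""
    return {
        "streams": len(STREAM_KEYWORD.findall(pdf_bytes)),
        "flate_filters": len(FLATE_NAME.findall(pdf_bytes)),
    }


# Tiny synthetic input: one Flate-filtered stream, one unfiltered stream.
sample = (
    b"1 0 obj\n<< /Length 5 /Filter /FlateDecode >>\nstream\nabcde\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 3 >>\nstream\nxyz\nendstream\nendobj\n"
)
summary = flate_summary(sample)
```

If most streams carry the filter, a generic compressor is mostly re-compressing Deflate output, which is the concern raised in the question.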
    atiedebee15 days ago
    Ran the tests again with some more files, this time decompressing the PDFs in advance. I picked some widely available PDFs to make the experiment reproducible.
    file            | raw         | zstd (%)           | brotli (%)         |
    gawk.pdf        | 8.068.092   | 1.437.529 (17.8%)  | 1.376.106 (17.1%)  |
    shannon.pdf     | 335.009     | 68.739 (20.5%)     | 65.978 (19.6%)     |
    attention.pdf   | 24.742.418  | 367.367 (1.4%)     | 362.578 (1.4%)     |
    learnopengl.pdf | 253.041.425 | 37.756.229 (14.9%) | 35.223.532 (13.9%) |
    For learnopengl.pdf I also tested the decompression performance, since it is such a large file, and got the following (less surprising) results using 'perf stat -r 5':
    zstd:   0.4532 +- 0.0216 seconds time elapsed ( +- 4.77% )
    brotli: 0.7641 +- 0.0242 seconds time elapsed ( +- 3.17% )
    The conclusion seems to be consistent with what brotli's authors have said: brotli achieves slightly better compression, at the cost of decompressing at a little over half of zstd's speed.
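The trade-off can be made explicit with a little arithmetic on the figures posted above. This is only a sketch: the numbers are copied from the comment, and the helper names are invented for illustration.

```python
# Figures copied from the benchmark table and perf output in the comment above.
# Each entry: (raw bytes, zstd bytes, brotli bytes)
SIZES = {
    "gawk.pdf": (8_068_092, 1_437_529, 1_376_106),
    "shannon.pdf": (335_009, 68_739, 65_978),
    "attention.pdf": (24_742_418, 367_367, 362_578),
    "learnopengl.pdf": (253_041_425, 37_756_229, 35_223_532),
}
ZSTD_SECONDS = 0.4532    # learnopengl.pdf decompression time
BROTLI_SECONDS = 0.7641


def brotli_saving_vs_zstd(zstd_size: int, brotli_size: int) -> float:
    """Fraction by which brotli's output is smaller than zstd's."""
    return (zstd_size - brotli_size) / zstd_size


def relative_speed(fast_seconds: float, slow_seconds: float) -> float:
    """Throughput of the slower decoder as a fraction of the faster one."""
    return fast_seconds / slow_seconds


for name, (_raw, z, b) in SIZES.items():
    print(f"{name}: brotli output is {brotli_saving_vs_zstd(z, b):.1%} smaller than zstd's")
print(f"brotli decodes at {relative_speed(ZSTD_SECONDS, BROTLI_SECONDS):.0%} of zstd's speed")
```

On these four files the size advantage is a few percent of the compressed output, while the decode-time gap on the largest file is much wider.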
    order-matters16 days ago
    What assumption can we point to as the reason for the counter-intuitive result?
    That data in PDF files is noisy and zstd should perform better on noisy files?
    jeffbee16 days ago
    What's counter-intuitive about this outcome?
    order-matters16 days ago
    Maybe that was too strongly worded, but there was an expectation for zstd to outperform, so the fact that it didn't means the result was unexpected. I generally find it helpful to understand why something performs better than expected.
    mort9616 days ago
    Isn't zstd primarily designed to provide decent compression ratios at amazing speeds? The reason it's exciting is mainly that you can add compression to places where it didn't necessarily make sense before because it's almost free in terms of CPU and memory consumption. I don't think it has ever had a stated goal of beating compression ratio focused algorithms like brotli on compression ratio.
    sgerenser16 days ago
    I actually thought zstd was supposed to be better than Brotli in most cases, but a bit of searching reveals you're right... Brotli, especially at the highest compression levels (10/11), often exceeds zstd at the highest compression levels (20-22). Both are very slow at those levels, although perfectly suitable for "compress once, decompress many" applications, of which PDF is obviously one.
    jeffbee16 days ago
    Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.
    xxs16 days ago
    On max compression (brotli's 11 vs zstd's 22) of text, brotli will be around 3-4% denser... and a lot slower. Decompression-wise, zstd is over 2x faster.
    The pdfs you have are already compressed with deflate (zip).
    DetroitThrow16 days ago
    I love zstd but this isn't necessarily true.
    dchest16 days ago
    Not with small files.
    Dylan1680716 days ago
    If that's about using predefined dictionaries, zstd can use them too.
    If brotli has a different advantage on small source files, you have my curiosity.
    If you're talking about max compression, zstd likely loses out there, the answer seems to vary based on the tests I look at, but it seems to be better across a very wide range.
    dchest14 days ago
    No, it's literally just compressing small files without training a zstd dict or plugging in external dictionaries (not counting the built-in one that brotli has). Especially for English text, brotli at the same speed as zstd gives better results for small data (in the kilobyte to a few megabytes range).
    itsdesmond16 days ago
    > Pareto
    I don’t think you’re using that correctly.
    wizzwizz416 days ago
    It's a correct use of Pareto, short for Pareto frontier, if the claim being made is "for every needed compression ratio, zstd is faster; and for every needed time budget, zstd compresses better". (Whether this claim is true is another matter.)
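Pareto dominance in this sense can be checked mechanically. The sketch below is a simplification: it compares a single measurement per compressor (compressed size and decompression time for learnopengl.pdf, as posted by atiedebee in this thread) rather than the full curve over all compression levels, and the `Result` and `dominates` names are invented for illustration.

```python
from typing import NamedTuple


class Result(NamedTuple):
    size_bytes: int   # smaller is better
    seconds: float    # smaller is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.size_bytes <= b.size_bytes and a.seconds <= b.seconds
    strictly_better = a.size_bytes < b.size_bytes or a.seconds < b.seconds
    return no_worse and strictly_better


# learnopengl.pdf figures posted by atiedebee in this thread:
# compressed size in bytes and decompression time in seconds.
zstd = Result(37_756_229, 0.4532)
brotli = Result(35_223_532, 0.7641)

print("zstd dominates brotli:", dominates(zstd, brotli))
print("brotli dominates zstd:", dominates(brotli, zstd))
```

On that one data point neither compressor dominates the other: zstd is faster to decode, brotli is smaller.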
    stonogo16 days ago
    brotli is ubiquitous because Google recommends it. While Deflate definitely sucks and is old, Google ships brotli in Chrome, and since Chrome is the de facto default platform nowadays, I'd imagine it was chosen because it was the lowest-effort lift.
    Nevertheless, I expect this to be JBIG2 all over again: almost nobody will use this because we've got decades of devices and software in the wild that can't, and 20% filesize savings is pointless if your destination can't read the damn thing.
  - deepsun16 days ago
    Brotli compresses my files way better, but it's doing it way slower. Anyway, universal statement "zstd is better" is not valid.
    xxs15 days ago
    On max compression ("--ultra -22"), zstd is likely to be 2-4% less dense (larger) on text-like input. While taking over 2x times to compress. Decompression is also much faster, usually over 2x.
    I have not tried using a dictionary for zstd.
  - greenavocado16 days ago
    This bizarre move has all the hallmarks of embrace-extend-extinguish rather than technical excellence.
- mmooss16 days ago
  Note the language: "You're not creating broken files—you're creating files that are ahead of their time."
  Imagine a sales meeting where someone pitched that to you. They have to be joking, right?
  I have no objection to adding Brotli, but I hope they take compatibility more seriously. You may need readers to deploy it for a long time - ten years? - before you deploy it in PDF creation tools.
  - nxobject16 days ago
    (sarcasm warning...)
    You're absolutely right! It's not just an inaccurate slogan—it's a patronizing use of artificial intelligence. What you're describing is not just true, it's precise.
    mmooss15 days ago
    I don't understand your point ...
    eventualcomp15 days ago
    The commenter is making a joke about the style of delivery of the sentence you quoted, because the style is [1]characteristic of AI generated writing.
    [1]https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
- spider-mario15 days ago
  > on a read-many format like pdf zstd’s decompression speed is a much better fit.
  brotli decompression is already plenty fast. For PDFs, zstd’s advantage in decompression speed is academic.
- deepsun16 days ago
  Well, except for speed, compression algorithms need to be compared in terms of compression, you know.
  Here's discussion by brotli's and zstd's staff:
  https://news.ycombinator.com/item?id=19678985
bhouston16 days ago
Are they using a custom dictionary with Brotli designed for PDFs? I am not sure if it would help or not, but it seems like one of those cases it may help?
Something like this:
https://developer.chrome.com/blog/shared-dictionary-compress...
In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
- whizzx16 days ago
  The PDF Association is still running experiments on whether or not to support custom dictionaries, based on the gains seen in real-life workloads.
  So it might land in the spec once it has been shown to offer enough value.
- Proclus16 days ago
  It seems they're using the standard dictionary, which is utterly bizarre.
  The standard Brotli dictionary bakes in a ton of assumptions about what the Web looked like in 2015, including not just which HTML tags were particularly common but also such things as which swear words were trendy.
  It doesn't seem reasonable to think that PDFs have symbol probabilities remotely similar to the web corpus Google used to come up with that dictionary.
  On top of that, it seems utterly daft to be baking that into a format which is expected to fit archival use cases and thus impose that 2015 dictionary on PDF readers for a century to come.
  I too would strongly prefer that they use zstd.
  - bhouston16 days ago
    BTW I've looked into custom dictionaries before for similar use cases and I suspect it would only offer like a 1% improvement or so for PDFs -- still good, but not a massive difference maker. The issue is that PDFs, like web pages, are incredibly repetitive in terms of their tags/structure. As such the custom dictionary only helps if the doc is really small, otherwise because of the repetitive nature, the self-inferred dictionary will resemble the custom dictionary after just a few blocks of PDF content.
    The sole exception is if they are restarting the brotli stream for each page, and they are not sharing a dictionary, custom or inferred across the whole doc. Then the dictionary will have to be re-inferred on each page, and then a shared custom dictionary would make more sense.
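The effect described here, where a preset dictionary matters for small or restarted streams and fades on long repetitive ones, can be illustrated with Deflate's preset-dictionary support in Python's standard `zlib` module. This is a stand-in technique, not brotli or zstd, and `PDF_DICT` is a hypothetical dictionary invented for this sketch, so the numbers only show the shape of the effect.

```python
import zlib

# Hypothetical preset dictionary of common PDF tokens, invented for this sketch.
PDF_DICT = (
    b"<< /Type /Page /Parent /Resources /Font /MediaBox [0 0 612 792] "
    b"/Contents >>\nendobj\nstream\nendstream\n"
    b"BT /F1 12 Tf 72 720 Td () Tj ET\n"
)


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with Deflate, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def saving(data: bytes) -> float:
    """Fraction of compressed bytes saved by priming with PDF_DICT."""
    plain = len(deflate(data))
    primed = len(deflate(data, PDF_DICT))
    return (plain - primed) / plain


# One short content stream versus a long, repetitive one.
small = b"BT /F1 12 Tf 72 720 Td (Invoice 1042) Tj ET\n"
large = b"".join(
    b"BT /F1 12 Tf 72 %d Td (Line %d) Tj ET\n" % (720 - i, i) for i in range(2000)
)

print(f"small stream: {saving(small):.1%} saved by the dictionary")
print(f"large stream: {saving(large):.1%} saved by the dictionary")
```

The short stream benefits noticeably because it has no history of its own to refer back to; the long one builds that history within the first few lines.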
bobpaw16 days ago
How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.
Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
- whizzx16 days ago
  It's prototype-ish work to support it before it lands in the official specification, but it will indeed take some time to be adopted.
  I'm doing the work to patch support into different viewers to help adoption grow. Once the big open-source ones (pdfjs, poppler, pdfium) ship it, adoption can rise quickly.
  - croes16 days ago
    There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF
nialse16 days ago
Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.
superkuh16 days ago
This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made PDF an unreliable format. The aforementioned is a PDF without content that pulls down all its content from the web. It even still, ostensibly, supports Flash (swf) animations. In practice these "PDF"s are just empty white pages with an error message like:
>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."
- kayodelycaon16 days ago
  Fortunately, XFA is deprecated. I haven’t seen one of those for a very long time.
  - superkuh16 days ago
    Maybe in spec, but the damage is done and persists.
    The (USA) Wisconsin Dept. of Natural Resources has nearly all their regulation PDFs as these XFA non-PDFs that I cannot read. So I cannot know the regulations. My emails about this topic (a dozen or so, sent to multiple addresses over many years) have gone unanswered.
    If Acrobat supports it, it doesn't matter what the spec says. Until Adobe drops XFA from Acrobat and forces these extremely silly people to stop, PDF is no longer PDF.
ndriscoll16 days ago
What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed. HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems like it's better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with bro vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).
- wongarsu16 days ago
  Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.
- Someone16 days ago
  - inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text
  - when jumping from page to page, you won’t have to decompress the entire file
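The second point can be sketched with a toy container in which every page is compressed on its own and an index records where each one lives. This uses `zlib` from the Python standard library as a stand-in for whichever filter a PDF would use, and the `pack_pages` / `read_page` layout is invented for illustration; it is not how a PDF cross-reference table is actually encoded.

```python
import zlib


def pack_pages(pages: list[bytes]) -> tuple[bytes, list[tuple[int, int]]]:
    """Compress each page on its own and record (offset, length) per page."""
    blob = bytearray()
    index = []
    for page in pages:
        compressed = zlib.compress(page, 9)
        index.append((len(blob), len(compressed)))
        blob += compressed
    return bytes(blob), index


def read_page(blob: bytes, index: list[tuple[int, int]], page_no: int) -> bytes:
    """Decompress a single page; the other pages are never touched."""
    offset, length = index[page_no]
    return zlib.decompress(blob[offset:offset + length])


# 50 synthetic pages of repetitive text.
pages = [f"Page {n}: ".encode() + b"lorem ipsum " * 200 for n in range(50)]
blob, index = pack_pages(pages)
page_17 = read_page(blob, index, 17)
```

Jumping to page 17 only reads the slice named by `index[17]`, whereas a file wrapped in a single outer compression stream would, without extra seeking conventions, have to be decoded from the start.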
  - wizzwizz416 days ago
    > inside the file, the compressor can be varied according to the file content. For example, images can use jpeg, but that isn’t useful for compressing text
    Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.
    > when jumping from page to page, you won’t have to decompress the entire file
    This is already a thing with any compression format that supports quasi-random access, which is most of them. The answers to https://stackoverflow.com/q/429987/5223757 discuss a wide variety of tools for producing (and seeking into) such files, which can be read normally by tools not familiar with the conventions in use.
    Someone15 days ago
    > Okay, so we make a compressed container format that can perform such shenanigans, for the same amount of back-compat issues as extending PDF in this way.
    Far from the same amount:
    - existing tools that split PDFs into pages will remain working
    - if defensively programmed, existing PDF readers will be able to render PDFs containing JPEG XL images, except for the images themselves.
- eru16 days ago
  Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or your favourite compression format), instead of ending up with PDF.
  Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.
  - dunham16 days ago
    Don't you end up with PDF if you start with PS and restrict it to a subset? And maybe normalize the structure of the file a little. The structure is nice when you want to take the content and draw a bit more on the page. Or when subsetting/combining files.
    I suspect PDF was fairly sane in the initial incarnation, and it's the extra garbage that they've added since then that is a source of pain.
    I'm not a big fan of this additional change (nor any of the javascript/etc), but I would be fine with people leaving content streams uncompressed and running the whole file through brotli or something.
    eru16 days ago
    > Don't you end up with PDF if you start with PS and restrict it to a subset?
    PDF is also a binary format.
    mikkupikku16 days ago
    I thought PDFs can contain arbitrary PS.
  - lmz16 days ago
    Compression filters are in PostScript.
ksec16 days ago
Why not zstd?
- PunchyHamster16 days ago
  incompetence
  - whizzx16 days ago
    You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/
    jeffbee16 days ago
    That mentions zstd in a weird incomplete sentence, but never compares it.
    F3nd016 days ago
    They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?
    eviks16 days ago
    Hey, they did all the work and more, trust them!!!
    > Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.
    LoganDark16 days ago
    I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.
    jsnell16 days ago
    But they didn't write "all". They wrote "other", which absolutely does not imply full coverage.
    Maybe read things a bit more carefully before going all out on the snide comments?
    LoganDark15 days ago
    It implies potential coverage of anything one could bring up. It creates a similar impression in my mind, because it becomes easy to claim you already considered something.
    wizzwizz416 days ago
    In fact, they wrote "reviewing […] other due diligence tasks", which doesn't imply any coverage! This close, literal reading is an appropriate – nay, the only appropriate – way to draw conclusions about the degree of responsibility exhibited by the custodians of a living standard. By corollary, any criticism of this form could be rebuffed by appeal to a sufficiently-carefully-written press release.
- HackerThemAll16 days ago
  I think this was the main reason (from the linked article) LOL:
  "Brotli is a compression algorithm developed by Google."
  They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.
  Sheer incompetence.
  - cortesoft16 days ago
    I can’t imagine the people actually doing the technical work don’t know about Zstandard.
  - mort9616 days ago
    EDIT: Something weird is going on here. When compressing with zstd in parallel it produces the garbage results seen here, but when compressing on a single core, it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158
    I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.
    I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------|------|-----|------|--------|
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
    Here's a table with all the files:
    +------+------+------+------+--------+
    | raw  | zstd | xz   | gzip | brotli |
    +------+------+------+------+--------+
    | 12K  | 12K  | 12K  | 12K  | 12K    |
    | 20K  | 20K  | 20K  | 20K  | 20K    | x5
    | 24K  | 20K  | 20K  | 20K  | 20K    | x5
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 32K  | 20K  | 20K  | 20K  | 20K    | x3
    | 32K  | 24K  | 24K  | 24K  | 24K    |
    | 40K  | 32K  | 32K  | 32K  | 32K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 48K  | 36K  | 36K  | 36K  | 36K    |
    | 48K  | 48K  | 48K  | 48K  | 48K    |
    | 76K  | 128K | 72K  | 72K  | 72K    |
    | 84K  | 140K | 84K  | 80K  | 80K    | x7
    | 88K  | 136K | 76K  | 76K  | 76K    |
    | 124K | 152K | 88K  | 92K  | 92K    |
    | 124K | 152K | 92K  | 96K  | 92K    |
    | 140K | 160K | 100K | 100K | 100K   |
    | 152K | 188K | 128K | 128K | 132K   |
    | 188K | 192K | 184K | 184K | 184K   |
    | 264K | 256K | 240K | 244K | 240K   |
    | 320K | 256K | 228K | 232K | 228K   |
    | 440K | 448K | 408K | 408K | 408K   |
    | 448K | 448K | 432K | 432K | 432K   |
    | 516K | 384K | 376K | 384K | 376K   |
    | 992K | 320K | 260K | 296K | 280K   |
    | 1.0M | 2.0M | 1.0M | 1.0M | 1.0M   |
    | 1.1M | 192K | 192K | 228K | 200K   |
    | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.2M | 1.1M | 1.0M | 1.0M | 1.0M   |
    | 1.3M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.7M | 2.0M | 1.7M | 1.7M | 1.7M   |
    | 1.9M | 960K | 896K | 952K | 916K   |
    | 2.9M | 2.0M | 1.3M | 1.4M | 1.4M   |
    | 3.2M | 4.0M | 3.1M | 3.1M | 3.0M   |
    | 3.7M | 4.0M | 3.5M | 3.5M | 3.5M   |
    | 6.4M | 4.0M | 4.1M | 3.7M | 3.5M   |
    | 6.4M | 6.0M | 6.1M | 5.8M | 5.7M   |
    | 9.7M | 10M  | 10M  | 9.5M | 9.4M   |
    +------+------+------+------+--------+
    Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.
    Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.
    I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p
    mort9616 days ago
    Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them which reports the size on disk, which is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.
    Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
    +---------+---------+--------+--------+--------+
    | none    | zstd    | xz     | gzip   | brotli |
    +---------|---------|--------|--------|--------|
    | 47.81M  | 37.92M  | 37.96M | 38.80M | 37.06M |
    +---------+---------+--------+--------+--------+
    These numbers are much more impressive. Still, Brotli has a slight edge.
    terrelln16 days ago
    > | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |
    Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?
    If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.
    mort9616 days ago
    Okay now this is weird.
    I can reproduce it just fine ... but only when compressing all PDFs simultaneously.
    To utilize all cores, I ran:
    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
    (and similar for the other formats).
    I ran this again and it produced the same 2M file from the source 1.1M file. However, when I run without parallelization:
    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
    That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).
    What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?
    terrelln16 days ago
    Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.
    So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.
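Taking the rule exactly as stated here (blocks of at most 128KB, each falling back to a raw block with a 3-byte header), the worst-case growth for an incompressible input is easy to bound. The sketch below is arithmetic only and makes two labeled assumptions: it ignores the frame header and optional checksum, and it counts at least one block even for tiny inputs. The function name is invented for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024   # maximum block payload, as described above
BLOCK_HEADER = 3          # bytes of header on each block, as described above


def worst_case_block_bytes(input_size: int) -> int:
    """Upper bound on block data if every block is stored uncompressed.

    Assumption: the frame header and optional checksum are ignored;
    they add a small fixed number of bytes on top of this.
    """
    blocks = max(1, math.ceil(input_size / BLOCK_SIZE))
    return input_size + blocks * BLOCK_HEADER


size = 1_153_434  # roughly 1.1 MiB, like the file in the table
overhead = worst_case_block_bytes(size) - size
print(f"{size} input bytes -> at most {overhead} bytes of block overhead")
```

Under that rule a 1.1M input cannot plausibly turn into a 2M output, which is why the reported size pointed to a measurement problem rather than to the compressor.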
    I will try to reproduce, but I suspect that there is something unrelated to zstd going on.
    What version of zstd are you using?
    mort9616 days ago
    'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495
    I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738
    terrelln16 days ago
    I've figured out the issue. Use `wc -c` instead of `du`.
    I can repro on my Mac with these steps with either `zstd` or `gzip`:
    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
    1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
    1240701 ksh.zst
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
    1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
    1246815 ksh.gz
    When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have run zstd's benchmark twice, and every other compressor's benchmark once.
    I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)
    mort9616 days ago
    Interesting!
    It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.
    My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...
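For benchmarks like this one, a few lines of Python can report both numbers side by side. This is a minimal sketch, assuming a Unix-like system where `os.stat` exposes `st_blocks` (which the Python documentation describes as the number of 512-byte blocks allocated); the `sizes` helper is invented for illustration.

```python
import tempfile
from pathlib import Path


def sizes(root: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) summed over files under root.

    apparent_bytes is what `wc -c` would add up to; allocated_bytes is
    derived from st_blocks (512-byte units, Unix only) and is the kind of
    figure plain `du` reports.
    """
    apparent = allocated = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            st = path.stat()
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return apparent, allocated


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.bin").write_bytes(b"x" * 5000)
    Path(tmp, "b.bin").write_bytes(b"y" * 123)
    apparent, allocated = sizes(tmp)
    print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

Compression ratios should be computed from the apparent size; the allocated size depends on the filesystem and, as this thread shows, can differ from it in either direction.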
    terrelln16 days ago
    Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.
    It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.
    mort9616 days ago
    Ross Burton on Mastodon suggests that it might be deduplication; when writing sequentially, later files can re-use blocks from earlier files, while that isn't the case as much when writing in parallel. That seems plausible enough to me.
    mort9616 days ago
    I've concluded that this can't be the reason. It'd only result in an error where the size reported by 'du' is smaller than the apparent size (aka number of bytes reported by 'wc -c') of the file. What we see here is that the size reported by 'du' is almost twice as large as the number of bytes. That can't be the result of deduplication.
    I'll chalk it up to "some APFS weirdness".
    Zekio16 days ago
    doesn't zstd cap out at compression level 19?
    mort9616 days ago
    From the man page:
    --ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.
    Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495
    gcr16 days ago
    If you're worried about double-compression of image data, you can uncompress all images by using qpdf:
    qpdf --stream-data=uncompress in.pdf out.pdf
    The resulting file should compress better with zstd.
    noname12016 days ago
    Why not use a more widespread compression algorithm (e.g. gzip) considering that Brotli barely performs better at all? Sounds like a pain for portability
    mort9616 days ago
    I'm not sold on the idea of adding compression to PDF at all, I'm not convinced that the space savings are worth breaking compatibility with older readers. Especially when you consider that you can just compress it in transit with e.g HTTP's 'Content-Encoding' without any special PDF reader support. (You can even use 'Content-Encoding: br' for brotli!)
    If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip to be honest, both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.
    But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
    ksec16 days ago
    >But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
    I may dislike Google. But my support of JPEG XL and Zstd has nothing to do with competition tech being Google at all. I simply think JPEG XL and Zstd are better technology.
    noname12016 days ago
    Could you add compression and decompression speeds to your table?
    mort9616 days ago
    I just did some interactive shell loops and globs to compress everything and output CSV which I processed into an ASCII table, so I don't exactly have a pipeline I can modify and re-run the tests with compression speeds added ... but I can run some more interactive shell-glob-and-loop-based analysis to give you decompression speeds:
    ~/tmp/pdfbench $ hyperfine --warmup 2 \
        'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
        'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
        'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
        'for x in br/*; do brotli -d >/dev/null <"$x"; done'
    Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.6 ms ±   1.3 ms    [User: 83.6 ms, System: 72.4 ms]
      Range (min … max):   162.0 ms … 166.9 ms    17 runs
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     143.0 ms ±   1.0 ms    [User: 87.6 ms, System: 43.6 ms]
      Range (min … max):   141.4 ms … 145.6 ms    20 runs
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     981.7 ms ±   1.6 ms    [User: 891.5 ms, System: 93.0 ms]
      Range (min … max):   978.7 ms … 984.3 ms    10 runs
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     254.5 ms ±   2.5 ms    [User: 172.9 ms, System: 67.4 ms]
      Range (min … max):   252.3 ms … 260.5 ms    11 runs
    Summary
      for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
        1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
        1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
        6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done
    As expected, xz is super slow. Gzip is fastest, zstd being somewhat slower, brotli slower again but still much faster than xz.
    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 143ms | 165ms | 255ms  | 982ms |
    +-------+-------+--------+-------+
    I honestly expected zstd to win here.
    noname12016 days ago
    Thanks a lot. Interestingly Brotli’s author mentioned here that zstd is 2× faster at decompressing, which roughly matches your numbers:
    https://news.ycombinator.com/item?id=46035817
    I’m also really surprised that gzip performs better here. Is there some kind of hardware acceleration or the like?
    terrelln16 days ago
    Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.
    mort9616 days ago
    It seems like zstd is somehow compressing really badly when many zstd processes are run in parallel, but works as expected when run sequentially: https://news.ycombinator.com/item?id=46723158
    Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:
    +-------+-------+--------+-------+
    | gzip  | zstd  | brotli | xz    |
    +-------+-------+--------+-------+
    | 142ms | 165ms | 269ms  | 994ms |
    +-------+-------+--------+-------+
    Raw hyperfine output:
    ~/tmp/pdfbench $ du -h zst2 gz xz br
    37M  zst2
    38M  gz
    38M  xz
    37M  br
    ~/tmp/pdfbench $ hyperfine ...
    Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
      Time (mean ± σ):     164.5 ms ±   2.3 ms    [User: 83.5 ms, System: 72.3 ms]
      Range (min … max):   162.3 ms … 172.3 ms    17 runs
    Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
      Time (mean ± σ):     142.2 ms ±   0.9 ms    [User: 87.4 ms, System: 43.1 ms]
      Range (min … max):   140.8 ms … 143.9 ms    20 runs
    Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
      Time (mean ± σ):     993.9 ms ±   9.2 ms    [User: 896.7 ms, System: 99.1 ms]
      Range (min … max):   981.4 ms … 1007.2 ms    10 runs
    Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
      Time (mean ± σ):     269.1 ms ±   8.8 ms    [User: 176.6 ms, System: 75.8 ms]
      Range (min … max):   261.8 ms … 287.6 ms    10 runs
    terrelln16 days ago
    Ah, I understand. In this benchmark, Zstd's decompression speed is 284 MB/s, and Gzip's is 330 MB/s. This benchmark is likely dominated by file IO for the faster decompressors.
    On the incompressible files, I'd expect decompression of any algorithm to approach the speed of `memcpy()`. And I would generally expect zstd's decompression speed to be faster. For example, on an x86 core running at 2GHz, Zstd is decompressing a file at 660 MB/s, and on my M1 at 1276 MB/s.
    You could measure locally either using a specialized tool like lzbench [0], or for zstd by just running `zstd -b22 --ultra /path/to/file`, which will print the compression ratio, compression speed, and decompression speed.
    [0] https://github.com/inikep/lzbench
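The throughput figures quoted above can be related to the hyperfine timings with simple arithmetic. The sketch below back-derives the total uncompressed size from the two quoted speeds; that size (roughly 47 MB) is an inference from the numbers, not something stated in the thread. It then applies the same size to all four tools. Because the timings include process startup and file IO, these are end-to-end rates, not pure decompressor speeds.

```python
# Mean wall-clock times from the second hyperfine run, in seconds.
mean_seconds = {"gzip": 0.1422, "zstd": 0.1645, "brotli": 0.2691, "xz": 0.9939}

# Throughputs quoted in the reply above, in MB/s.
quoted_mb_per_s = {"zstd": 284.0, "gzip": 330.0}


def implied_size_mb(seconds: float, mb_per_s: float) -> float:
    """Size that would produce the quoted throughput at the measured time."""
    return seconds * mb_per_s


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """End-to-end throughput for a given total size and wall-clock time."""
    return size_mb / seconds


# Assumption: average the two implied sizes to estimate the uncompressed total.
sizes = [implied_size_mb(mean_seconds[name], rate) for name, rate in quoted_mb_per_s.items()]
assumed_total_mb = sum(sizes) / len(sizes)
print(f"assumed uncompressed total: {assumed_total_mb:.1f} MB")

for name, secs in sorted(mean_seconds.items(), key=lambda kv: kv[1]):
    print(f"{name:>6}: {throughput_mb_per_s(assumed_total_mb, secs):7.1f} MB/s")
```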
gcr16 days ago
If we're making breaking changes to PDFs, I'd love if the committee added a modern image format like JPEG-XL. In my experience, most disk usage of PDFs comes from images, not streams.
I keep a bunch of comics in PDF but JPEG-XL is by far the best way to enjoy them in terms of disk space.
- Bolwin16 days ago
  Odd you should say that, as that's exactly what they've been discussing
  - gcr16 days ago
    No it's not. This article is about proposing Brotli as another possible '/Filter' for stream objects, like content streams (page drawing commands). Images are streams too, but unless you mean compressing raw pixel bytes in Brotli, there's no mention of a JPEG-XL or WEBP filter.
    NoahZuniga16 days ago
    well, not mentioned in this specific article. But JPEG-XL support is something they're working on [1].
    [1]: https://pdfa.org/wp-content/uploads/2025/10/PDFDays2025-Brea...
    gcr15 days ago
    Oh cool!! TIL
whinvik16 days ago
I am often frustrated by PDF issues such as how complicated it is to create one.
But reading the article I realized PDF has become ubiquitous because of its insistence on backwards compatibility. Maybe for some things it's good to move this slowly.
- jhealy16 days ago
  The article is wrong, the PDF spec has introduced breaking changes plenty of times. It’s done slowly and conservatively though, particularly now that the format is an ISO spec.
  The PDF format is versioned, and in the past new versions have introduced things like new types of encryption. It’s quite probable that a v1.7 compliant PDF won’t open on a reader app written when v1.3 was the latest standard.
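To make the versioning point concrete: a PDF file starts with a header comment of the form `%PDF-x.y`, which a reader can compare against the newest version it implements. The sketch below is a simplified check with hypothetical function names. It ignores the document catalog's `/Version` entry, which can override the header in later PDF versions, so treat it as an illustration only.

```python
import re
from typing import Optional, Tuple

_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    # Some readers tolerate junk before the header, so search the first 1 KiB.
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reader_supports(data: bytes, max_supported: Tuple[int, int]) -> bool:
    """True if the file's header version is no newer than the reader's."""
    version = pdf_header_version(data)
    return version is not None and version <= max_supported


sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"
print(pdf_header_version(sample))
print(reader_supports(sample, (1, 3)))
```

A reader written against v1.3 would see the 1.7 header here and know the file may use features it does not implement.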
nbevans16 days ago
This is a really really bad idea. Don't break backwards compat. for 20% of gains. Internet connection speeds and storage capacities only go up. In a few years time, 20% of gains will seem crazy to have broken back-compat for.
cess1116 days ago
'Your PDF:s will open slower because we decided that the CDN providers are more important than you'.
If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.
The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.
- noname12016 days ago
  Ridiculous statement. CDN providers can already use filesystem compression and standard HTTP Accept-Encoding compression for transfers (which includes brotli by the way). This ISO provides virtually no benefit to them
  - cess1116 days ago
    This reasoning comes from TFA.
h4x0rr16 days ago
Wouldn't lzma2 be better here since a pdf is more read heavy?
- F3nd016 days ago
  Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.
  [1] https://news.ycombinator.com/item?id=46035817
avalys16 days ago
This article is AI slop.
- jeffbee16 days ago
  Yep.
delfinom16 days ago
tl;dr Commercial entity is paying to have the ISO altered to "legalize" their SDK they are pushing which is incompatible with standard PDF readers.
ISO is pay to play so :shrug:
- whizzx16 days ago
  No, this feature is coming straight from the PDF Association itself, and we just added experimental support before it's officially in the spec to help testing between different SDK processors.
  So your comment is a falsehood
- lmz16 days ago
  It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.
  https://pdfa.org/brotli-compression-coming-to-pdf/
  > As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.
  > Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.
  - adrian_b16 days ago
    Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.
    MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow.
    It is my default PDF and EPUB reader. Only in the very rare cases where I encounter a PDF file that MuPDF cannot understand do I use other PDF readers (e.g. Okular).
- bhouston16 days ago
  I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. Probably can be added by AI without much difficulty - it is a simple feature. I think compared to the ton of other complex features PDF has, this is an easy one.
vgtftf16 days ago
[flagged]