Libbbf: Bound Book Format, A high-performance container for comics and manga (github.com)

106 points by zdw 18 days ago | 14 comments

dfajgljsldkjag18 days ago
The feature matrix says cbz/zip doesn't have random page access, but it definitely does. Zip also supports appending more files without too much overhead.
Certainly there's a complexity argument to be made, because you don't actually need compression just to hold a bundle of files. But these days zip just works.
The perf measurement charts also make no sense. What exactly are they measuring?
Edit:
This reddit post seems to go into more depth on performance: old.reddit.com/r/selfhosted/comments/1qi64pr/comment/o0pqaeo/
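To illustrate the random-access point, here is a minimal sketch using Python's standard `zipfile` module. The archive is built in memory, and the page names and contents are made up for the example; a real CBZ would hold image files.

```python
import io
import zipfile
import zlib

# Build a small in-memory CBZ-like archive (page names and data are invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    for i in range(1, 201):
        zf.writestr(f"page_{i:03d}.jpg", f"fake image data {i}".encode())

# Opening the archive parses the central directory stored at the end of the file.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("page_150.jpg")  # lookup by name, no scan of earlier pages
    page = zf.read(info)               # seeks to info.header_offset and reads one entry

# Each entry also carries its own CRC-32 in the directory.
crc_matches = zlib.crc32(page) == info.CRC
print(info.header_offset, hex(info.CRC), crc_matches)
```

Reading page 150 does not touch pages 1 to 149; the directory gives the byte offset directly.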
- creata18 days ago
  Zip also has per-asset checksums, contrary to the comparison table.
  And what's the point of aligning the files to be "DirectStorage-ready" if they're going to be JPEGs, a format that, as far as I know, DirectStorage doesn't understand?
  And the author says it's a problem that "Metadata isn't native to CBZ, you have to use a ComicInfo.xml file.", but... that's not a problem at all?
  The whole thing makes no sense.
  - gwern18 days ago
    It makes no sense because it's some degree of AI slop: https://reddit.com/r/selfhosted/comments/1qi64pr/i_got_into_...
    Note that he doesn't quite say, when asked pointblank how much AI he used in his erroneous microbenchmarking, that he didn't use AI: https://reddit.com/r/selfhosted/comments/1qi64pr/i_got_into_...
    Which explains all of it.
    Kudos to /u/teraflop, for having infinitely more patience with this than I would.
    snailmailman18 days ago
    That whole subreddit has unfortunately become inundated with AI slop.
    It used to be a decent resource to learn about what services people were self hosting. But now, many posts are variations of, “I’ve made this huge complicated app in an afternoon please install it on your server”. I’ve even seen a vibe-coded password manager posted there.
    Reputable alternatives to the software posted there exist a huge amount of the time. Not to mention audited alternatives in the case of password managers, or even just actively maintained alternatives.
    Semaphor18 days ago
    3 days ago the rules changed so that vibe-coded stuff is only allowed on Fridays.
    https://old.reddit.com/r/selfhosted/comments/1qfp2t0/mod_ann...
    Aransentin18 days ago
    I'm a moderator for a decently large programming subreddit, and I'd estimate that about half the project submissions are now obvious slop. You get a very good nose for sniffing that stuff out after a while, though it can be frustrating when you can't really convince other people beyond going "trust me, it's slop".
- usefulposter18 days ago
  Bullshit asymmetry by way of impulsive LLM slop strikes again.
  Every new readme, announcement post, and codebase is tailored to achieve maximum bloviation.
  No substance, no credibility———just vibes.
  - panja18 days ago
    If you read the reddit thread, it was coded by hand then only bug checked with ai.
    toyg17 days ago
    It was benchmarked with AI. Benchmarks being the main reason for this thing existing...
    usefulcat17 days ago
    After reading the reddit comments, it looks like a primary problem is that the author doesn't (didn't?) understand how to benchmark it correctly. Like comparing the time to mmap() a file with the time to actually read the same file. Not at all the same thing.
    For example: https://old.reddit.com/r/selfhosted/comments/1qi64pr/i_got_i...
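The mmap-versus-read distinction can be sketched in a few lines of Python. This is only an illustration under assumptions: the 16 MiB scratch file is arbitrary, it will most likely be served from the page cache because it was just written, and the timings will vary by machine, so no numbers are claimed here.

```python
import mmap
import os
import tempfile
import time

# Write a scratch file (16 MiB, an arbitrary size for this sketch).
size = 16 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(size))
    path = tmp.name

try:
    with open(path, "rb") as f:
        t0 = time.perf_counter()
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        t_map = time.perf_counter() - t0    # only sets up the mapping

        t0 = time.perf_counter()
        via_mmap = mapped[:]                # copies every byte, so every page is touched
        t_touch = time.perf_counter() - t0
        mapped.close()

    with open(path, "rb") as f:
        t0 = time.perf_counter()
        via_read = f.read()                 # ordinary read of the whole file
        t_read = time.perf_counter() - t0
finally:
    os.remove(path)

print(f"mmap() call only:  {t_map * 1e3:.3f} ms")
print(f"touching all data: {t_touch * 1e3:.3f} ms")
print(f"read() whole file: {t_read * 1e3:.3f} ms")
```

A fair comparison is between the second and third timings; the first measures almost no I/O at all.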
    Imustaskforhelp17 days ago
    I mean, it's open source so people can create benchmarks and independently verify if the AI was wrong and then have the claims be passed to the author.
    I haven't read the reddit thread or anything but if the author coded it by hand or is passionate about this project, he will probably understand what we are talking about.
    But I don't believe it's such a big deal to have a benchmark be written by AI though? no?
    gwern17 days ago
    > I mean, it's open source so people can create benchmarks and independently verify if the AI was wrong and then have the claims be passed to the author.
    Thank you for volunteering. I look forward to your results.
    Imustaskforhelp17 days ago
    > Thank you for volunteering. I look forward to your results.
    Sure can you wait a few weeks tho? I know nothing about benchmarking so gonna learn it first and I have a few tests to prepare for irl.
    I do feel like someone else more passionate about the project should try to pick the benchmarking though.
    I don't mind benchmarking it but I only know tools like hyper for benchmarks & I have played with my fair share of zip archives and their random access retrieval but I feel like even that would vary from source to source.
    There are some experienced people in here who are really cool at what they do, I just wanted to say that if someone's interested and already has the Domain Specific knowledge to benchmark & they enjoy it in the first place, this having AI benchmark shouldn't be much of a problem in comparison.
    CyberDildonics17 days ago
    Why would someone spend their time checking someone else's AI slop when that person couldn't even be bothered to write the basic checks that prove their project was worthwhile?
its-summertime18 days ago
Thinking more about this: ZIP files can be set up to have the data on whatever alignment of one's choosing (as noted in the reddit thread). Integrity checks can be done in parallel by doing them in parallel. mmap is possible just by not using zip compression.
As for integrity-checking speed in a saturated context (N workers, whether that's multiple workers per file or one worker per file), CRC32(C) seems to be nearly twice as fast: https://btrfs.readthedocs.io/en/latest/Checksumming.html
ZIP can also support arbitrary metadata.
I think this could have all been backported to ZIP files themselves
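A rough sketch of the "mmap is possible just by not using zip compression" point, in Python. It assumes entries written with `ZIP_STORED`; the file name, page names, and payload are invented for the example. The data offset is computed from the ZIP local file header, whose fixed part is 30 bytes followed by the file name and the extra field.

```python
import mmap
import os
import struct
import tempfile
import zipfile

payload = b"\xff\xd8 pretend this is JPEG data \xff\xd9"  # made-up page bytes

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.cbz")
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("page_001.jpg", b"first page")
        zf.writestr("page_002.jpg", payload)

    # The central directory tells us where the entry's local header starts.
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo("page_002.jpg")

    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Name length and extra-field length sit at bytes 26..29 of the local header.
        name_len, extra_len = struct.unpack_from("<HH", mapped, info.header_offset + 26)
        start = info.header_offset + 30 + name_len + extra_len
        # Slicing copies into bytes here; a memoryview over the same range would avoid the copy.
        page_bytes = mapped[start:start + info.file_size]
        mapped.close()

print(start, page_bytes == payload)
```

Because a stored entry is just the raw file bytes at a known offset, the mapped range can be handed to an image decoder as-is. Padding the extra field is one way a writer could push `start` onto a chosen alignment.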
grumbel18 days ago
This feels like the wrong end to optimize. Zip is plenty fast, especially when it comes to a few hundred pages of a comic. Meanwhile the image decoding can take a while when you want to have a quick thumbnail overview showing all those hundred pages at once. No comic/ebook software I have ever touched has managed to match the responsiveness of an actual book where you can flip through those hundreds of pages in a second with zero loading time, despite it being somewhat trivial to implement when you generate the necessary thumbnail/image-pyramid data first.
A multi-resolution image format would make more sense than optimizing the archive format. There would also be room for additional features like multi-language support, searchable text, … that the current "jpg in a zip" doesn't handle (though one might end up reinventing DJVU here).
- fc417fc80217 days ago
  > A multi-resolution image format
  There are already quite a few cbz archives in the wild that contain jxl encoded images. That's a multi-resolution format at least to the extent that it supports progressive decoding at fixed levels that range from 1:8 to as high as 1:4096. I think it might also support other arbitrary ratios subject to certain encoding constraints but I'm less clear on that.
  Readers might need to be updated to make use of the feature in an intelligent manner though. The jxl cbzs I've encountered either didn't make use of progressive encoding or else the software I used failed to take advantage of it - I'm not sure which.
its-summertime18 days ago
https://www.reddit.com/r/selfhosted/comments/1qi64pr/i_got_i...
- wernsey18 days ago
  Maybe you should quote the full title of that post:
  "I got into an argument on Discord about how inefficient CBR/CBZ is, so I wrote a new file format. It's 100x faster than CBZ."
  It has some charts, notes and comments
  Here's the old.reddit link: https://old.reddit.com/r/selfhosted/comments/1qi64pr/i_got_i...
riffraff18 days ago
At a glance this looks like an obviously nicer format than a zip of jpegs, but I struggle to think of a time I thought "wow CBZ is a problem here".
I didn't even realize random access is not possible, presumably because readers just support it by linear scanning or putting everything in memory at once, and comic size is peanuts compared to modern memory size.
I suppose this becomes more useful if you have multiple issues/volumes in a single archive.
- aidenn018 days ago
  Random access is completely possible within a zip, to the degree that it's needed for cbz; you might not be able to randomly access within a file, if for some reason the cbz was stored with deflate on a jpeg, but you can always access individual files independently of each other, so seeking to a random page is O(1).
  - formerly_proven18 days ago
    ZIP literally has a central directory.
    I don’t understand what’s the point of any of this over a minimal subset of PDF (one image per page).
PufPufPuf18 days ago
"Native Data Deduplication" not supported in CBZ/CBR? But those are just ZIP/RAR, which are compression formats, deduplication is their whole deal...?
- greysonp17 days ago
  They may be referring to the fact that ZIP compresses each file individually. It can't compress across files. I think RAR does compress across files though.
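A small Python sketch of the per-file compression behaviour described above. The 16 KiB random blob stands in for an already-compressed page (an assumption for the demo), and the "solid" case is approximated with a single `zlib` stream rather than real RAR. The blob is kept small on purpose, because DEFLATE can only refer back within a 32 KiB window.

```python
import io
import os
import zipfile
import zlib

page = os.urandom(16 * 1024)  # stand-in for an already-compressed image

# ZIP: the two identical entries are compressed independently of each other.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("page_001.jpg", page)
    zf.writestr("page_001_copy.jpg", page)
zip_size = len(buf.getvalue())

# Single stream over both copies: the second copy can be encoded as
# back-references to the first, since it falls inside the 32 KiB window.
solid_size = len(zlib.compress(page + page, 9))

print(f"zip with two entries: {zip_size} bytes")
print(f"one stream over both: {solid_size} bytes")
```

The ZIP ends up roughly twice the size of the single stream, which is the sense in which ZIP has no deduplication across files.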
remix200018 days ago
I thought zips already support random access?
Am4TIfIsER0ppos17 days ago
What's wrong with every page being a separate image file on your disk?
lsbehe18 days ago
Why are the metadata blocks the way they are? I see you used pack directives but there already are plenty of padding and reserved bits. A 19 byte header just seems wrong. https://github.com/ef1500/libbbf/blob/b3ff5cb83d5ef1d841eca1...
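For readers wondering how a packed struct ends up at an odd size like 19 bytes, here is a sketch with Python's `struct` module. The field layout below is hypothetical and NOT taken from libbbf; it was picked only because it happens to pack to 19 bytes.

```python
import struct

# Hypothetical layout, not libbbf's: u8 kind, u64 offset, u64 length, u16 flags.
LAYOUT = "BQQH"

packed_size = struct.calcsize("<" + LAYOUT)  # "<": little-endian, no padding
native_size = struct.calcsize("@" + LAYOUT)  # "@": native alignment, padding inserted

# Serializing field by field gives the same bytes on every platform,
# unlike dumping an in-memory struct whose padding depends on the compiler.
header = struct.pack("<" + LAYOUT, 1, 4096, 123456, 0)
kind, offset, length, flags = struct.unpack("<" + LAYOUT, header)

print(packed_size, native_size, len(header))
```

With native alignment the 64-bit fields are moved to aligned offsets, so the unpacked size grows; the exact native size depends on the platform.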
yonisto18 days ago
Honest question, something I don't understand: if you use DirectStorage to move images directly to the GPU (I assume into VRAM), where does the decoding take place? Directly on the GPU? Can a GPU decode PNG? It is a very unfriendly format for GPUs as far as I know.
- PufPufPuf18 days ago
  From the readme: > Note: DirectStorage isn't avaliable for images yet (as far as I know), but I've made sure to accomodate such a thing in the future with this format.
  So the whole DirectStorage thing is just a nothingburger. The author glosses over the fact that decoding images on GPU is not possible (or at least very impractical).
  - zigzag31217 days ago
    It seems that at least JPEG can be decoded on the GPU [1] [2]
    [1] https://docs.nvidia.com/cuda/nvjpeg/index.html
    [2] https://github.com/CESNET/GPUJPEG
  - fc417fc80217 days ago
    The entire thing makes no sense. How many images per second do you need to decode here? How big is the archive even?
    It would be one thing if you were designing a format to optimize feeding data to an ML model during training but that's not even remotely what this is supposed to be.
  - yonisto17 days ago
    The note was added after I posted the question. It really didn't make any sense to me
chromehearts18 days ago
But with which library are you able to host these? And which scraper currently finds manga with chapters in that file format? does anybody have experience hosting their own manga server & downloading them?
jmillikin18 days ago
I use CBZ to archive both physical and digital comic books so I was interested in the idea of an improved container format, but the claimed improvements here don't make sense.
---
For example they make a big deal about each archive entry being aligned to a 4 KiB boundary "allowing for DirectStorage transfers directly from disk to GPU memory", but the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.
Furthermore the README says "While folders allow memory mapping, individual images within them are rarely sector-aligned for optimized DirectStorage throughput" which ... what? If an image file needs to be sector-aligned (!?) then a BBF file would also need to be, else the 4 KiB alignment within the file doesn't work, so what is special about the format that causes the OS to place its files differently on disk?
Also in the official DirectStorage docs (https://github.com/microsoft/DirectStorage/blob/main/Docs/De...) it says this:
```
  > Don't worry about 4-KiB alignment restrictions
  > * Win32 has a restriction that asynchronous requests be aligned on a
  >   4-KiB boundary and be a multiple of 4-KiB in size.
  > * DirectStorage does not have a 4-KiB alignment or size restriction. This
  >   means you don't need to pad your data which just adds extra size to your
  >   package and internal buffers.
```
Where is the supposed 4 KiB alignment restriction even coming from?
There are zip-based formats that align files so they can be mmap'd as executable pages, but that's not what's happening here, and I've never heard of a JPEG/PNG/etc image decoder that requires aligned buffers for the input data.
Is the entire 4 KiB alignment requirement fictitious?
---
The README also talks about using xxhash instead of CRC32 for integrity checking (the OP calls it "verification"), claiming this is more performant for large collections, but this is insane:
```
  > ZIP/RAR use CRC32, which is aging, collision-prone, and significantly slower
  > to verify than XXH3 for large archival collections.  
  > [...]  
  > On multi-core systems, the verifier splits the asset table into chunks and
  > validates multiple pages simultaneously. This makes BBF verification up to
  > 10x faster than ZIP/RAR CRC checks.
```
CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation. Assuming 100 GiB/s throughput, a typical comic book page (a few megabytes) will take like ... a millisecond? And there's no data dependency between file content checksums in the zip format, so for a CBZ you can run the CRC32 calculations in parallel for each page just like BBF says it does.
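The claim that per-page CRC checks in a zip are independent can be sketched in Python with a thread pool. The archive below is synthetic and the worker count is arbitrary. As far as I know, CPython's `zlib.crc32` releases the GIL for larger buffers, so threads can overlap here; treat that as an assumption rather than a measured result.

```python
import io
import zipfile
import zlib
from concurrent.futures import ThreadPoolExecutor

# Synthetic archive: 8 stored "pages" of 100 kB each (contents are invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    for i in range(8):
        zf.writestr(f"page_{i:03d}.jpg", bytes([i]) * 100_000)
archive = buf.getvalue()


def check_entry(name: str) -> bool:
    """Recompute one entry's CRC-32 and compare it with the stored value."""
    # Each worker opens its own ZipFile so no file position is shared.
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
        # zf.read() already raises on a CRC mismatch; the explicit
        # comparison is kept to make the check visible.
        return zlib.crc32(zf.read(name)) == info.CRC


with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    names = zf.namelist()

# No entry depends on any other, so the checks can run in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(names, pool.map(check_entry, names)))

print(results)
```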
But that doesn't matter because to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash. Checksum each archive (not each page), store that checksum as a `.sha256` file (or whatever), and now you can (1) use normal tools to check that your archives are intact, and (2) record those checksums as metadata in the blob storage service you're using.
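A minimal sketch of the per-archive `.sha256` sidecar idea, using only Python's standard library. The helper names (`sha256_file`, `write_sidecar`, `verify_sidecar`) and the file name are made up for this example; the sidecar line uses the "digest, two spaces, file name" layout that tools like `sha256sum` read.

```python
import hashlib
import os
import tempfile


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path):
    """Write '<digest>  <name>' next to the archive and return the sidecar path."""
    sidecar = path + ".sha256"
    with open(sidecar, "w", encoding="ascii") as f:
        f.write(f"{sha256_file(path)}  {os.path.basename(path)}\n")
    return sidecar


def verify_sidecar(path):
    """Return True if the archive still matches its recorded digest."""
    with open(path + ".sha256", encoding="ascii") as f:
        expected = f.read().split()[0]
    return sha256_file(path) == expected


with tempfile.TemporaryDirectory() as d:
    archive = os.path.join(d, "volume_01.cbz")   # placeholder file, not a real CBZ
    with open(archive, "wb") as f:
        f.write(b"not a real archive")
    write_sidecar(archive)
    intact = verify_sidecar(archive)             # unchanged file
    with open(archive, "ab") as f:
        f.write(b"\x00")                         # simulate corruption
    damaged = verify_sidecar(archive)

print(intact, damaged)
```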
---
The Reddit thread has more comments from people who have noticed other sorts of discrepancies, and the author is having a really difficult time responding to them in a coherent way. The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.
- zigzag31217 days ago
  > the pages within a CBZ are going to be encoded (JPEG/PNG/etc) rather than just being bitmaps. They need to be decoded first, the GPU isn't going to let you create a texture directly from JPEG data.
  It seems that JPEG can be decoded on the GPU [1] [2]
  > CRC32 is limited by memory bandwidth if you're using a normal (i.e. SIMD) implementation.
  According to smhasher tests [3] CRC32 is not limited by memory bandwidth. Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128 bit wide results), we still don't get close to memory bandwidth.
  The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.
  > to actually check the integrity of archived files you want to use something like sha256, not CRC32 or xxhash
  Why would you need to use a cryptographic hash function to check integrity of archived files? A quality non-cryptographic hash function will detect corruptions due to things like bit-rot, bad RAM, etc. just the same.
  And why is 256 bits needed here? Kopia developers, for example, think 128 bit hashes are big enough for backup archives [4].
  [1] https://docs.nvidia.com/cuda/nvjpeg/index.html
  [2] https://github.com/CESNET/GPUJPEG
  [3] https://github.com/rurban/smhasher
  [4] https://github.com/kopia/kopia/issues/692
  - myrmidon17 days ago
    Maybe the CRC32 implementations in the smhasher suite just aren't that fast?
    [1] claims 15 GB/s for the slowest implementation (Chromium) they compared (all vectorized).
    > The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.
    Why? What kind of error rate do you expect, and what kind of reliability do you want to achieve? Assumptions that would lead to a >32bit checksum requirement seem outlandish to me.
    [1] https://github.com/corsix/fast-crc32?tab=readme-ov-file#x86_...
    zigzag31217 days ago
    From the SMHasher test results, the quality of xxhash seems higher. It has less bias / higher uniformity than CRC.
    What bothers me with probability calculations, is that they always assume perfect uniformity. I've never seen any estimates how bias affects collision probability and how to modify the probability formula to account for non-perfect uniformity of a hash function.
    jmillikin17 days ago
    It doesn't matter, though. xxhash is better than crc32 for hashing keys in a hash table, but both of them are inappropriate for file checksums -- especially as part of a data archival/durability strategy.
    It's not obvious to me that per-page checksums in an archive format for comic books are useful at all, but if you really wanted them for some reason then crc32 (fast, common, should detect bad RAM or a decoder bug) or sha256 (slower, common, should detect any change to the bitstream) seem like reasonable choices and xxhash/xxh3 seems like LARPing.
    wyldfire17 days ago
    > both of them are inappropriate for file checksums
    CRCs like CRC32 were born for this kind of work. CRCs detect corruption when transmitting/storing data. What do you mean when you say that it's inappropriate for file checksums? It's ideal for file checksums.
    minitech17 days ago
    Uniformity isn’t directly important for error detection. CRC-32 has the nice property that it’s guaranteed to detect all burst errors up to 32 bits in size, while hashes do that with probability at best 2^−b of course. (But it’s valid to care about detecting larger errors with higher probability, yes.)
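The burst-error property can be spot-checked with Python's `zlib.crc32`. This is a sketch, not a proof: it flips random nonzero patterns confined to 4 consecutive bytes (so each error spans at most 32 bits) and counts how many go unnoticed. The buffer size, trial count, and seed are arbitrary choices.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
data = bytes(rng.randrange(256) for _ in range(4096))
original_crc = zlib.crc32(data)

undetected = 0
for _ in range(2000):
    # A nonzero 32-bit pattern XORed into 4 consecutive bytes: a burst of at most 32 bits.
    pattern = rng.randrange(1, 1 << 32).to_bytes(4, "big")
    offset = rng.randrange(0, len(data) - 3)
    corrupted = bytearray(data)
    for i in range(4):
        corrupted[offset + i] ^= pattern[i]
    if zlib.crc32(bytes(corrupted)) == original_crc:
        undetected += 1

print(f"undetected bursts: {undetected} of 2000")
```

For CRC-32 this count is zero by construction, whereas a generic 32-bit hash would only be expected to catch each such error with high probability.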
    zigzag31217 days ago
    > Uniformity isn’t directly important for error detection.
    Is there any proof of this? I'm interested in reading more about it.
    > detect all burst errors up to 32 bits in size
    What if errors are not consecutive bits?
    minitech17 days ago
    There’s a whole field’s worth of really cool stuff about error correction that I wish I knew a fraction of enough to give reading recommendations about, but my comment wasn’t that deep – it’s just that in hashes, you obviously care about distribution because that’s almost the entire point of non-cryptographic hashes, and in error correction you only care that x ≠ y implies f(x) ≠ f(y) with high probability, which is only directly related in the obvious way of making use of the output space (even though it’s probably indirectly related in some interesting subtler ways).
    E.g. f(x) = concat(xxhash32(x), 0xf00) is just as good at error detection as xxhash32 but is a terrible hash, and, as mentioned, CRC-32 is infinitely better at detecting certain types of errors than any universal hash family.
    zigzag31217 days ago
    This seems to make sense, but I need to read more about error correction to fully understand it. I was considering possibility that data could also contain patterns where error detection performs poorly due to bias, and I haven't seen how to include these estimates in probability calculations.
  - fc417fc80217 days ago
    > The 32 bit hash of CRC32 is too low for file checksums.
    What makes you say this? I agree that there are better algorithms than CRC32 for this usecase, but if I was implementing something I'd most likely still truncate the hash to somewhere in the same ballpark (likely either 32, 48, or 64 bits).
    Note that the purpose of the hash is important. These aren't being used for deduplication where you need a guaranteed unique value between all independently queried pieces of data globally but rather just to detect file corruption. At 32 bits you have only a 1 out of 2^(32-1) chance of a false negative. That should be more than enough. By the time you make it to 64 bits, if you encounter a corrupted file once _every nanosecond_ for the next 500 years or so you would expect to miss only a single event. That is a rather absurd level of reliability in my view.
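The 64-bit estimate can be checked with a few lines of arithmetic. The sketch assumes a simple model in which each corrupted file slips past a b-bit checksum with probability 2^-b; that model is an assumption of this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# One corruption event every nanosecond, sustained for 500 years.
events = 500 * SECONDS_PER_YEAR * 1e9

# Assumed model: each event escapes a 64-bit checksum with probability 2**-64.
expected_misses_64 = events / 2**64

print(f"corruption events: {events:.3e}")
print(f"expected undetected events at 64 bits: {expected_misses_64:.2f}")
```

Under that model the expected number of missed events over the whole period comes out just under one.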
    zigzag31217 days ago
    I've seen a few arguments that with the amount of data we have today the 2^(32-1) chance can happen, but I can't vouch that their calculations were done correctly.
    Readme in SMHasher test suite also seems to indicate that 32 bits might be too few for file checksums:
    "Hash functions for symbol tables or hash tables typically use 32 bit hashes, for databases, file systems and file checksums typically 64 or 128bit, for crypto now starting with 256 bit."
    fc417fc80217 days ago
    That's vaguely describing common practices, not what's actually necessary or why. It also doesn't address my note that the purpose of the hash is important. Are "file systems" and "file checksums" referring to globally unique handles, content addressed tables, detection of bitrot, or something else?
    For detecting file corruption the amount of data alone isn't the issue. Rather what matters is the rate at which corruption events occur. If I have 20 TiB of data and experience corruption at a rate of only 1 event per TiB per year (for simplicity assume each event occurs in a separate file) that's only 20 events per year. I don't know about you but I'm not worried about the false negative rate on that at 32 bits. And from personal experience that hypothetical is a gross overestimation of real world corruption rates.
    zigzag31217 days ago
    It depends on how you calculate statistics. If you are designing a file format that over the lifetime of the format hundreds of millions of users will use (storing billions of files), what are the chances that a 32 bit checksum won't be able to catch at least one corruption? During transfer over an unstable wireless internet connection, storage on a cheap flash drive, a poor HDD with a higher error rate, unstable RAM etc. We want to avoid data corruption if we can, even in less than ideal conditions. The cost of going from 32 bit to 64 bit hashes is very small.
    fc417fc80217 days ago
    No, it doesn't "depend on how you calculate statistics". Or rather you are not asking the right question. We do not care if a different person suffers a false negative. The question is if you, personally, are likely to suffer a false negative. In other words, will any given real world deployment of the solution be expected to suffer from an unacceptably high rate of false negatives?
    Answering that requires figuring out two things. The sort of real world deployment you're designing for and what the acceptable false negative rate is. For an extremely conservative lower bound suppose 1 error per TiB per year and suppose 1000 TiB of storage. That gives a 99.99998% success rate for any given year. That translates to expecting 1 false negative every 4 million years.
    I don't know about you but I certainly don't have anywhere near a petabyte of data, I don't suffer corruption at anywhere near a rate of 1 event per TiB per year, and I'm not in the business of archiving digital data on a geological timeframe.
    32 bits is more than fit for purpose.
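The figures in the estimate above can be reproduced directly. The sketch assumes each corruption event escapes a 32-bit checksum with probability 2^-32; the event rate and storage size are the ones given in the comment.

```python
# Inputs from the comment: 1 corruption event per TiB per year, 1000 TiB stored.
events_per_year = 1 * 1000

# Assumed model: each event escapes a 32-bit checksum with probability 2**-32.
p_miss = 2**-32

misses_per_year = events_per_year * p_miss
success_rate = 1 - misses_per_year
years_per_miss = 1 / misses_per_year

print(f"yearly success rate: {success_rate * 100:.5f}%")
print(f"one expected miss every {years_per_miss / 1e6:.1f} million years")
```

This gives a yearly success rate that rounds to 99.99998% and roughly one expected miss every 4.3 million years, in line with the numbers quoted.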
    zigzag31217 days ago
    I can't say I agree with your logic here. We are not talking about any specific backup or anything like that. We are talking about the design of a file format that is going to be used globally.
    Business running a lottery has to calculate the odds of anyone winning, not just the odds of a single person winning. Same, a designer of a file format has to consider chances for all users. What percent of users will be affected by any design decision.
    For example, what if you would offer a guarantee that 32 bit hash will protect you from corruption, and compensate generously anyone who would get this type of corruption; how would you calculate probability then?
    fc417fc80217 days ago
    If you offer compensation then of course you need to consider your risk exposure, ie total users. That's similar to a lottery where the central authority is concerned with all payouts while an individual is only concerned with their own payout.
    Outside of brand reputation issues that is not how real world products are designed. You design a tool for the specific task it will be used for. You don't run your statistics in aggregate based on the expected number of customers.
    Users are independent from one another. If the population doubles my filesystem doesn't suddenly become less reliable. If more people purchase the same laptop that I have the chance of mine failing doesn't suddenly go up. If more people deep fry things in their kitchen my own personal risk of a kitchen fire isn't increased regardless of how busy the fire department might become.
  - jmillikin17 days ago
    > It seems that JPEG can be decoded on the GPU [1] [2]
    Sure, but you wouldn't want to. Many algorithms can be executed on a GPU via CUDA/ROCm, but the use cases for on-GPU JPEG/PNG decoding (mostly AI model training? maybe some sort of giant megapixel texture?) are unrelated to anything you'd use CBZ for.
    For a comic book the performance-sensitive part is loading the current and adjoining pages, which can be done fast enough to appear instant on the CPU. If the program does bulk loading then it's for thumbnail generation which would also be on the CPU.
    Loading compressed comic pages directly to the GPU would be if you needed to ... I dunno, have some sort of VR library browser? It's difficult to think of a use case.
    > According to smhasher tests [3] CRC32 is not limited by memory bandwidth. Even if we multiply CRC32 scores x4 (to estimate 512 bit wide SIMD from 128 bit wide results), we still don't get close to memory bandwidth.
    Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput (I see stuff about the C++ STL in the logs).
    Look at https://github.com/corsix/fast-crc32 for example, which measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems. Obviously if you solder a Raspberry Pi to some GDDR then the ratio differs.
    > The 32 bit hash of CRC32 is too low for file checksums. xxhash is definitely an improvement over CRC32.
    You don't want to use xxhash (or crc32, or cityhash, ...) for checksums of archived files, that's not what they're designed for. Use them as the key function for hash tables. That's why their output is 32- or 64-bits, they're designed to fit into a machine integer.
    File checksums don't have the same size limit so it's fine to use 256- or 512-bit checksum algorithms, which means you're not limited to xxhash.
    > Why would you need to use a cryptographic hash function to check integrity of archived files? A quality non-cryptographic hash function will detect corruptions due to things like bit-rot, bad RAM, etc. just the same.
    I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.
    > And why is 256 bits needed here? Kopia developers, for example, think 128 bit hashes are big enough for backup archives [4].
    The checksum algorithm doesn't need to be cryptographically strong, but if you're using software written in the past decade then SHA256 is supported everywhere by everything so might as well use it by default unless there's a compelling reason not to.
    For archival you only need to compute the checksums on file transfer and/or periodic archive scrubbing, so the overhead of SHA256 vs SHA1/MD5 doesn't really matter.
    I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.
    zigzag31217 days ago
    > which can be done fast enough to appear instant on the CPU
    Big scanned PDFs can be problematic and could benefit from more efficient processing (if there were HW support for such a technique).
    > Your link shows CRC32 at 7963.20 MiB/s (~7.77 GiB/s) which indicates it's either very old or isn't measuring pure CRC32 throughput
    It may not be the fastest implementation of CRC32, but it was also run on an old Ryzen 5 3350G at 3.6GHz. Below the table are results from different HW. On an Intel i7-6820HQ, CRC32 achieves 27.6 GB/s.
    > measures 85 GB/s (GB, GiB, eh close enough) on the Apple M1. That's fast enough that I'm comfortable calling it limited by memory bandwidth on real-world systems.
    That looks incredibly suspicious since Apple M1 has maximum memory bandwidth of 68.25 GB/s [1].
    > I have personally seen bitrot and network transmission errors that were not caught by xxhash-type hash functions, but were caught by higher-level checksums. The performance properties of hash functions used for hash table keys make those same functions less appropriate for archival.
    Your argument is meaningless without more details. xxhash supports 128 bits, which I doubt would fail to catch an error in your case.
    SHA256 is an order of magnitude or more slower than non-cryptographic hashes. In my experience archival process usually has big enough effect on performance to care about it.
    I'm beginning to suspect your primary reason for disliking xxhash is that it's not a de facto standard like CRC or SHA. I agree that this is a big one, but you constantly imply that there's more to why xxhash is bad. Maybe my knowledge is lacking, care to explain? Why wouldn't 128 bit xxhash be more than enough for checksums of files? AFAIK the only thing it doesn't do is protect you against tampering.
    > I don't know what kopia is, but according to your link it looks like their wire protocol involves each client downloading a complete index of the repository content, including a CAS identifier for every file. The semantics would be something like Git? Their list of supported algorithms looks reasonable (blake, sha2, sha3) so I wouldn't have the same concerns as I would if they were using xxhash or cityhash.
    Kopia uses hashes for block level deduplication. What would be the issue if they used 128 bit xxhash instead of a 128 bit cryptographic hash like they do now (if we assume we don't need protection from tampering)?
    [1] https://en.wikipedia.org/wiki/Apple_M1
    minitech17 days ago
    > What would be the issue if they used 128 bit xxhash instead of a 128 bit cryptographic hash like they do now (if we assume we don't need protection from tampering)?
    malicious block hash collisions where the colliding block was introduced by some way other than tampering (e.g. storing a file created by someone else)
    zigzag31217 days ago
    That's a good example. Thanks! It would be kind of an indirect tampering method.
- creata18 days ago
  > The most charitable interpretation is that this whole project (supposed problems with CBZ, the readme, the code) is the output of an LLM.
  Do LLMs perform de/serialization by casting C structs to char-pointers? I would've expected that to have been trained out of them. (Which is to say: lots of it is clearly LLM-generated, but at least some of the code might be human.)
  Anyway, I hope that the person who published this can take all the responses constructively. I know I'd feel awful if I was getting so much negative feedback.
sedatk18 days ago
> Footer indexed
So, like ZIP?
> Uses XXH3 for integrity checks
I don’t think XXH3 is suitable for that purpose. It’s not cryptographically secure and designed mostly for stuff like hash tables (e.g. relatively small data).
- zigzag31218 days ago
  Why would it need to be cryptographically secure for this use case?
  - rovr13817 days ago
    If the data is big enough, collisions. Right?
    zigzag31217 days ago
    Then you just need a bigger hash.
- MallocVoidstar18 days ago
  > It’s not cryptographically secure
  Neither is CRC32. I'm pretty sure xxhash is a straight upgrade compared to CRC32.
  - myrmidon17 days ago
    > I'm pretty sure xxhash is a straight upgrade compared to CRC32.
    Unclear; performance should be pretty similar to CRC32 (depending on implementation), and since integrity checking can basically be done at RAM read speeds this should not matter either way.
aidenn018 days ago
I assume the comparison table is supposed to have something other than footnotes (e.g. check-marks or X's)? That's not showing for me on Firefox
- QuantumNomad_18 days ago
  There are emojis in the table for green check marks, red crosses, and yellow warning signs.
  Do the emojis not show for you?
  - aidenn018 days ago
    They do not.
    [edit]
    If I download the README I can see them in every program on my system except Firefox. I previously had issues with CJK only not displaying in Firefox, so there's probably some workaround specific to it...
    [Edit 2] If Firefox uses "Noto Color Emoji" (which Firefox seems to use as fallback for any font that doesn't have Emoji characters; fc-match shows a different result for e.g. :charset=2705) then I get nothing, but if I force a font that has the emoji in it (e.g. FreeSerif) then it renders. Weird.
- leosanchez18 days ago
  They are just below the table.