But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
It might be slightly more instructions than some other VL (variable-length) integer codec choices, but overall I don't think it's more difficult.
The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.
I spent WAYYYYYYYY too much time exploring this...
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.