Also, nncp-2024-06-05.tar.gz is just 1180969 bytes, unlike ts_zip-2024-03-02.tar.gz (159228453 bytes, which is bigger than uncompressed enwiki8).
The results at https://www.mattmahoney.net/dc/text.html explicitly add the size of the compressor itself to the result. Note the "enwik9+prog" column. That's what it's ranked on.
The reason to do this is that it's trivial to create a compressor that 'compresses' a file to 0 bytes. Just have an executable with a dictionary of enwik9 that writes that out given any input. So we always measure what is effectively the Kolmogorov complexity. The data+program as a whole that produces the result we want.
So those results add in the compressor size. The programs there generally have no dictionary built in or in the case of LLM based compressors, no pre-trained data. They effectively build the model as they process data. Not compressing much at all at the start and slowly compressing better and better as they go. This is why these programs do better and better with larger data sets. They start with 0 knowledge. After a GB or so they have very good knowledge of the corpus of human language.
This program here however is pre-trained and shipped with a model. It's 150MB in size! This means it has 150MB of extra starting knowledge over those models in that list. The top models in that list are the better compressors, they'll quickly out learn and overtake this compressor but they just don't have that headstart.
Of course measuring fairly this should be listed with that 150MB program size added to the results when doing a comparison.
A Turing Machine compressor program would likely have more bytes than the amd64 binary. So how to evaluate KolmogorovComplexity(amd64)?
The laws of physics somehow need to be accounted for too, probably.
Hopefully :-)
That is: no compression, but it won't make things worse either.
Unless the input data is the digits of pi, obviously, or the result of some computation involving pi.
You can use all the math stuff like scientific notation, tetration, etc... but it won't help you make things smaller.
Math notation is a form of compression. 10^9 is 1000000000, compressed. But the offset into pi is effectively a random number, and you can't compress random numbers no matter what technique you use, including math notation.
This can be formalized and mathematically proven. The only thing wrong here is that pi is not a random number, but unless you are dealing with circles, it looks a lot like it, so while unproven, I think it is a reasonable shortcut.
On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.
The relevance to this discussion is that the LLM weights would have to be included as static data, since the only way to regenerate them is from the initial training data, which is much larger than the resulting model. By comparison, pi based compression is the other way around: since pi is a natural constant, if your decompressor requires (say) a trillion digits of pi, you could write a relatively small program (a few kb) to generate them. It would be terribly slow, but it wouldn't affect your compression ratio much.
Ha, next: a compression algorithm that requires the user to first build an infinite library...
Well, either your program 'works', or you will have discovered a major new insight about Pi.
> On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.
And that's the only way to do this fairly, if you are running a competition where you only have a single static corpus to compress.
It would be more interesting and would make the results more useful, if the texts to be compressed would be drawn from a wide probability distribution, and then we scored people on eg the average length. Then you wouldn't necessarily need to include the size of the compressor and decompressor in the score.
Of course, it would be utterly impractical to sample Gigabytes of new text each time you need to run the benchmark: humans are expensive writers. The only way this could work would be either to sample via an LLM, but that's somewhat circular and wouldn't measure what you actually want to measure in the benchmark, or you could try to keep the benchmark text secret, but that has its own problems.
I've encountered it >10 years ago and it felt novel that compression is related to intelligence and even AGI.
When you train your neural network to minimise cross-entropy that's literally the same as making it better as a building block in an arithmetic coding data compressor. See https://en.wikipedia.org/wiki/Arithmetic_coding
See also https://learnandburn.ai/p/an-elegant-equivalence-between-llm...
I would like to know what deviations are in the output as this almost feels like a game of telephone where each re-compression results in a loss of data which is then incorrectly reconstructed. Sort of like misremembering a story and as you tell it over time the details change slightly.
If instead you run LLM prediction and then encode the probability of the next token of the input text you want to encode (from the cumulative distribution, a number in [0, 1]) using arithmetic coding, you can run the same operation in reverse to achieve lossless compression.
The tricky part is ensuring that your LLM executes absolutely deterministically, because you need to make sure that the encoder and decoder have the same probability distribution map at each step.
Arithmetic coding takes a prediction of the next bit and writes out exactly as many bits as needed to correct that prediction. The amazing part is that you can write out fractional bits. Eg. You predict the next bit is '1' with 75% probability? If it is 1 you only need to write out 1/2 of a bit (correcting that 25% portion). If it's 0 you need to write out 2.4bits. It may seem strange to work with 1/2 a bit but it works! (essentially the other half of the bit represents other future correction required). You might have heard of huffman coding which can't deal with fractional bits, arithmetic coding is a generalization of huffman coding that can deal with this.
Arithmetic coding is mathematically perfect at what it does. You will not waste a single bit using this algorithm to encode data given a prediction of that data.
So the entirety of modern compression techniques don't deal with the encoding/decoding side at all. What they deal with is modelling the data so far and making the most accurate prediction possible on the next bit of data (next byte also works, but working 1 bit at a time is easier to comprehend when learning arithmetic coding).
Incidentally the encoders and decoders essentially work exactly the same. Given the data read or data decoded so far predict the next bit. This part is exactly the same either way. The decoder would read the compressed file for the correction and the encoder would read the input file and write out the correction. The important part is "predict the next bit". This is what separates all the different compressors.
This is also where those of us experienced in this area try to correct people on the wrong track. A compression algorithm is never about the encoding side but instead 100% always about the prediction of the data. Can you build a model that can accurately predict what the next data to come is? That's what you need to do to make a better file compressor. The entropy encoding part is a completely solved problem already, don't bother re-solving that.
It's -log2(0.75) for getting a 75% chance right and -log2(0.25) for getting it wrong. I should have stated .4 bits and 2bits respectively not 0.5 and 2.4. Sorry! Good catch.
It's 3.2 vs 4bits. Now that may not seem huge but the probabilities tend to be at the more extreme ends if the predictor is any good. Once you start going towards the 99% range you get extreme efficiency.
Where the lossy part comes in is the point at which humans notice/don't notice data being thrown away. Got a bit that was waaay out of line in the prediction and going to cost you 10bits to correct? Perhaps to humans it doesn't matter? Can we throw it away? This throwing away of data is often done before the prediction+compression stage (eg. maybe quantizing the color space to 8bits from 32bits?) but from there on it's the same thing.
If you change the resolution or color space of the entire file you do that without consideration to where the extra details might have been needed.
So resolution should match typical output resolutions exactly and from there it's all on the codec.
> The Llama tokenizer used in this project sometimes permits multiple possible tokenizations for a given string.
Not having tokens be a prefix code is thoroughly unfortunate. Do the Llama team consider it a bug? I don't see how to rectify the situation without a full retrain, sadly.