Saying "Near lossless" to mean 90% accurate retrieval of saved vectors is simply a lie. Lossy-ness is binary, not something you can paper over by getting close enough. And 90% is not close. Sure, LLMs are all about gradient descent on noisy data sets, so I guess this is acceptable in this field, but that use of the terminology still bothered me.
90% depends entirely on what the measure means here. Do you understand what "Normalized Discounted Cumulative Gain at rank 10" means for the set of data that we are comparing?
Sometimes coming up with new codecs (compressor-decompressors) means coming up with new ways to interpret artifacts of the real world. And this is exactly why LLMs are so powerful: they are like a giant lossy (but near-lossless for various use cases) ZIP file / database of the whole knowledge of the training data.
Nobody is trying to manipulate you here, humanity just has to find new explanations for complex topics.
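For anyone unsure what NDCG@10 actually measures, here is a minimal sketch of the textbook definition. The function names and the idea of passing a plain list of graded relevance scores (in the order the system ranked the results, covering all judged candidates) are assumptions for illustration, not the article's evaluation code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked results."""
    # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0; swapping two lower results
# only lowers the score slightly, because later ranks are discounted.
perfect = ndcg_at_k([3, 2, 1])
swapped = ndcg_at_k([3, 1, 2])
```

So a score around 0.9 is a statement about the ordering quality of the top 10 results, not a claim that 10% of the stored vectors come back wrong.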
> Lossy-ness is binary
Lossless is binary in pure information theory. To quote my other comment: Lossless is objective for information theory. To get from the real world to the digital world you need an analog-to-digital converter, and this process is by definition lossy. We are interested in the real world, and information is pure but never represents reality exactly. Lossyness is baked into our problem statement here.
Using terms like near lossless means we think we are very close to reality for what we’re trying to do
I could imagine a scenario where differences tend to be more substantive than you'd expect because of how less frequent words with fine distinctions in meaning - the very words that make the document special - may be embedded in the vector space.
I have also been working in compression and performance engineering, and managed to get 99+% compression versus conventional approaches (100+KB down to 1KB) for 30-minute massively multiplayer game replays in a “game+engine” I’m developing.
I think there’s a synergy between these 2 concepts. I’d love to chat some more.
In principle, binary x binary should be pretty fast since it just requires bitwise XNOR and popcount/reduction, but in practice it's slow unless you've really optimized it. And, as stated in the article, you'd still be losing a lot of accuracy that way.
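To make the XNOR + popcount idea concrete, here is a rough NumPy sketch, assuming vectors are binarized by sign and interpreted as {-1, +1}. The helper names `pack_signs` and `binary_dot` are made up for this example, and the popcount is emulated with `np.unpackbits`, which is nowhere near as fast as a hardware popcount instruction.

```python
import numpy as np

def pack_signs(x):
    """Binarize a float vector by sign and pack 8 bits per byte."""
    return np.packbits(x > 0)

def binary_dot(a_bits, b_bits, n_dims):
    """Dot product of two {-1, +1} vectors stored as packed bits.

    XNOR marks the positions where the two vectors agree. With m
    agreements out of n dimensions, the dot product is m - (n - m).
    """
    xnor = np.bitwise_not(np.bitwise_xor(a_bits, b_bits))
    # Slice to n_dims so padding bits added by packbits are ignored.
    matches = int(np.unpackbits(xnor)[:n_dims].sum())
    return 2 * matches - n_dims

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = rng.standard_normal(100)
score = binary_dot(pack_signs(a), pack_signs(b), n_dims=100)
```

This only shows the arithmetic. Getting it fast in practice means operating on wide machine words with a native popcount, which is the "really optimized" part.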
You can't be a little bit on fire :)
sure
> if you treat it in a binary way where everything short of 100 falls into one "lossy" bucket you lose all the practical differences that make one encoding much better than another.
No; lossless is an inherently binary term. And I don't lose all the practical differences between better lossy encoders by understanding that; I'm not going to start using 96k mp3 just because I understand lossless vs lossy encoders...
Lossless is an objectively binary term.
Lossyness is baked into our problem statement here.
Using terms like near lossless means we think we are very close to reality for what we’re trying to do
That typo up there is kind of endearing in the AI slop era.