There's a massive push to add unnecessary complexity to everything out there, because complexity pays all our bills.
Like, you are letting them data mine your business. Why are corporations not panicking over this?
(Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)
Is actual ZDR verbiage in contracts more specific and limited in scope than what we see advertised publicly ("...except where needed to comply with law or combat misuse" in Anthropic's case)? Because those exceptions seem pretty damn vague, and holes large enough to drive trucks through.
to comply with the law, we must send to the police our detections of illegal activity >:|
a guy subpoenaed your chats, i guess we stored them (oops) so now it's illegal to destroy them...
As the customer base becomes more and more corporate (which it will), they end up with disproportionately more customers whose experiences cannot be used to train the model to make it better for those customers.
Either way, corporate customers cannot leech off the training from consumers handing over their personal data forever; there aren't enough specialists in that training set to improve the models with no loss of corporate trust.
Betrayal of their trust is inevitable.
At some point, where does the training advantage for specialist LLMs come from, if not progressively encroaching on customer data for the benefit of equivalent customers?
I’m not making any accusations, but we should not underestimate their tolerance for legal and financial risk.
It may be a little paranoid to insist on self-hosting based on that, but I’m not so sure that it’s crazy.
Which they did do, but the scale is minuscule relative to the full dataset.
How many people would take it?
I know I'd actually be tempted. Con: total loss of privacy. Pro: it folds laundry, and I f'ing loathe laundry with the intensity of a billion suns.
Every business has similar trade-offs they'd be tempted to take.
The implied part the children already know from other stories is:
The magic elves have a recorded history of laughing at their customers when they are on the toilet, hitting on their husbands/wives and misleading their children into worshipping the elvendom.
The story ends in some sort of catharsis for the protagonist when the elves go one step too far. In the happy-ending variant that Disney makes of it, it is not too late.
i also believe that we will live in a post-scarcity world, which means profit is no longer interesting, so any business case for invading your privacy will go away and therefore it will only happen for personal interest.
the key in any case will be education, because without it abuse will be rampant and progress will halt because everyone is going to be suspicious of everyone else.
i’m not sure why so many of us have fallen into this… “there is no other future” thing…
there are other options. plenty of them. there is no singular solution. we could always just say “no”. and that’s that. that would be one option.
why do we feel like there is no other way? why are we afraid to say “nah”?
Imagine Google search without any links or sources named
This is the “modern” AI chatbot:
It never mentions the training data it used, and in fact it has no idea what it used (often FB, Reddit and partisan websites)
Update: I added a reply about the after-the-fact Googling that chatbots do - it’s different
Or at least some of the sites, if the same info is sourced from 100 pages then it only shows 2 or 3, maybe the ones with the biggest PageRanks.
But those links are Googled after the model started to answer, they are not the links to the training data
Imagine an artificial “librarian” that read all the books and spits out hallucinated quotes for you
But doesn’t let you enter the library, open a single book or even see the sources for those hallucinated quotes
But instead Googles some sources based on hallucinations after generating them ;-)
It’s better than nothing but you can Google them, too, while training data (the library) is completely hidden from you, even the public domain parts of it - zero attribution
So if it sources something in Wikipedia, it is more likely to provide Wikipedia as a trusted source for it.
The problem is that when an answer is hallucinated or simply false, it may provide a source for it which contains the invalid info.
OlmoTrace, Guide Labs with Clarity and a few more
Labs train the model with attribution baked in, and they say the bigger the model, the more interpretable it becomes
Pretty sure it’s the future
Most of it is just misinformation, after all. People say knowledge shouldn't be restricted, but now we have the opposite problem. There's so much information that just skimming through it takes too much time. On top of that, as we shift from text to video, getting information has become even harder. Compared to text, YouTube videos feel like they have much lower information density. I've heard that the TikTok generation's text literacy is declining, but maybe that's actually a social adaptation to process as much data as possible from low-density sources.
In that sense, the efficiency of RAG ultimately comes down to what kind of good knowledge you're feeding into the AI.