“Make data get smoothed out” is a very strange way of saying “smooths out data”
> The weird, rare, surprising patterns [that make data rich] slowly get smoothed out when an AI model trains on outputs from a previous model.
i.e., the patterns are responsible for making data rich, and they are slowly lost as each new generation of model trains on the prior generation's output.
Or, if you'd prefer an analogy, we're using a copy machine to output new documents by taking the last copy spit out by the machine, adding some marks to it, and running it through the copier again. Over time, details present in much older copies blur and fade away in Nth generation copies.
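For anyone who wants to see the copier effect in miniature, here is a toy sketch. It is not the method of any study discussed here: the "model" is just a token frequency table, and the token names, corpus size, and generation count are all made up for illustration. Each generation counts frequencies in the previous generation's output and samples a fresh corpus from those counts. A token that misses one sampling round can never come back, so the number of distinct tokens can only stay flat or shrink.

```python
import random
from collections import Counter


def next_generation(corpus, rng, size):
    """'Train' by counting token frequencies, then sample a new corpus from them."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=size)


def simulate(generations=30, size=200, seed=0):
    """Return the number of distinct tokens in each generation (hypothetical setup)."""
    rng = random.Random(seed)
    # Generation 0: two common tokens plus a long tail of 50 rare ones.
    corpus = ["common_a"] * 80 + ["common_b"] * 70 + [f"rare_{i}" for i in range(50)]
    distinct = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng, size)
        distinct.append(len(set(corpus)))
    return distinct


if __name__ == "__main__":
    print(simulate())
```

The rare tokens disappear first because each one has only a small chance of being drawn in any given round, which is the "details blur and fade" part of the analogy. Real training pipelines mix in fresh data and filter outputs, so treat this as an illustration of the mechanism only.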
Actually, what you are describing is what happens when LLM-generated prose cycles and then trains humans to use equally dull thinking.
Type request, get info.
But that's such a narrow/one dimensional view of how LLMs are used. They can gather data or write an article, but that's probably a minority of use cases.
People have casual conversations with them, get code written, hold brainstorming sessions, dictate voice-recorded notes, and the list goes on.
While the data it's getting trained on is important, the supposition is that this data consists only of what sits out there on the interwebs.
That's as opposed to user input/interaction, which, I'm guessing, has a pretty large role in training models. Maybe even more so in some cases than AI-written blog spam.
It's like dictating to a typist like they did in the '60s - he will make sure that your letter looks professional and will fix your grammar, but you will sign the letter. This is totally different from LLM spam, the kind that inflates a sentence into a three-page article full of nothing.
So - is it a problem if the language reverts to a mean? That is the point of a shared language, right?
Which, I mean, fair enough within these constraints, but it's cited like it's a universal law.
Really all that can be taken away from the study is "we trained a very small model on data generated from it in a particular way, and this was eventually harmful for the model."
Also note that models are nowadays trained on massively self-generated data (task RL post-training) and it seems to significantly improve their performance.
I agree - but as the Internet descends into all-slop-all-the-time (seriously, just do a search for reviews or travel advice or technical questions - or most anything - to see it), where do you expect the high quality training material on future things to come from? I have a hard time imagining it.
Textbooks, company wikis, news corpora, structured reports of all kinds from far more sources than what is available on the web.
Sadly, enterprise fizzbuzz style is wildly successful compared to ghostty style.
Put another way, a gem of code versus the masses of mess. It's amazing new models aren't worse. And now most of this human interaction is with vibers.
LLMs trained by the crowd risk being medianizers, or rather, mediocritizers.
One need not look further than "Absolutely!" to see this in play -- user selection matters for corpus matters for model. Suddenly content everywhere is “Little houses, all alike.”
On your second line -- I couldn't agree more strongly.
ANTHROP\C has been sitting inside high performance white collar industries with top builders, and that signal is priceless compared to feedback farms in Kenya.
Bet on models that see spikey pointy mastery at play.