Your smart thermometer isn't making Reddit posts trying to sound like a human who's just concerned that the bedroom is a bit too warm.
If you perform simple extrapolation, the M2M data only surpass the others around 2029.
Coincidentally, in the original timeline of the Transformer movie, 2029 is the year that the Resistance, led by John Connor, destroyed Skynet and ended the war against the machines.
I’d love to see that crossover Terminator and Transformers movie. Optimus Prime vs T-800 anyone?
Leaving the original timeline uncertain.
Was it original, really the original? Or the 10th, or millionth loop?
Skynet can still be in our future.
Glad I found this quote. It is quite helpful for an AI to search the web on my behalf... even if it was just finding where I can locally buy peanuts like the ones I got from abroad.
In fact, even ads ingested by the training data set at this very moment could be useful. Go to Gemini and tell it you want to buy a jacket or whatever and it will recommend some products it ingested from the training data.
The issue in this particular case is that this content and the web servers behind it are set up for human traffic. In the worst case, a human consumes a few megabytes of data from the server and then leaves. A few of those visits will convert into a job or business opportunity - a fair bargain. LLM scrapers are not like that. They're greedy resource hogs. Not only do they want everything you have, a whole bunch of them hit your server repeatedly and endlessly to get it. There's no possible way to justify the cost of such massive bandwidth consumption for a bunch of parasites that never give anything in return. And what do we get? A crappy user experience from all those sites putting up protection measures. This is the tragedy of the commons.
So who is the culprit? The greedy bunch who created the technology that behaves like this and then benefit immensely from it. Are those bad people? Absolutely! Naturally, we need them and their ill-intentioned creations off our shared spaces. This isn't anything new. This game has been playing out in different forms forever.
It's just that now the official numbers say so.
But anyone on Twitter or Reddit can tell you the dead internet theory has been progressing at a swift pace for a decade.
AI just made it more apparent.
The current one is awful, and there's so much AI/bot content, but I can find far more detailed information using AI-enabled search that isn't covered in ads. I can get an initial overview of methodology without trawling through SEO articles.
I think AI has been almost a natural response to the enshittification of the internet - ChatGPT wouldn't have seemed so transformative if Google Search had been working like Google Search rather than Ad Generator 5000 before it was released.
Best thing to do is to avoid idly browsing social media and curate your internet experience.
Suddenly, confirmed quality of scraped data will be at a premium... "Scrape Engine Optimizers"?
With AI, we have an exponential level of productivity. But what is being produced? 90%: garbage.
The problem is that what is being produced is essentially "garbage" generated by models trained on garbage. Quality knowledge is increasingly submerged and suffocated by spam and low-quality content.
The real challenge of the future will be filtering and cleaning up, on each level.
We already see this with synthetic training data that basically uses logic, in the form of math and code, as a constraint.
I've heard this argument before, but you don't need to think too hard to see the limitations of a machine with no senses.
Only if you assume that people who train models are stupid.
And it's simply not reasonable for AI companies to have humans read through individual comments everywhere from beginning to end to build their training data. There isn't enough time in the universe to advance AI while doing that and also being accurate. Something will always slip through.
Why would human review be the only possible way to remove enough of the tainted training data?
> Where are they supposed to find content to train AI on that isn't polluted with AI content that'll result in a feedback loop?
If nothing else: you could look for old data. At the moment, training assumes that input data is essentially without limit. But machine learning has lots and lots of old and proven techniques for what to do when your training data is limited.
You can also look into techniques for avoiding model collapse. Just because one group of researchers showed that this happens with some specific models doesn't mean it needs to happen in general.
And even that needs to be curated because before AI tools there was bot content filling up the internet.
...and even without bots, a lot of human-authored content is low value, poorly written, etc.
There are (probably) companies out there whose business is to create, curate and improve training sets.
Probably the only real way to validate that content is real is to build a validation system into devices. When a photo is taken, register it and send an ID to a server; then, when the photo is shared, its ID is compared against the image on the camera/phone manufacturer's server. For text, validate every little key press. There are still ways to game these systems, but I would not be surprised if they're introduced to mitigate AI diffusing everywhere.
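To make the photo half of that idea concrete, here is a minimal sketch, not a description of any real manufacturer's system. `CaptureRegistry`, `register`, `verify`, and the device ID are all hypothetical names, and an in-memory dict stands in for the manufacturer's server. It assumes the server stores a SHA-256 hash of the image bytes at capture time and checks a shared image against that hash later.

```python
import hashlib


class CaptureRegistry:
    """Hypothetical stand-in for a manufacturer's validation server."""

    def __init__(self):
        # Maps capture ID -> SHA-256 hex digest of the original image bytes.
        self._records = {}

    def register(self, device_id: str, image_bytes: bytes) -> str:
        """Called by the device at capture time; returns an ID to ship with the photo."""
        digest = hashlib.sha256(image_bytes).hexdigest()
        # Assumption: the ID is derived from the device ID plus the image digest.
        capture_id = hashlib.sha256(f"{device_id}:{digest}".encode()).hexdigest()[:16]
        self._records[capture_id] = digest
        return capture_id

    def verify(self, capture_id: str, image_bytes: bytes) -> bool:
        """Called when a photo is shared; True only if the bytes match the registered capture."""
        expected = self._records.get(capture_id)
        if expected is None:
            return False
        return hashlib.sha256(image_bytes).hexdigest() == expected


registry = CaptureRegistry()
photo = b"raw sensor bytes"
capture_id = registry.register("camera-123", photo)

print(registry.verify(capture_id, photo))               # True
print(registry.verify(capture_id, photo + b"edited"))   # False
print(registry.verify("unknown-id", photo))             # False
```

Exact-hash matching is the simplest possible choice, and it shows one of the ways such a system gets gamed or just breaks: any re-encode, resize, or crop changes the bytes and fails verification, while photographing a screen that shows a generated image would pass.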
E.g., by filtering data, by procuring better data, by applying techniques for making do with more limited data (we used to have a lot of those, and they are still known), or by adapting your training process to be less vulnerable to model collapse. Just because some researchers have shown that this happened for the models they tested doesn't mean it has to be a universal thing.
Someone in the chain will be. Even the smartest people buy a lot of their training datasets. What happens when those get contaminated?
Filters are also not infallible.
> And you negotiate a contract where the seller bears some of that risk
So the training data will be polluted anyway, but "the seller will bear some risk".
I'm a creator of such content, and like everyone else, I have to make do with 60-70% less traffic now.
It's just harder when you cut all traffic to them, devalue their work and fill the air with AI noise.
We'll have the internet we deserve
Marx, Nietzsche, Debord, Foucault, Baudrillard, Adorno - they already saw the writing on the wall, or at least fragments of it.
Which means filtering and ranking systems become the main bottleneck.
That pushes platforms toward stronger algorithmic selection and sometimes stronger convergence of attention.
Once content gets cheap, the winners are less likely to be the best creators and more likely to be the strongest gatekeepers.