I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers built the normal way, just by more bad actors.
Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I haven't worked on bot detection in the last few years, but it was very common for residential-proxy-based scrapers to hammer sites back then too, so I'm wondering what's different.
Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.
"It's for AI" feels like lazy reasoning for me... but what IS it for?
One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?
Maybe everyone is trying to take advantage of the situation before the law eventually catches up.
I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.
It's a race to the bottom. What's different is we're much closer to the bottom now.
Right, this is exactly what they are.
They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or shouldn't have to bother spending time) to learn any kind of specifics of any given site and c) don't care what damage they do to anyone else as they get the data they crave.
(a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them.
Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives.
Why? Data. Every bit of it might be valuable. And not to sound tinfoil-hatty, but we are getting closer to a post-quantum era (if we aren't there already).
But I think what the OP is implying is insecure hardware being infected by malware and access to that hardware being sold as a service to disreputable actors. For that, buy a good-quality router and keep it up to date.
If there is a common text pool used across sites, maybe that will get the attention of bot developers and automatically force them to back down when they see such responses.
Make sure your caches are warm and responses take no more than 5ms to construct.
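A minimal sketch of that idea for an in-process Python app; render_page here is a hypothetical stand-in for whatever expensive template or database work your framework actually does:

    import time
    from functools import lru_cache

    def render_page(path):
        # Hypothetical stand-in for your real, expensive page rendering.
        return f"<html><body>{path}</body></html>"

    @lru_cache(maxsize=4096)
    def _cached_render(path, time_bucket):
        return render_page(path)

    def get_page(path, ttl=60):
        # The bucket value changes every `ttl` seconds, so cached pages
        # refresh themselves without any explicit invalidation.
        return _cached_render(path, int(time.time() // ttl))

Repeat hits within the TTL window come straight out of memory, which is how you keep responses in the low-millisecond range.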
Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by default puts the session ID in a query parameter when cookies are disabled, many scrapers will scrape every URL numerous times, each with a different session ID. Caching doesn't help you here either, since the URLs are unique per visitor.
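If you do control a caching layer in front of it, one workaround is to normalise the session ID out of the cache key. A rough Python sketch ("sid" is phpBB's default parameter name; the cache itself is left out):

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def cache_key(url, drop=("sid",)):
        # Strip phpBB-style session IDs so every visitor's copy of a page
        # collapses onto one cache entry.
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k not in drop]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), ""))

    # Both of these now map to the same cache entry.
    assert cache_key("https://forum.example/viewtopic.php?t=42&sid=aaa") == \
           cache_key("https://forum.example/viewtopic.php?t=42&sid=bbb")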
Self-hosting was originally a "right" we had upon gaining access to the internet in the 90s; it was the main point of the Hypertext Transfer Protocol.
It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow and disallowed in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis.
Not everyone wants to put their site behind Cloudflare.
Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
I have no idea if it actually works as advertised though. I don't think I've heard from anyone trying it.
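If anyone wants to experiment, the flat-gzip flavour is at least easy to generate; this is a sketch, not a recommendation, and the multi-terabyte figures people quote presumably come from nested archives, since plain gzip tops out around a 1000:1 ratio:

    import gzip

    # ~10 GiB of zeros compresses to roughly 10 MB. Served with
    # "Content-Encoding: gzip", a naive client that trusts the header
    # will try to inflate the whole thing in memory.
    chunk = b"\0" * (1024 * 1024)            # 1 MiB of zeros
    with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
        for _ in range(10 * 1024):           # 10 GiB uncompressed
            f.write(chunk)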
Cloudflare will even do it for free.
We should be able to achieve close to the same results with some configuration changes.
AWS / Azure / Cloudflare total centralization means no one will be able to self-host anything, which is exactly the point of this post.
That Cloudflare is trying to monetise "protection from AI" is just another grift; as a corp they can't help themselves.
My advice to the OP: if you're not experienced enough, maybe stop taking subtle digs at AI, fire up Claude Code, and ask it how to set up a LAMP stack or a simple Varnish cache. You might find it's a lot easier than writing a blog post.
Then a poorly written crawler shows up and requests 10,000s of pages that haven't been requested recently enough to be in your cache.
I had to add a Cloudflare captcha to the /search/ page of my blog because of my faceted search engine, which produces many thousands of unique URLs when you consider tags and dates and pagination and sort-by settings.
And that's despite me serving every page on my site through a 15-minute Cloudflare cache!
Static only works fine for sites that have a limited number of pages. It doesn't work for sites that truly take advantage of the dynamic nature of the web.
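To put rough numbers on it (these facet counts are made up, purely to show the multiplication a crawler walks into on a faceted search page):

    # Hypothetical facet counts for a faceted /search/ page.
    tags, years, sort_orders, result_pages = 200, 20, 3, 10
    unique_urls = tags * years * sort_orders * result_pages
    print(unique_urls)  # 120000 distinct crawlable URLs, each a potential cache miss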
Just to add further emphasis as to how absurd the current situation is: I host my own repositories with gotd(8) and gotwebd(8) to share within a small circle of people. There is no link anywhere on the Internet to the HTTP site served by gotwebd(8), so they fished the subdomain out of the main TLS certificate. For the last six or so months I have been getting hit once every few seconds by crawlers that ignore robots.txt (of course) and wander aimlessly around "high-value" pages like my OpenBSD repository forks, calling blame, diff, etc.
Still managing just fine to serve things to real people, despite me at times having two to three cores running at full load to serve pointless requests. Maybe I will bother to address this at some point as this is melting the ice caps and wearing my disks out, but for now I hope they will choke on the data at some point and that it will make their models worse.
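For anyone wondering how an unlinked subdomain gets found: any hostname listed in a certificate's Subject Alternative Name field (and anything logged in Certificate Transparency) is effectively public, and pulling that list takes a few lines of stdlib Python; example.com is just a placeholder host here:

    import socket, ssl

    def san_hostnames(host, port=443):
        # Fetch the served certificate and list its DNS Subject Alternative
        # Names, which is all a crawler needs to "discover" an otherwise
        # unlinked subdomain.
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]

    print(san_hostnames("example.com"))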