Why isn't this a criminal offense? They are hurting businesses for profit (or for a higher valuation, as they probably have no profit at all).
Why are corporations allowed to do with impunity what could land even a teenager years in prison? Is there no rule of law anymore?
The five-year and ten-year penalties kick in only when the government can show the offense caused at least $5,000 in losses across all victims during a one-year period. https://legalclarity.org/what-are-the-punishments-for-a-ddos...
Those laws are intended to protect corporations. If corporations are the ones doing the scraping, it doesn't make sense for the same laws to affect them.
I repeat what Aaron’s friends and lawyers said at the time: we were going to fight that case, and we were going to win.
Acme.com is welcome to require authentication for every page but their home page, which would quickly cause the traffic to drop. They don't want to do this - like the coffee shop, they want to be open to the public, and for good reasons.
Sometimes the usage profile changes dramatically in a short time. 15 years ago, Netflix created the video streaming market, and shared bandwidth capacity that had been excessive before suddenly wasn't enough. 15 years before that, Google did the same thing when they popularized web search and started driving tremendous traffic to text-based websites that had previously spread through word of mouth.
Turns out the micropayment people probably had the right idea.
I've had to deploy a combination of Cloudflare's bot protection and Anubis on over 200 domains across 8 different hosting environments in the last 2 months. I have small business clients that couldn't access their sales and support platforms because their websites that normally see tens of thousands of unique sessions per day are suddenly seeing over a million in an hour.
Anthropic and OpenAI were responsible for over 70% of that traffic.
Have you not been paying attention to the news for the past few years?
No, there isn't. If there were, Trump would be in prison, not the Oval Office. And he and the Republican Party have deliberately fostered this environment of corruption and rule-by-wealth so that they can gain more power and even more wealth.
And now they are also backing the AI zealots, and techbros more generally, to ensure that they can do whatever the hell they want, damn the consequences to the rest of the world.
How do you think search engines work?
If you train an LLM, you don't keep a copy of every page around, so there's no point checking whether you need to re-scrape a page - you always do, because you store nothing.
Personally I think people would be pretty indifferent to the new generation of scrapers, AI or otherwise, if they at least behaved and slowed down when they noticed a site struggling. If they had the slightest bit of respect for others on the web, this wouldn't be an issue.
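For what it's worth, "behaving" on the scraper side isn't complicated. A hypothetical sketch: honor the server's Retry-After header and back off exponentially when it signals overload (HTTP 429 or 503). The function names and thresholds here are my own illustration, not any vendor's actual code.

```python
import time
import urllib.error
import urllib.request

def next_delay(attempt, retry_after=None, base_delay=1.0):
    """Delay before retry `attempt` (0-based): honor the server's
    Retry-After header if given, else back off exponentially."""
    if retry_after is not None:
        return float(retry_after)
    return base_delay * 2 ** attempt

def polite_fetch(url, max_retries=5):
    """Fetch a URL, slowing down when the server signals overload
    (HTTP 429 or 503). Sketch only."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code not in (429, 503):
                raise  # a real error, not a load signal
            time.sleep(next_delay(attempt, e.headers.get("Retry-After")))
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

A crawler that did even this much would stop crashing small sites, at the cost of a slower crawl.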
And they give you real, valuable traffic in return.
It's the same problem as why Occupy Wall Street fell apart: a bunch of losers who don't understand the system screeching about the system. Because they don't understand it, they can't offer any meaningful dialogue about how to fix it beyond screeching.
If you search for `site:github.com "acme.com"` in Google, you'll find numerous instances of the domain being used in contrived links in documentation as an example of how URLs might be structured on an arbitrary domain and also in issues to demonstrate a fully qualified URL without giving away the actual domain people were using.
This means that numerous links are pointing to non-existent paths on `acme.com` because of the nature of how people are using them in documentation and examples.
But that isn't necessary to explain the results being described.
If sites like my tiny little browser game, with roughly 120 weekly unique users, are getting absolutely hammered by the scraper bots (it was, last year, until I put the wiki behind a login wall; I still get a significant amount of bot traffic, it's just no longer enough to actually crash the game), then sites people actually know and consider important, like acme.com, are very likely getting massive deluges of traffic purely from first-order hits.
Honeypot links are the only thing that helps, but the resulting iptables rule sets have grown so massive they're slowing things down.
This is not what I want to do with my time. I can't afford the expensive specialised tools. I'm just a solo entrepreneur on a shoestring budget. I just want to improve the website for my 3k real users and 10k real daily guests, not for bots.
Small site operators like us know very well that the utility they can get by scraping us is marginal at best. Based on their patterns of behavior, though, my best guess is that they've simply configured their bots to scrape absolutely everything, all the time, forever, as aggressively as possible, and treat any attempt to indicate "hey, this data isn't useful to you" as an adversarial signal that the site operator is trying to hide things from them that are their God-given right.
> Now closing https service is obviously just a temporary fix
Probably the best starting point would be to edit the robots.txt file and disallow LLM bots there. Currently the file allows all bots: http://acme.com/robots.txt
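A starting point could look like the fragment below. GPTBot, ClaudeBot, and CCBot are the user-agent tokens OpenAI, Anthropic, and Common Crawl document for their crawlers; verify against each vendor's current docs, and keep in mind robots.txt is only honored voluntarily.

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```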
Every run is basically a fresh run: no state stored, every page just fed into the machine anew. At least that's my theory.
The AI companies need a full copy of your page every time they retrain a model. Now, they could store that in their own datacenters, but that's a full copy of the internet, in a market where storage costs are already pretty high. So instead, they externalize the storage cost. If you run a website, a public GitLab instance, Forgejo, a wiki, a forum, whatever, you basically function as free offsite storage for the AI companies.
1) https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
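For contrast, the client-side state that would avoid re-downloading unchanged pages is tiny. This sketch (local state only, no network; the names are my own) shows a crawler remembering ETags and sending If-None-Match, so the server can answer 304 Not Modified with no body:

```python
# Hypothetical crawler-side cache keyed by URL. Storing just the
# ETag (a short string) lets the server skip resending the page.
etag_cache = {}  # url -> etag from the last successful fetch

def request_headers(url):
    """Headers for a conditional GET: include If-None-Match
    when we've seen this URL before."""
    headers = {"User-Agent": "polite-crawler/0.1"}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    return headers

def record_response(url, status, etag):
    """Remember the ETag on a 200; on a 304 the cached copy is valid."""
    if status == 200 and etag:
        etag_cache[url] = etag
```

The cache holds a short string per URL, not a copy of the internet - which is why skipping it reads as externalizing cost rather than a technical necessity.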
Do any webservers have a feature where they keep a list in memory of files/paths that exist?
They usually request something deep like /foo/bar/login.html as part of their reconnaissance.
I'm up to 4 pages of filter rules after the massive IP blacklist.
These assholes are also scanning every address on the IPv4 internet and hoovering up the content.
To answer your first question: no, that's the OS's job. But some clever rules could be set up for filtering invalid requests, depending on your web server.
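As a sketch of such filtering, assuming a small, mostly static site (function names are mine): build the set of valid paths once at startup, then reject anything else before it reaches the filesystem or application code, which kills the /foo/bar/login.html-style probes cheaply.

```python
import os

def build_path_index(docroot):
    """Walk the document root once and collect every servable URL path."""
    valid = set()
    for dirpath, _dirs, files in os.walk(docroot):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, docroot).replace(os.sep, "/")
            valid.add("/" + rel)
    return valid

def is_valid_request(path, valid_paths):
    """Cheap set-membership test run before any real work."""
    return path in valid_paths
```

For a site that changes rarely, the index can be rebuilt on deploy; a dynamic site would need route-aware logic instead.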
* global distributed caching of content. This reduces the static load on our servers and bandwidth usage to essentially zero, and since content is served from the endpoint closest to the client, they get lower latency. This includes logged-in-user-specific content as well.
* shared precached common libraries (e.g. jQuery) for faster client load times
* automated minification of JS, CSS, and HTML, along with image optimization (serving a size and resolution specific to the device the user is viewing from) to increase speed
* always up mode (even if my server is down for some reason, I can continue to serve static content)
* detailed analytics and reporting on usage / visitors
There are a lot more, but those are a few that come to mind.
I also added various rate limits such as 1 RPS to my expensive SSR pages, after which a visitor gets challenged. Again this blocks bots without harming power users much.
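A per-IP limit like that is typically a token bucket (the 1 RPS figure is the commenter's; the class below is my own illustration, not any particular product's): tokens refill at the allowed rate, a small burst is tolerated, and when the bucket is empty the visitor gets the challenge page instead of the expensive SSR.

```python
import time

class TokenBucket:
    """Per-IP token bucket: refills at `rate` tokens/sec up to `burst`.
    allow() returns False once the bucket is drained, at which point
    the caller should serve a challenge instead of rendering."""
    def __init__(self, rate=1.0, burst=5.0, now=None):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The burst parameter is what keeps power users unharmed: a human clicking a few pages quickly stays within it, while a bot hammering continuously drains it and stays drained.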
Who do you think writes these scrapers? Well, I mean aside from the vibe coded ones.