It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS-like attack?
If I run a big data hungry AI lab consuming training data at 100Gb/s it's much much easier to scrape 10,000 sites at 10Mb/s than DDOS a smaller number of sites with more traffic. Of course the big labs want this data but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
You are incorrectly assuming competency, thoughtful engineering and/or some modicum of care for negative externalities. The scraper may have been whipped up by AI, and shipped an hour later after a quick 15-minute test against en.wikipedia.org.
Whoever the perpetrator is, they are hiding behind "residential IP providers" so there's no reputational risks. Further, AI companies already have a reputation for engaging in distasteful practices, but popular wisdom claims that they make up for the awfulness with utility, so even if it turns out to be a big org like OpenAI or Anthropic, people will shrug their shoulders and move on.
Residential IP providers definitely don't remove reputational risk. There are many ways people can find out what you are doing. The main one being that your employees might decide to tell on you.
The IP providers are a great way of getting around Cloudflare etc. They are also reasonably expensive! I find it very plausible that these IP providers are involved but I still don't understand who is paying them.
This isn't to say every attack that looks similar is being done by Huawei (which I can't say for certain, anyway). But to me, it does look an awful lot like even large organizations you'd think would be competent can stoop to these levels. I don't have an answer for you as to why.
Anyway, I think the (currently small[1]) but growing problem is going to be individuals using AI agents to access web-pages. I think this falls under the category of the traffic that people are concerned about, even though it's under an individual user's control, and those users are ultimately accessing that information (though perhaps without seeing the ads that pay for it). AI agents are frequently zooming off and collecting hundreds of citations for an individual user, in the time that a user-agent under manual control of a human would click on a few links. Even if those links aren't all accessed, that's going to change the pattern of organic browsing for websites.
Another challenge is that with tools like Claude Cowork, users are increasingly going to be able to create their own, one-off, crawlers. I've had a couple of occasions when I've ended up crafting a crawler to answer a question, and I've had to intervene and explicitly tell Claude to "be polite", before it would build in time-delays and the like (I got temporarily blocked by NASA because I hadn't noticed Claude was hammering a 404 page).
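For what it's worth, here is a minimal sketch of what "be polite" could mean in practice: a fixed minimum delay between requests, plus giving up after a few consecutive error responses instead of hammering an error page. The class name, the `fetch` callable (assumed to return an HTTP status code) and the default values are all hypothetical; this is not what Claude generated for me.

```python
import time
from typing import Callable, Optional


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and a stop-on-errors rule."""

    def __init__(
        self,
        fetch: Callable[[str], int],       # hypothetical: returns an HTTP status code
        min_delay: float = 2.0,            # seconds between requests (assumed value)
        max_consecutive_errors: int = 3,   # stop rather than keep hitting an error page
        sleep: Callable[[float], None] = time.sleep,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._max_errors = max_consecutive_errors
        self._sleep = sleep
        self._clock = clock
        self._last_request: Optional[float] = None
        self._errors = 0

    def get(self, url: str) -> int:
        # Refuse to continue once too many requests in a row have failed.
        if self._errors >= self._max_errors:
            raise RuntimeError("too many consecutive errors; stopping")
        # Wait until at least min_delay has passed since the previous request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        self._last_request = self._clock()
        status = self._fetch(url)
        # Any 4xx/5xx counts as an error; a success resets the counter.
        self._errors = self._errors + 1 if status >= 400 else 0
        return status
```

The `sleep` and `clock` parameters are only there so the delay logic can be exercised without actually waiting. A real crawler would presumably also want to check robots.txt and honour `Retry-After`, which this sketch leaves out.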
The Web was always designed to be readable by humans and machines, so I don't see a fundamental problem now that end-users have more capability to work with machines to learn what they need. But even if we track down and successfully discourage bad actors, we need to work out how to adapt to the changing patterns of how good actors, empowered by better access to computation, can browse the web.
[1] - https://radar.cloudflare.com/ai-insights#ai-bot-crawler-traf...
BOTS=( "semrushbot" "petalbot" "aliyunsecbot" "amazonbot" "claudebot" "thinkbot" "perplexitybot" "openai.com/bot" )
This was really just emergency blocking and it included more than 1500 IP addresses.
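Roughly the same idea as a Python sketch, in case it helps anyone: match the User-Agent against the substrings above and collect the offending source IPs. The function names and the sample log entries are made up for illustration (the IPs are from the documentation ranges), and this is not the script I actually ran.

```python
from typing import Iterable, Set, Tuple

# Same substrings as the BOTS list above, kept lowercase for matching.
BOTS = (
    "semrushbot", "petalbot", "aliyunsecbot", "amazonbot",
    "claudebot", "thinkbot", "perplexitybot", "openai.com/bot",
)


def is_blocked_agent(user_agent: str) -> bool:
    """True if the User-Agent contains any of the listed bot substrings."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BOTS)


def blocked_ips(requests: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect source IPs of requests whose User-Agent matched the list."""
    return {ip for ip, user_agent in requests if is_blocked_agent(user_agent)}


# Illustrative (made-up) log entries as (ip, user_agent) pairs.
sample = [
    ("192.0.2.10", "Mozilla/5.0 (compatible; ClaudeBot/1.0)"),
    ("198.51.100.7", "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"),
    ("203.0.113.5", "Mozilla/5.0 (compatible; PetalBot)"),
]
print(sorted(blocked_ips(sample)))
```

Obviously this only catches bots that identify themselves in the User-Agent header.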
Here's Amazon's page about their bot with more information including IP addresses
The only big AI company I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.
As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.
I suppose it only needs one person though. So it's probably a pretty plausible explanation.
The spirit of this site is so dead. Where are the hackers? Scraping is the best anyone is coming up with?
It's not scraping. They'd notice themselves getting banned everywhere for abuse of this magnitude, which is counterproductive to scraping goals. Rather than rate-limit the queries to avoid that attention, they're going out of their way to (pay to?) route traffic through a residential botnet so they can sustain it. This is not by accident, nor a byproduct of sloppy code Claude shat out. Someone wants to operate with this degree of aggressiveness, and they do not want to be detected or stopped.
This setup is as close to real-time surveillance as can be. Someone really wants to know what is being published on target sites with as minimal a refresh rate as possible and zero interference. It's not a western governmental entity or they'd just tap it.
As for who...there's only one group on the planet so obsessed with monitoring and policing everything everyone else is doing.
Then that got passed down to the engineers and those engineers got ridden until they turned the dial to 11. Some VP then gets to go to the quarterly review with a "we beat our data ingestion metrics by 15%!".
So any engineer that pushes back basically gets told too bad, do it anyways.
I've also run into these local maxima stupidities dozens or more times in my career where it was obvious someone was gaming a performance metric at the expense of the bigger picture - which required escalation to someone who could see said bigger picture to get fixed. Happens all the time as a customer where some sales rep or sales manager wants to game short-term numbers at the expense of long-term relationships. Smaller companies you can usually get it fixed pretty quickly, larger companies tend to do more doubling down.
It usually starts with generally well-intentioned goal setting but devolves into someone optimizing a number on a spreadsheet without care (or perhaps knowledge) of the damage it can cause.
Hell, for the most extreme example look at Dieselgate. Those things don't start from some evil henchman at the top saying "let's cheat and game the metrics" - it often starts with someone setting impossible-to-achieve goals unknowingly in service of "setting the bar high for the organization", and by the time the backpressure filters up through the org it's oftentimes too late to fix the damage.
Your theoretical engineers would figure out pretty quickly that crashing a server slows you down and the only way to keep the boss happy is to avoid the DDOS.
The Chinese ones are hyper aggressive, with no rate limit and pure greed scraping. They'll scrape the same content hundreds of times the same day
In my experience, they do not bother putting in the effort to obfuscate source or evade bans in the first place. They might try again later, but this particular setup was specifically engineered for resiliency.
A little over a decade ago (f*ck I'm old now [0]), I had a similar conversation with an ML Researcher@Nvidia. Their response was "even if we are overtraining, it's a good problem to have because we can reduce our false negative rate".
Everyone continues to have an incentive to optimize for TP and FP at the expense of FN.
They may face less reputational damage than say Google or OpenAI would but I expect LWN has Chinese readers who would look dimly on this sort of thing. Some of those readers probably work for Alibaba and Tencent.
I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.
I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?
Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.
They are. I participate in modding communities for very niche gaming projects. All of them experienced massive DDOS attacks from AI scrapers on their websites over the past year. They are long running non-commercial projects that don’t present any business interest to anyone to be worth expending resources purely to bring them offline. They had to temporarily put the majority of their discussion boards and development resources behind a login wall to avoid having to go down completely.
I did think of a couple of possibilities:
- Someone has a software package or list of sites out there that people are using instead of building their own scrapers, so everyone hits the same targets with the same pattern.
- There are a bunch of companies chasing a (real or hoped for) “scraped data” market, perhaps overseas where overhead is lower, and there’s enough excess AI funding sloshing around that they are able to scrape everything mindlessly for now. If this is the case then the problem should fix itself as funding gets tighter.
Yes. Fortunately if your hobby community is regional you can be fairly blunt in terms of blocks.
Of course they're not going to stop at just code. They need all the rest of it as well.
It's trivially easy to get Claude to scrape that and regurgitate it under any requested licence (some variable name changes, but exactly the same structure - though it got one of the lookup tables wrong, which is one of the few things you could argue aren't copyrighted there).
It'll even cheerfully tell you it's fetching the repository while "thinking". And it's clearly already in the training data - you can get it to detail specifics even disallowing that.
If I referenced copyrighted code we didn't have the license for (as is the case for copyleft licenses if you don't follow the restrictions) while employed as a software engineer I'd be fired pretty quick from any corporation. And rightfully so.
People seem to have a strange idea with AI that "copyleft" code is free game to unilaterally re-license. Try doing that with leaked Microsoft code - you're breaking copyright just as much there, but a lot of people seem to perceive it very differently - and not just because of risk of enforcement but in moralizing about it too.
Has it been adjudicated that AI use actually allows that? That's definitely what the AI bros want (and will loudly assert), but that doesn't mean it's true.
Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people other than being vaguely "tech" and hated by HN? At best you have someone like Altman who founded openai and had a crypto project (worldcoin), but the latter was approximately used by nobody. What about everyone else? Did Ilya Sutskever have a shitcoin a few years ago? Maybe Changpeng Zhao has an AI lab?
That was a biometric surveillance project disguised as a crypto project.
> Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people
No, the "AI" people are far worse. I always had a choice to /not/ use crypto. The "AI" people want to hamfistedly shove their flawed investment into every product under the sun.
grok will blame the zionists rather than the freemasons for that one.
"It is a DDOS attack involving tens of thousands of addresses"
It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
And if you don't care about the "residential" part you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked.
Well, you just need people to install your browser extension. Or your proprietary web browser. Or your mobile app. Or your nice MCP. Maybe get them to add your PPA repository so they automatically install your sneakily-overridden package the next time they upgrade their system.
Anything goes as long as your software has access to outgoing TCP port 443, which almost nobody blocks, so even if it's being run from within a Docker container or a VM it probably doesn't affect you.
They don't really need to scrape training data as CommonCrawl or other content archives would be fine for training data. They don't think/know to ask what they really want: training data.
In the least charitable interpretation it's anti-social assholes that have no concept or care about negative externalities that write awful naive scrapers.
For unwanted bots I serve incorrect information -- it's online gaming match history without much text so requests flagged as unwanted bots will, instead of heavy database queries, get plausibly random numbers -- seeded by the user so they stay stable -- KDA, win/loss rates, rankings.
A few dozen million distinct pages, but they are numeric stats for user profiles and match stats, with little to no paragraph-form text.
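Conceptually it is something like the sketch below: hash the user identifier into a seed, then draw every number from a generator seeded with it, so a flagged bot gets the same plausible junk each time it requests the same profile and the database is never touched. The function name, field names and value ranges here are invented for illustration and are not my production code.

```python
import hashlib
import random


def fake_profile_stats(user_id: str) -> dict:
    """Return plausible-looking but fabricated stats, stable per user_id."""
    # Derive a deterministic seed from the user id (hypothetical scheme).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    games = rng.randint(50, 2000)
    wins = rng.randint(0, games)
    kills = rng.randint(0, 30)
    deaths = rng.randint(1, 30)   # at least 1 to avoid dividing by zero
    assists = rng.randint(0, 30)

    return {
        "games": games,
        "win_rate": round(wins / games, 3),
        "kda": round((kills + assists) / deaths, 2),
        "rank": rng.randint(1, 100_000),
    }
```

Calling `fake_profile_stats("some-player")` twice returns the same dictionary, which is the point: the fake pages look consistent across repeat visits.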
There is no reason for AI scrapers to use tens of thousands of IPs to scrape one site over and over.
That just sounds like a classic DDOS.
Having lots of IPs is helpful for scraping, but you don't need 10k. That's a botnet
It can absolutely be that, but that requires a confluence of multiple factors: a misconfigured scraper hitting the site over and over, a big botnet-like proxy setup that is way overkill for scraping, a setup sophisticated enough to do all that yet simultaneously stupid enough not to cope with a site that is mostly text and a couple of gigs at most, and all of that over an extended timeframe without anyone realising their scraper is stuck.
Or alternative explanation: It's a DDOS
Also I don't know why you think this is sophisticated, it's probably 40 lines of Python code max.
I have a site which is currently being hit (over 10k requests today) and it looks like scrapers as every URL is different. If it was a DDoS, they would target costly pages like my search not every single URL.
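A rough way to put numbers on that, as a sketch only: compare how many distinct paths were requested with how concentrated the traffic is on a single path. The function names are mine, and these two ratios are a heuristic I am assuming is informative, not an established detection method.

```python
from collections import Counter
from typing import Iterable


def url_diversity(paths: Iterable[str]) -> float:
    """Fraction of requests that hit a distinct path (1.0 = every URL different)."""
    counts = Counter(paths)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def top_path_share(paths: Iterable[str]) -> float:
    """Fraction of requests going to the single most-requested path."""
    counts = Counter(paths)
    total = sum(counts.values())
    return counts.most_common(1)[0][1] / total if total else 0.0
```

Crawler-like traffic would tend towards high diversity and a low top-path share, while something hammering an expensive endpoint such as search would look like the reverse.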
SQLite had the same thing: https://sqlite.org/forum/forumpost/7d3eb059f81ff694 As have a few other open source repositories. It looks like badly written crawlers trying to crawl sites as fast as possible.
big tech incentivised to ddos... what a world they've built
In that case, by that rubric literally anything that you conspire with yourself to accomplish (buying next week's groceries, making a turkey sandwich...) would also be a conspiracy.
I also don't get the comments on the linked social site. IIUC the users posting there are somehow involved with kernel work, right? So they should know a thing or two about technical stuff? How / why are they so convinced that the big bad AI baddies are scraping them, and not some misconfigured thing that someone or another built? Is this their first time? Again, there's nothing there that hasn't been indexed dozens of times already. And... sorry to say it, but neither newsletters nor the 1-3 comments on each article are exactly "prime data" for any kind of training.
These people have gone full tinfoil hat and spewing hate isn't doing them any favours.
https://lwn.net/Articles/1008897
Your nonsense about LWN being a "newsletter" and having "zero valuable data" isn't doing you any favors. It is the prime source of information about Linux kernel development, and Linux development in general.
"AI" cancer scraping the same thing over and over and over again is not news for anybody even with a cursory interest in this subject. They've been doing it for years.
I mean...
Again, the site is so old that anything worthwhile is already in cc or any number of crawls. I am not saying they weren't scraped. I'm saying they likely weren't scraped by the bad AI people. And certainly not by AI companies trying to limit others from accessing that data (as the person who I replied to stated).
1. Coding assistants have emerged as one of the primary commercial opportunities for AI models. As GP pointed out, LWN is the primary discussion for kernel development. If you were gathering training data for a model, and coding assistance is one of your goals, and you know of a primary source of open source development expertise, would you:
(a) ignore it because it’s in a quaint old format, or
(b) slurp up as much as you can?
2. If you’d previously slurped it up, and are now collating data for a new training run, and you know it’s an active mailing list that will have new content since you last crawled it, would you:
(a) carefully and respectfully leave it be, because you still get benefit from the previous content even though there’s now more and it’s up to date, or
(b) hoover up every last drop because anything you can do to get an edge over your competitors means you get your brief moment of glory in the benchmarks when you release?
You seem to be missing my point. There are 0 incentives for AI training companies to behave like this. All that data is already in the common crawls that every lab uses. This is likely from other sources. Yet they always blame big bad AI...
Some scrapers might skip already-scraped sources, but it's easy to imagine that some/many just would not bother (you don't know if it's updated until you've checked, after all). And to some extent you do have to re-scrape, if just to find links to the new stuff.
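The "checking" part can at least be cheap if the server supports HTTP conditional requests: send back the `ETag` / `Last-Modified` values from the previous fetch and a `304 Not Modified` reply means nothing changed. A minimal sketch, with hypothetical helper names and assuming the crawler stored those validators last time:

```python
from typing import Dict, Optional


def conditional_headers(
    etag: Optional[str], last_modified: Optional[str]
) -> Dict[str, str]:
    """Build request headers that ask the server to skip unchanged content."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reparse(status: int) -> bool:
    """304 Not Modified means the stored copy is still current."""
    return status != 304
```

This only helps with servers that actually send validators, and it still costs one request per URL, so it does not remove the need to revisit index pages for new links.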
" a.getElementsByTagName = function (...args) {//Clear page content}"
One can also hide components inside Shadow DOM to make it harder to scrape.
However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
Edit: Fabian2k was ten seconds ahead. Damn!