Poisoning Well(heydonworks.com)

118 pointsby wonger_8 days ago19 comments

kitku8 days ago
This reminds me of the Nepenthes tarpit [1], which is an endless source of ad-hoc generated garbled mess which links to itself over and over.
Probably more effective at poisoning the dataset if one has the resources to run it.
[1]: https://zadzmo.org/code/nepenthes/
- fleebee8 days ago
  I'm running Iocaine[1] which is essentially the same thing on my tiny $3/mo VPS and it's handling crawlers bombarding the honeypot with ~12 requests per second just fine. It's using about 30 MB of RAM.
  [1]: https://iocaine.madhouse-project.org/
  - treetalker8 days ago
    Odorless, tasteless, and among the more deadly poisons known to crawlers!
    BrenBarn7 days ago
    Unfortunately they will spend the next several years building up an immunity.
- 8organicbits7 days ago
  Do we know if LLM scrapers are running JavaScript on the pages? If they are, maybe it's worth offloading the Markov model to the client side.
nvader7 days ago
A link to the poisoned version of the same article:
https://heydonworks.com/nonsense/poisoning-well/
- 1970-01-017 days ago
  Crazy how close to coherent it reads, yet it clearly is gibberish.
  - rightbyte7 days ago
    Ye there is this flow. If you remember how ChatGPT was at release, it like slided into gibberish and you wouldn't instantly notice but a sentence or two in.
  - heelix7 days ago
    Similar thought when I looked at the above link. My pre-coffee brain was parsing along initially, and then it seemed to drift into silly - all while looking correct. I got to believe that sort of input would be toxic to a model.
neuroelectron8 days ago
Kind of too late for this. The ground truth of models has already been established. That' why we see models converging. They will automatically reject this kind of poison.
- nine_k8 days ago
  This will remain so as long as the models don't need to ingest any new information. If most novel texts will appear with slightly more insidious nonsense mirrors, LLMs would either have to stay without this knowledge, or start respecting "nofollow".
- sevensor7 days ago
  I don’t know about that. Have you seen their output? They’re poisoning their own well with specious nonsense text.
- blagie8 days ago
  It's competition. Poison increases in toxicity over time.
  I could generate subtly wrong information on the internet LLMs would continue to swallow up.
  - latexr8 days ago
    > I could generate subtly wrong information on the internet
    There’s already a website for that. It’s called Reddit.
  - wewtyflakes7 days ago
    Yes, but so would people, so what's the point of this unless you just dislike everybody (and if so, that's fair too I suppose)?
    blagie5 days ago
    The whole design was based on things to make pages not discoverable except to sketchy AI companies who violate web norms.
hosh7 days ago
There is a project called Iocaine that does something similar while trying to minimize resource use.
- agar7 days ago
  Isn't the Iocaine developer from Australia?
8organicbits7 days ago
The robots.txt used here only tells GoogleBot to avoid /nonsense/. It would be nice to tell other web crawlers too otherwise your poisoning everyone but Google, not just crawlers that ignore robots.txt
joncp7 days ago
That’ll end up in an arms race where you refine the gibberish to be more and more believable while the crawlers get better and better at detecting poison wells. The end state is where your fake pages are so close to the real thing that humans can’t tell the difference
- robocat6 days ago
  Humans have been refining their gibberish for centuries.
  I admire your idealism that "the real thing" is coherently different from good gibberish. The poisoned version of the same article is great: https://heydonworks.com/nonsense/poisoning-well/ (I love me some surrealism so gibberish is something I sometimes choose to input into my own model in my head).
  The scary part of AI is that it shows how crappy most of the training material is.
djoldman7 days ago
> I’m not drab I want the base to end this nobody.
I don't know why but the examples are hilarious to me.
wolvesechoes7 days ago
I am fine with poisoning the Google's well too. I rely on people recommending my blog to other people if they find it interesting, so ranking means nothing to me.
johnnienaked7 days ago
>Since most of what they consume is on the open web, it’s difficult for authors to withhold consent without also depriving legitimate agents (AKA humans or “meat bags”) of information.
It shouldn't be difficult at all. When you record original music, or write something down on paper, it's instantly copyrighted. Why shouldn't that same legal precedent apply to content on the internet?
This is half a failure of elected representatives to do their jobs, and half amoral tech companies exploiting legal loopholes. Normal people almost universally agree something needs to be done about it, and the conversation is not a new one either.
Popeyes8 days ago
What is this war about?
I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking copy of Encyclopedia Britannica and everything else is a
And now the corpus is collected, what difference does a blog post make, does it nudge the dial to comprehension 0.001% in a better direction? How many blog posts over how many weeks makes the difference.
- nvader7 days ago
  > they only got the head start by taking copy of Encyclopedia Britannica
  Wikipedia used a version of Encyclopedia Britannica that was in the public domain.
  Go thou and do likewise.
- simonw8 days ago
  This is the first I've heard of Wikipedia starting with a copy of Britannia. Where did you see that?
  - simonw8 days ago
    OK, found it: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Encyclop...
    "Starting in 2006, much of the still-useful text in the 1911 Encyclopaedia was adapted and absorbed into Wikipedia. Special focus was given to topics that had no equivalent in Wikipedia at the time. The process we used is outlined at the end of this page."
    Wikipedia started in 2001. Looks like they absorbed a bunch of out-of-copyright Britannica 1911 content five years later.
    There are still 13,000 pages on Wikipedia today that are tagged as deriving from that project: https://en.m.wikipedia.org/wiki/Template:EB1911
- collinmcnulty7 days ago
  It is about imposing costs on poorly behaved scraping in an attempt to change the scrapers behavior, under the assumption that the scrapers' creators are anti-social but economically rational. One blog doesn't make a huge difference but if enough new blogs contain tarpits that cost the scraper as much as the benefit of 100 other non-tarpit blogs, maybe the calculus for doing any new scraping changes and the scrapers start behaving.
tqwhite7 days ago
I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang. I do not accept the assertion that my AI should have any less access to the internet that we all pay for than I do.
If guys like this have their way, AI will remain stupid and limited and we will all be worse off for it.
- thoroughburro7 days ago
  I do not accept the assertion that the content I host is anything other than my own to serve to whom I wish. Get your hands off my belongings, freeloader.
  Or, shorter: I hold you in as much contempt as you hold me.
  > If guys like this have their way, AI will remain stupid and limited
  AI doesn’t have a right to my stuff just because it’ll suck without it.
  - tqwhite7 days ago
    You put it on the public internet, the public has the right to see it, take whatever lessons it can and do what it wants. It's the intrinsic nature of the internet.
    It has never been ok to say, "[this kind of person] is not allowed to look at my page nor, "you can only look at my page with [this web browser or other tool]".
    thoroughburro5 days ago
    > It has never been ok to say, "[this kind of person] is not allowed to look at my page nor, "you can only look at my page with [this web browser or other tool]".
    What? Sure it has. “This page best viewed in Internet Explorer.” “JavaScript required.” “Members only.” “Aggressive scrapers blocked.” “Random country blocked just for having higher proportion undesirable traffic.”
    All bog standard for decades.
- nkrisc7 days ago
  > I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang.
  Ok, when do I get paid for my contribution?
  - tqwhite7 days ago
    If you have posted public web pages then the fact of your being able to be seen by people using the public network infrastructure is your payment.
- IncreasePosts7 days ago
  Okay, what percent of content is poison like this? Way less than 1%? If AI is so smart maybe it can figure out what content in a training set is jibberish and what isn't.
  Anyway, a big problem people have isn't "AI bad", it's "AI crawlers bad", because they eat up a huge chunk of your bandwidth by behaving badly for content you intend to serve to humans.
  - tqwhite7 days ago
    The amount of bandwidth they use is trivial. That's not an issue.
    I don't have any idea if the poisoning idea is real or not. I express contempt for those that think it's a good idea. It's not just my support for AI learning from the public internet, there are also people who will see it. It's unethical.
- tqwhite7 days ago
  If you put stuff outside, in public, then it is correct for people to observe and learn from it. You should be happy that your web presence is enriching our culture. If not, you're selfish and I think you're bad.
  If you don't want it to contribute to the public use, make a login page. I see no reason for you to get the advantage of a public access without contributing to the community.
7 days ago
undefined
bboygravity8 days ago
I find this whole anti-LLM stance so weird. It kind of feels like trying to build robot distractions into websites to distract search engine indexers in the 2000's or something.
Like why? Don't you want people to read your content? Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?
Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.
- tpxl8 days ago
  > Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?
  Yes, it matters a lot.
  You know of authors by name because you read their works under their name. This has allowed them to profit (not necessarily in direct monetary value) and publish more works. Chucking everything into a LLM takes the profit from individual authors and puts them into pockets of gigacorporations.
  Not to mention the facts the current generation of LLMs will straight up hallucinate things, sometimes turning the message you're trying to send on its head.
  Then there's the question of copyright. I can't pirate a movie, but Facebook can pirate whole libraries, create a LLM and sell it and it's OK? I'd have a lot less of an issue if this was done ethically.
  - kulahan7 days ago
    Does it really matter when, previously, the exact same problem existed in the form of Google Cards in your search results? ;)
    nvader7 days ago
    The presense of an earlier problem does not solve, or make less severe, a later problem.
    Why are you winking?
    kulahan7 days ago
    Because it’s not a very serious comment, and yes, of course it makes future problems less severe. That’s such a weird and impossible-to-defend take?
    “Well we’ve had this exact problem for decades but now the same problem is instead coming from elsewhere so this is now completely different” makes zero sense.
    As an aside, a wink should send a pretty obvious message. I think you’re taking this generic internet conversation too personally.
    ninalanyon7 days ago
    It's not completely different, we simply failed to fix it the first time around.
    kulahan6 days ago
    Yes, my point all along is that it’s not different at all, so it’s a weird complaint to hear so often when it was relatively few complaining about the cards.
    tpxl6 days ago
    The preview in search results has the author (web page) prominently displayed. LLMs dont (and likely never will).
- InsideOutSanta8 days ago
  I think at least partially, it's not an anti-LLM stance, it's an anti-"kill my website" stance. Many LLM crawlers behave very poorly, hurting websites in the process. One result of that is that website owners are forced to use things like Anubis, which has the side-effect of hurting everybody else, too.
  I prefer this approach because it specifically targets problematic behavior without impacting clients who don't engage in it.
- aucisson_masque8 days ago
  I think it’s obvious.
  In simpler terms, it comes down to the « you made this ?, I made this » meme.
  Now if your ‘content’ is garbage that takes longer to publish than to write, I can get your point of view.
  But for the authors who write articles that people actually want to read, because it’s interesting and well written, it’s like robbery.
  Unlike humans, you can’t say that LLM create new things from what they read. LLM just sum up and repeat, evaluating with algorithms what word should be next.
  Meanwhile humans… Oscar Wilde — 'I have spent most of the day putting in a comma and the rest of the day taking it out.'
  - snowram8 days ago
    LLM can create new things, since their whole purpose is to interpolate concepts in a latent space. Unfortunately they are mostly used to regurgitate verbatim what they learned, see the whole AI Ghibli craze. Blame people and their narrow imagination.
- HankStallone8 days ago
  > Don't you want people to read your content?
  People, yes. Well-behaved crawlers that follow established methods to prevent overload and obey established restrictions like robots.txt, yes. Bots that ignore all that and hammer my site dozens of times a second, no.
  I don't see the odds of someone finding my site through an LLM being high enough to put up with all the bad behavior. In my own use of LLMs, they only occasionally include links, and even more rarely do I click on one. The chance that an LLM is going to use my content and include a link to it, and that the answer will leave something out so the person needs to click the link for more, seems very remote.
- wonger_8 days ago
  I can think of several anti-LLM sentiments right now:
  - developers upset with the threat of losing their jobs or making their jobs more dreadful
  - craftspeople upset with the rise in slop
  - teachers upset with the consequences of students using LLMs irresponsibly
  - and most recently, webmasters upset that LLM services are crawling their servers irresponsibly
  Maybe the LLMs don't seem so hostile if you don't fall into those categories? I understand some pro-LLM sentiments, like content creators trying to gain visibility, or developers finding productivity gains. But I think that for many people here, the cons outweigh the pros, and little acts of resistance like this "poisoning well" resonate with them. https://chronicles.mad-scientist.club/cgi-bin/guestbook.pl is another example.
  - threetonesun7 days ago
    You forgot the big one, every head of the major AI companies had dinner with a fascist the other day, and we already know they have their thumbs on the scale of weighting responses. It's more reasonable to say that the well is already poisoned.
- nottorp8 days ago
  > Don't you want people to read your content?
  There's the problem right there. If all you produce is "content" your position makes sense.
  - kulahan7 days ago
    Can you elaborate on what this means? Because I’m not sure which alternative you’re suggesting exists to put on a website besides content.
    zem6 days ago
    "content" suggests "I wrote this to have something on my webpage", as opposed to "writing", which suggests "I made a webpage to have somewhere to share this"
    kulahan5 days ago
    Interesting, thanks
- timdiggerm8 days ago
  If it's ad-supported or I'm seeking donations, I only want people reading it on my website. Why would I want people to access it through an LLM?
- LtWorf8 days ago
  I don't want having to pay extra money for vibe-coded LLMs companies bots to scrape my website constantly, ignoring cache headers and the likes.
  Every single person who has wrote a book is happy if others read their book. They might be less enthusiastic about printing million copies and shipping them to random people with their own money.
  - 8 days ago
    undefined
karahime7 days ago
Personally, I would not want to be on the side of people openly saying that they are poisoning the well.
- rideontime7 days ago
  A well drilled on my property, without even asking for permission, let alone receiving it? I'll happily take that side.
wilg7 days ago
In my opinion, colonialism was significantly worse than web crawlers being used to train LLMs.
- amelius7 days ago
  We're only at the beginning.
simonw8 days ago
There are two common misconceptions in this post.
The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots
Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains there's little point trying to convince them that companies sometimes actually do what they say they are doing.
The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.
Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.
- vintermann8 days ago
  There are definitively scrapers that ignore your robots.txt file. Whether they're some "Enemy State" LLM outfit, an "Allied State" corporation outsourcing their dirty work a step or two, or just some data hoarder worried that the web as we know it is going away soon, everyone is saying they're a problem lately, I don't think everyone is lying.
  But it's certainly also true that anyone feeding the scrapings to an LLM will filter it first. It's very naive of this author to think that his adlib-spun prose won't get detected and filtered out long before it's used for training. Even the pre-LLM internet had endless pages of this sort of thing, from aspiring SEO spammers. Yes, you're wasting a bit of the scraper's resources, but you can bet they're already calculating in that waste.
  - simonw8 days ago
    There are definitively scrapers that ignore your robots.txt file
    Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"
    The exact quote from the article that I'm pushing back on here is:
    "If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality."
    Which appears directly below this:
    User-agent: GPTBot Disallow: /
    simoncion8 days ago
    > But those aren't the ones that explicitly say "here is how to block us in robots.txt"
    Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.
    Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.
    [0] because of the expected total cost of licensing fees
    [1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons
    simonw8 days ago
    I agree that AI companies do all sorts of shady stuff to accumulate training data. See Anthropic's recent lawsuit which I covered here: https://simonwillison.net/2025/Jun/24/anthropic-training/
    That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't. Saying "we will obey your robots.txt file" and lying about it is a different category of shady. I care about that difference.
    simoncion4 days ago
    > That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't.
    Ah, good. So you have solid evidence that they're NOT doing shady stuff. Great! Let's have it.
    "It's unfair to require me to prove a negative!" you say? Sure, that's a fair objection... but my counter to that is "We'll only get solid evidence of dirty dealings if an insider turns stool pidgeon.". So, given that we're certainly not going to get solid evidence, we must base our evaluation on the behavior of the companies in other big projects.
    Over the past few decades, Google, Facebook, and Microsoft have not demonstrated that they're dedicated to behaving ethically. (And their behavior has gotten far, far worse over the past few years.) OpenAI's CEO is plainly and obviously a manipulator and savvy political operator. (Remember how he once declared that it was vitally important that he could be fired?) Anthropic's CEO just keeps lying to the press [0] in order to keep fueling AGI hype.
    [0] Oh, pardon me. He's "making a large volume of forward-looking statements that -due to ever-evolving market conditions- turn out to be inaccurate". I often get that concept confused with "lying". My bad.
    simonw4 days ago
    So call them out for the bad stuff! Don't distract from the genuine problems by making up stuff about them ignoring robots.txt directives despite their documentation clearly explaining how those are handled.
    simoncion4 days ago
    > So call them out for the bad stuff!
    I am. I am also saying -because the companies involved have demonstrated that they're either frequently willing to do things that are scummy as shit or "just" have executives that make a habit of lying to the press in order to keep the hype train rolling- that's it's very, very likely that they're quietly engaging in antisocial behavior in order to make development of their projects some combination of easier, quicker, or cheaper.
    > Don't distract from the genuine problems by making up stuff...
    Right back at you. You said:
    > There are definitively scrapers that ignore your robots.txt file Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"
    But, you don't have any proof of that. This is pure speculation on your part. Given the frequency of and degree to which the major companies involved in this ongoing research project engage in antisocial behavior [0] it's more likely than not that they are doing shady shit. As I mentioned, there's a ton of theoretical money on the line.
    The unfortunate thing for us is that neither of us can do anything other than speculate... unless an insider turns informant.
    [0] ...and given how the expected penalties for engaging in most of the antisocial behavior that's relevant to the AI research project is somewhere between "absolutely nothing at all" and "maybe six to twelve months of expected revenue"...
    cyphar8 days ago
    Maybe I'm the outlier here, but I think intentionally torrenting millions of books and taking great pains to try to avoid linking the activity to your company is far beyond something as "trivial" as ignoring robots.txt. This is like wringing your hands over whether a serial killer also jaywalked on their way to the crimescene.
    (In theory the former is supposed to be a capital-C criminal offence -- felony copyright infringement.)
    spacebuffer7 days ago
    Your initial comment made sense after reading through the openai docpage. so I opened up my site to add those to robots.txt, turns out I had added all 3 of those user-agents to my robots file [0], out of curiosity I asked chatgpt about my site and it did scrape it, it even mentioned articles that have been published after adding the robots file
    [0]: https://yusuf.fyi/robots.txt
    simonw7 days ago
    Can you share that ChatGPT transcript?
    One guess: ChatGPT uses Bing for search queries, and your robots.txt doesn't block Bing. If that's what is happening here I agree that this is really confusing and should be clarified by the OpenAI bots page.
    spacebuffer7 days ago
    Here you go: https://chatgpt.com/share/68bc125a-9e9c-8005-9c9f-298dbd541d...
    simonw7 days ago
    Yeah, wow that's a lot of information for a site that's supposedly blocked using robots.txt!
    My best guess is that this is a Bing thing - ChatGPT uses Bing as their search partner (though they don't make that very obvious at all), and BingBot isnt't blocked by your site.
    I think OpenAI need to be a whole lot more transparent about this. It's very misleading to block their crawlers and have it not make any difference at all to the search results returns within ChatGPT.
- Retric8 days ago
  > The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it:
  That’s testable and you can find content “protected” by robots.txt regurgitated by LLM’s. In practice it doesn’t matter if that’s through companies lying or some 3rd party scraping your content and then getting scraped.
  - simonw8 days ago
    There's a subtle but important difference between crawling data to train a model and accessing data as part of responding to a prompt and then piping that content into the context in order to summarize it (which may be what you mean by "regurgitation" here, I'm not sure.)
    I think that distinction is lost on a lot of people, which is understandable.
  - simonw8 days ago
    Do you have an example that demonstrates that?
    whilenot-dev8 days ago
    User Agent "Perplexity‑User"[0]:
    > Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
    [0]: https://docs.perplexity.ai/guides/bots
    Lerc8 days ago
    There's definitely a distinction between fetching data for training and fetching data as an agent on behalf of a user. I guess you could demand that any program that identifies itself as a user agent should be blocked, but it seems counterproductive.
    theamk7 days ago
    Counterproductive for what?
    If I am writing for entertainment value, I see no problem with blocking all AI agents - the goal of text is to be read by humans after all.
    For technical texts, one might want to block AI agents as well - they often omit critical parts and hallucinate. If you want your "DON'T DO THIS" sections to be read, better block them.
    nine_k8 days ago
    But this is more like `curl https://some/url/...` ignoring robots.txt.
    Crawlers are the thing that should honor robots.txt, "nofollow", etc.
    whilenot-dev8 days ago
    The title of this site is "Perplexity Crawlers"...
    And it clearly voids simonw's stance
    > it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it
    simonw8 days ago
    Where did Perplexity say they would obey robots.txt?
    They explicitly document that they do not obey robots.txt for that form of crawling (user-triggered, data is not gathered for training.)
    Their documentation is very clear: https://docs.perplexity.ai/guides/bots
    whilenot-dev8 days ago
    I thought your emphasis was on the "it's the idea that LLM vendors ignore your robots.txt file" part of your statement... Now your point is it's okay for them to ignore it because they announce that their crawler "ignores robots.txt rules"?
    simonw8 days ago
    No, my argument here is that it's not OK to say "but OpenAI are obviously lying about following robots.txt for their training crawler" when their documentation says they obey robots.txt.
    There's plenty to criticize AI companies for. I think it's better to stick to things that are true.
    latexr8 days ago
    > when their documentation says
    Their documentation could say Sam Altman is the queen of England. It wouldn’t make it true. OpenAI has been repeatedly caught lying about respecting robots.txt.
    https://www.businessinsider.com/openai-anthropic-ai-ignore-r...
    https://web.archive.org/web/20250802052421/https://mailman.n...
    https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope...
    simonw8 days ago
    Those three links don't support your argument here.
    The Business Insider one is a paywalled rehash of this Reuters story https://www.reuters.com/technology/artificial-intelligence/m... - which was itself a report based on some data-driven PR by a startup, TollBit, who sell anti-scraping technology. Here's that report: https://tollbit.com/bots/24q4/
    I downloaded a copy and found it actually says "OpenAI respects the signals provided by content owners via robots.txt allowing them to disallow any or all of its crawlers". I don't know where the idea that TollBit say OpenAI don't obey robots.txt comes from.
    The second one is someone saying that their site which didn't use robots.txt was aggressively crawled.
    The third one claims to prove OpenAI are ignoring robots.txt but shows request logs for user-agent ChatGPT-User which is NOT the same thing as GPTBot, as documented on https://platform.openai.com/docs/bots
    whilenot-dev8 days ago
    Agree, and in this same vein TFA was talking about LLMs in general, not OpenAI specifically. While I get your concern and would also like to avoid any sensationalism, there is still this itch about the careful wording in all these company statements.
    For example, who's the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat with its urge for automation, it could very well be that Perplexity maintains a list of all the sites blocked for the PerplexityBot user agent through robots.txt rules. Such a list would help for crawling optimization, but could also be used to later have any employer asking Perplexity a certain question that would attempt to re-crawl the site with the Perplexity‑User user agent anyway (the one ignoring robot.txt rules). Call it the work of the QA department.
    Unless we'd work for such a company in a high position we'd never really know, and the existing violations of trust - just in regard to copyrighted works alone(!) - is enough rightful reason to keep a certain mistrust by default when it comes to young companies that are already evaluated in the billions and the handling of their most basic resources.
- rozab8 days ago
  After I set up a self hosted git forge a little while ago, I found that within minutes it immediately got hammered by OpenAI, Anthropic, etc. They were extremely aggressive, grabbing every individual file from every individual commit, one at a time.
  I hadn't backlinked the site anywhere and was just testing, so I hadn't thought to put up a robots.txt. They must have found me through my cert registration.
  After I put up my robots.txt (with explicit UA blocks instead of wildcards, I heard some ignore them), I found after a day or so the scraping stopped completely. The only ones I get now are vulnerability scanners, or random spiders taking just the homepage.
  I know my site is of no consequence, but for those claiming OpenAI et al ignore robots.txt I would really like to see some evidence. They are evil and disrespectful and I'm gutted they stole my code for profit, but I'm still sceptical of these claims.
  Cloudflare have done lots of work here and have never mentioned crawlers ignoring robots.txt:
  https://blog.cloudflare.com/control-content-use-for-ai-train...
- CrossVR8 days ago
  > Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains there's little point trying to convince them that companies sometimes actually do what they say they are doing.
  Even if the large LLM vendors respect it, there's enough venture capital going around that plenty of smaller vendors are attempting to train their own LLMs and they'll take every edge they can get, robots.txt be damned.
  - simonw8 days ago
    Yeah this is definitely true.
- flir8 days ago
  > The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots
  So, uh... where's all the extra traffic coming from?
  - simonw8 days ago
    All of the badly behaved crawlers.
    flir8 days ago
    Yeah, I read the rest of the conversation and tried to delete. I understand your point now. Apologies.
- hooloovoo_zoo8 days ago
  Your link does not say they will obey it.
  - simonw8 days ago
    Direct quote from https://platform.openai.com/docs/bots
    "OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI."
    Then for GPT it says:
    "GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."
    What are you seeing here that I'm missing?
    hooloovoo_zoo8 days ago
    My read is that they are describing functionality for site owners to provide input about what the site owner thinks should happen. OpenAI is not promising that is what WILL happen, even in the narrow context of that specific bot.
- charles_f8 days ago
  I somewhat agree with your viewpoint on copyright, but what terrifies me is VCs like a16z or Sequoia simultaneously backing up large LLMs profiting from ignoring copyright and media firms where they'll use whatever power and lobby they have to protect copyright.
  I don't think the content I produce is worth that much, I'm glad if it can serve anyone, but I find amusing the idea to poison the well
fastball8 days ago
> According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IPs. This is rather technical and highly intensive
Wat. Blocklisting IPs is not very technical (for someone running a website that knows + cares about crawling) and is definitely not intensive. Fetch IP list, add to blocklist. Repeat daily with cronjob.
Would take an LLM (heh) 10 seconds to write you the necessary script.
- kinix8 days ago
  From how I read it, the author seems to be suggesting that this list of IPs be on an allowlist as they see Google as "less nefarious". As such, sure, allowing google IPs is as easy as allowing all IPs, but discerning who are "nefarious actors" is probably harder.
  A more tounge-in-cheek point: all scripts take an LLM ~10 seconds to write, doesn't mean it's right though.
- simonw8 days ago
  Looks like OpenAI publish the IPs of their training data crawler too: https://openai.com/gptbot.json
deadbabe8 days ago
Not every bot that ignores your robots.txt is necessarily using that data.
What some bots do is they first scrape the whole site, then look at which parts are covered by robots.txt, and then store that portion of the website under an “ignored” flag.
This way, if your robots.txt changes later, they don’t have to scrape the whole site again, they can just turn off the ignored flag.
- cyphar8 days ago
  Ah, so the NSA defence then -- "it's not bulk collection because it only counts as collection when we look at it".
- nvader7 days ago
  Not every intruder who enters your home is necessarily a burglar.
- imtringued8 days ago
  Your post-rationalization just doubles down on the stance that these crawlers are abusive and poorly developed.
  You're also under the blatantly wrong misconception that people are worried about their data, when they are worried about the load of a poorly configured crawler.
  The crawler will scrape the whole website on a regular interval anyway, so what is the point of this "optimization" that optimizes for highly infrequent events?
protocolture8 days ago
>One of the many pressing issues with Large Language Models (LLMs) is they are trained on content that isn’t theirs to consume.
One of the many pressing issues is that people believe that ownership of content should be absolute, that hammer makers should be able to dictate what is made with hammers they sell. This is absolutely poison as a concept.
Content belongs to everyone. Creators of content have a limited term, limited right to exploit that content. They should be protected from perfect reconstruction and sale of that content, and nothing else. Every IP law counter to that is toxic to culture and society.
- kstrauser8 days ago
  Tencent scrapers are hitting my little Forgejo site 4 times a second, 24/7. I pay for that bandwidth. Platitudes sound great, but this isn’t a lofty “drinking from the public well”. This is bastard operators taking a drink and pooping in it.
  My thoughts will have more room for nuance when they stop abusing the hell out of my resources they’re “borrowing”.
  - psychoslave8 days ago
    Why are there even doing so? This doesn’t feel like something that can even bring any value downstream to their own selfish pipelines, or am I missing something?
    kstrauser8 days ago
    No! They’re constantly hitting the same stupid URL (“show me this file in this commit in this repo with these 47 query params”) from a few thousand IPs in China and Brazil, with user agents showing an iPod or a Linux desktop running Opera 3.
    I wrote a little script where I throw in an IP and it generates a Caddy IP-matcher block with an “abort” rule for every netblock in that IP’s ASN. I’m sure there are more elegant ways to share my work with the world while blocking the scoundrels, but this is kind of satisfying for the moment.
    danaris8 days ago
    Best I can figure, they've decided that it's easier to set up their scrapers to simply scrape absolutely everything, all the time, forever than to more carefully select what's worth it to get.
    Various LLM-training scrapers were absolutely crippling my tiny (~125 weekly unique users) browser game until I put its Wiki behind a login wall. There is no possible way they could see any meaningful return from doing so.
    HankStallone8 days ago
    I get the impression that they're just too lazy or incompetent, or in too big a hurry, to put some sensible logic in their scrapers. Maybe they have an LLM write the scraper and don't bother to ask for anything more than "Make a web scraper that gets all the files it can as fast as it can."
    The last one I blocked was hitting my site 24 times/second, and a lot of them were the same CSS file over and over.
  - protocolture5 days ago
    Being a dick while scraping isnt really the same question as the use of that data.
    Anyway the answer is block em.
- latexr8 days ago
  > One of the many pressing issues is that people believe that ownership of content should be absolute, that hammer makers should be able to dictate what is made with hammers they sell.
  You’re conflating and confusing two different concepts. “Content” is not a tool. Content is like a meal, it’s a finished product meant to be consumed; a tool, like a hammer, is used to create something else, the content which will then be consumed. You’re comparing a JPEG to Photoshop.
  You can remix content, but to do that you use a tool and the result is related but different content.
  > Content belongs to everyone.
  Even if we conceded that point, that still wouldn’t excuse the way in which these companies are going about getting the content, hammering every page of every website with badly-behaved scrapers. They are bringing websites down and increasing costs for their maintainers, meaning other people have limited or no access to it. If “content belongs to everyone”, then they don’t have the right to prevent everyone else from accessing it.
  I agree current copyright law is toxic and harmful to culture and society, but that doesn’t make what these companies are doing acceptable. The way to counter a bad system is not to shit on it from a different angle.
  - malfist7 days ago
    To extended your metaphor, we don't get annoyed an a neighbor knocking on our door to ask us a question, but we absolutely do not want some random stranger that's trying to get rich from knocking on our door as asking questions when they're doing it over and over and over all hours of the day and night.
  - protocolture5 days ago
    >Even if we conceded that point, that still wouldn’t excuse the way in which these companies are going about getting the content
    Unrelated point that I wouldnt even defend. Block em. Its cool with me.
    >You’re conflating and confusing two different concepts. “Content” is not a tool. Content is like a meal, it’s a finished product meant to be consumed; a tool, like a hammer, is used to create something else, the content which will then be consumed. You’re comparing a JPEG to Photoshop.
    Eh I see what you are trying to say but a hammer is also a finished good thats sold as a finished good. I can also modify the hammer if I like. And after modification I can sell the hammer. JPEGs can also be inputs to things like collage.
- blagie8 days ago
  I'd be totally down with "content belongs to everyone."
  The problem is when you steal my content, repackage it, and resell it. At that point, my content doesn't belong to everyone, or even to me, but to you.
  * I'd have no problem with OpenAI, the non-profit developing open source AI models and governance models, scraping everyone's web pages and using it for the public good.
  * I have every problem with OpenAI, the sketchy for-profit, stealing content from my web page so their LLMs can regenerate my content for proprietary products, cutting me out-of-the-loop.
  - protocolture7 days ago
    Yes but I am sure your content that you want to exploit was made in a complete vacuum.
    blagie7 days ago
    "want to exploit"????
- account428 days ago
  So I get to freely copy Windows and Office and use them in products that I sell to others without Microsoft's consent now? Or is this only true when it benefits big corporations?
  - protocolture8 days ago
    Yeah go for it
    I did say this
    >They should be protected from perfect reconstruction and sale
    But I dont even really believe in that much so go nuts.
    jmye8 days ago
    I want to get this straight, given you
    > dont even really believe in that much
    If I write a book, let’s say it’s a really good book, and self-publish it, you’re saying you think it’s totally kosher for Amazon to take that book, make a copy, and then make it a best seller (because they have vastly better marketing and sales tools), while putting their own name in as author?
    That seems, to you, like a totally fine and desirable thing? That literally all content should only ever be monetized by the biggest corporations who can throw their weight around and shut everyone else out?
    Or is this maybe a completely half-baked load of nonsense that sounded better around the metaphorical bong circle?
    Come on, now.
    protocolture5 days ago
    Actually the more common outcome is that some enterprising random makes a compilation of public domain content and markets it for like 25 cents. Competing to make it as available to me as possible.
    Have a look at REH short stories on Google Books. This is super common.
    Do I want someone to do that to your book? To make it as available and as cheap for me to read on the platform of my choice.
    Yes.
    Its just data, and culturally speaking, it already belongs to me. I own your book. People can compete to deliver it to me for the cheapest price. I welcome that.
    I don't begrudge you going on tour, and selling author signed copies for whatever price you want. But likewise don't expect me to support a set a property norms that would deprive me of elements of the culture I live in.
    Come on, now.
    jmye5 days ago
    > that some enterprising random makes a compilation of public domain content
    My hypothetical book is not, at all, public domain. This is always a non-starter.
    > But likewise don't expect me to support a set a property norms that would deprive me of elements of the culture I live in.
    I could simply choose not to publish my book, and carry it around and let people read it in front of me. Apparently this is an insufferable “property norm” as you would be unable to consume my work at all, let alone for free and in the manner of your own choosing. What an absurd thing to believe in.
    Do you similarly think your entitled to sleep on my couch, or eat my dinner, or do you only think you’re entitled to take what you want when it’s words rather than, say, oranges? Or do you just have a weirdly tenuous grasp of what culture is?
    protocolture4 days ago
    >My hypothetical book is not, at all, public domain. This is always a non-starter.
    Right but in your weird strawman argument that assumes big scary amazon can reproduce it for free, it is effectively public domain.
    >I could simply choose not to publish my book, and carry it around and let people read it in front of me. Apparently this is an insufferable “property norm” as you would be unable to consume my work at all, let alone for free and in the manner of your own choosing. What an absurd thing to believe in.
    "I only want to contribute to human society if I can profit by it" as long as you can live knowing you are a sell out, I can live without reading your book, or using it to prop up my table.
    >Do you similarly think your entitled to sleep on my couch, or eat my dinner, or do you only think you’re entitled to take what you want when it’s words rather than, say, oranges? Or do you just have a weirdly tenuous grasp of what culture is?
    "Do you think you are entitled to <Scarce, physical thing> because you believe everyone is entitled to <non scarce, non physical thing intrinsic to human culture, able to be spread around the world to millions of people instantly>"
    No lmao.
    Lerc8 days ago
    Do you think if it were allowable for Amazon to do that, it would actually be profitable for them to do so?
    As soon as any work became popular, anyone could undercut Amazon. If you really think that Amazon is in a position where they can charge significant money for something others can provide for much less, then you are talking about an anticompetitive monopoly.
    If that's the case the problem is not with copyright, it's lack of competition. The situation we have now is just one where copyright means they can't publish just anything, but Amazon can always acquire the rights to something and apply those same resources to make it a best seller. They don't care if the book is great or not. They just want to be able to sell it. Being able to be the only producer of the thing incentives making the thing that they own popular, not the thing that is good. Having the option to pick what succeeds puts them in a dominant negotiating position so they can acquire rights cheaply.
    I guess if that were the case though it would be easy to spot things that were popular when though they seemingly lack merit or any real reason other than a strong marketing department. It would really suck in that world. Not only would there be talented people making good works and earning little money, but most people would not even get to see what they had created. For many creatives, that would be the worst part of it.
    fwip7 days ago
    Yes, Amazon would do that. Why would another person be able to meaningfully "undercut" Amazon here? Amazon would profit from selling e-books even if it's only for 10 cents - and integration with their Kindles and convenience of discovery would make it difficult for anyone to compete meaningfully on price.
    For printed books, economies of scale work in their favor as well - if it costs them $1.20 to manufacture/store/ship a paperback, and me $1.50, how am I supposed to undercut them?
- computerthings8 days ago
  [dead]