125 points by antonmks 7 hours ago | 11 comments
  • skilled 7 hours ago
    > In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models.

    Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

    • ThrowawayR2 2 hours ago
      Yes, it's been discussed many times before. All the corporations training LLMs have to have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ( "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ), concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
      • jkaplowitz an hour ago
        > Even one of the white papers commissioned by the FSF

        Quoting the text which the FSF put at the top of that page:

        "This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."

        So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write a paper on a specific topic before the paper was written, which a commission usually involves, and even then the commissioning organization doesn't always agree with the paper's conclusion unless the commission isn't considered done until the paper is updated to match the desired conclusion.

        > You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

        This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.

        It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.

        Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.

        • ThrowawayR2 an hour ago
          Since I very specifically wrote "commissioned by the FSF" instead of "represents the opinion of the FSF" to avoid misrepresenting the paper, you're arguing against something I have not said.
    • general1465 3 hours ago
      Did you pirate this movie? No I did not, it is fair use because this movie is nothing more than a statistical correlation to my dopamine production.
      • earthnail 3 hours ago
        The movie played on my screen but I may or may not have seen the results of the pixels flashing. As such, we can only state with certainty that the movie triggered the TV's LEDs relative to its statistical light properties.
      • JKCalhoun 2 hours ago
        I saw the movie, but I don't remember it now.
      • thaumasiotes 2 hours ago
        Note that what copyright law prohibits is the action of producing a copy for someone else, not the action of obtaining a copy for yourself.
      • Ferret7446 2 hours ago
        Indeed, the "copy" of the movie in your brain is not illegal. It would be rather troublesome and dystopian if it were.
        • visarga an hour ago
          The problem is when you use your "copy" as inspiration and actually create and publish something. It is very hard to be certain you are safe: besides literal expression, close paraphrasing is also infringing, as is using world-building elements or any original abstraction (AFC test). You can only know after a lawsuit.

          It is impossible to tell how much AI any creator used secretly, so now all works are under suspicion. If copyright maximalists successfully copyright style (vibes), then creativity will be threatened. If they don't succeed, then copyright protection will be meaningless. A catch-22.

        • SoftTalker an hour ago
          Not yet, anyway.
    • HillRat 15 minutes ago
      It's not settled law as it pertains to LLMs, but, yes, creating a "statistical summary" of a book (consider, e.g., a concordance of Joyce's "Ulysses") is generally protected as fair use. However, illegally accessing pirated books to create that concordance is still illegal.
    • NitpickLawyer 3 hours ago
      > Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

      It makes some sense, yeah. There's also precedent in Google scanning massive amounts of books but not reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. NVDA's argument here is that they're not reproducing the works, they're not providing the works for other people, they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.

      I don't see how they get around "procuring them" from 3rd party dubious sources, but oh well. The only certain thing is that our current laws didn't cover this, and probably now it's too late.

      • masfuerte 2 hours ago
        Scanning books is literally reproducing them. Copying books from Anna's Archive is also literally reproducing them. The idea that it is only copyright infringement if you engage in further reproduction is just wrong.

        As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.

        • NitpickLawyer 29 minutes ago
          https://cases.justia.com/federal/appellate-courts/ca2/13-482...

          This is the conclusion of the Authors Guild v. Google saga. It goes through a lot of factors, but in the end the conclusion is this:

          > In sum, we conclude that: (1) Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use. (2) Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement. Nor, on this record, is Google a contributory infringer.

        • Ferret7446 2 hours ago
          Private reproductions are allowed (e.g. backups). Distributing them non-privately is not.
          • masfuerte an hour ago
            Backups are permitted (and not for all media) when you legally acquired the source. Scanning a physical book is not a permitted backup, and neither is downloading a book from Anna's Archive.
        • amanaplanacanal 2 hours ago
          It seems like they pretty much don't care unless you distribute the copy. There is certainly precedent for it, going back to the Betamax case in the 1980s.
      • olejorgenb 2 hours ago
        > I don't see how they get around "procuring them" from 3rd party dubious sources

        Yeah, isn't this what Anthropic was found guilty of?

    • threethirtytwo 3 hours ago
      It does make sense. It’s controversial. Your memory memorizes things in the same way. So what nvidia does here is no different, the AI doesn’t actually copy any of the books. To call training illegal is similar to calling reading a book and remembering it illegal.

      Our copyright laws are nowhere near detailed enough to specify anything in detail here so there is indeed a logical and technical inconsistency here.

      I can definitely see these laws evolving into things that are human centric. It’s permissible for a human to do something but not for an AI.

      What is consistent is that obtaining the books was probably illegal, but say if nvidia bought one kindle copy of each book from Amazon and scraped everything for training then that falls into the grey zone.

      • ckastner 3 hours ago
        > To call training illegal is similar to calling reading a book and remembering it illegal.

        Perhaps, but reproducing the book from this memory could very well be illegal.

        And these models are all about production.

        • roblabla 3 hours ago
          To be fair, that seems to be where some of the AI lawsuits are going. The argument goes that the models themselves aren't derivative works, but the output they produce can absolutely be - in much the same way that reproducing a book from memory could be a copyright violation, trademark infringement, or generally run afoul of the various IP laws.
        • threethirtytwo 2 hours ago
          Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

          Most of the best fit curve runs along a path that doesn’t even touch an actual data point.

          • kalap_ur 2 hours ago
            If there is one exact sentence taken out of the book and not referenced in quotes and exact source, that triggers copyright laws. So the model doesn't have to reproduce the entire book; it is only required to reproduce one specific sentence (which may be a sentence characteristic of that author or that book).
            • CamperBob2 an hour ago
              > If there is one exact sentence taken out of the book and not referenced in quotes and exact source, that triggers copyright laws.

              Yes, and that's stupid, and will need to be changed.

          • empath75 2 hours ago
            They do memorize some books. You can test this trivially by asking ChatGPT to produce the first chapter of something in the public domain -- for example, A Tale of Two Cities. It may not be word for word exact, but it'll be very close.

            These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:

            https://arxiv.org/abs/2601.02671

            • threethirtytwo 2 hours ago
              In that case I would say it is the act of reproducing the books that is illegal. Training the AI on said books is not.

              So the illegality rests at the point of output and not at the point of input.

              I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.

              • ckastner 2 hours ago
                > So the illegality rests at the point of output and not at the point of input.

                It's not as simple as that, as this settlement shows [1].

                Also, generating output is what these models are primarily trained for.

                [1]: https://www.bbc.com/news/articles/c5y4jpg922qo

                • threethirtytwo 42 minutes ago
                  > Also, generating output is what these models are primarily trained for.

                  Yes but not generating illegal output. These models were trained with intent to generate legal output. The fact that it can generate illegal output is a side effect. That's my point.

                  If you use AI to generate illegal output, that act is illegal. If you use AI to generate legal output, that act is not illegal. Thus the point of output is where the legal question lies. From inception up to training there is clear legal precedent for the existence of AI models.

      • lelanthran 3 hours ago
        > To call training illegal is similar to calling reading a book and remembering it illegal.

        A type of wishful thinking fallacy.

        In law scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.

        • kalap_ur 2 hours ago
          It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent it changes what your intention may be.
          • lelanthran 17 minutes ago
            > It is not the scale that matters here, in your example, but intent. With 1 joint, you want to smoke it yourself. With 400, you very possibly want to sell it to others. Scale in itself doesn't matter; scale matters only to the extent it changes what your intention may be.

            It sounds then like you're saying that scale does indeed matter in this context, as every single piece of writing in existence isn't being slurped up purely to learn; it's being slurped up to make a profit.

            Do you think they'd be able to offer a useful LLM if the model was trained only on what an average person could read in a lifetime?

          • threethirtytwo 2 hours ago
            It’s clear nvidia and every single one of these big AI corps do not want their AIs to violate the law. The intent is clear as day here.

            Scale is only used for emergence: OpenAI found that training transformers on the entire internet would make it more than just a next-token predictor, and that is the intent everyone is going for when building these things.

        • threethirtytwo 2 hours ago
          Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based off scale. The law doesn’t differentiate whether I remember one book or a hundred, so there’s no difference for thousands or millions.

          No wishful thinking here.

          • lelanthran 21 minutes ago
            > Er no. I’ve read and remember hundreds of books in my lifetime. It’s not any more illegal based off scale.

            I'm not sure you understood what you said, but superficially it appears that you are agreeing with me?

            Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.

            We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/

      • kalap_ur 2 hours ago
        You can only read the book if you purchased it. Even if you don't have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?
        • ThrowawayR2 an hour ago
          Obviously not; one can borrow books from libraries and read them as well.
          • threethirtytwo 40 minutes ago
            That's true. But the book itself was legally purchased. So if nvidia went to the library and trained AI by borrowing books, that should be technically legal.
        • threethirtytwo 2 hours ago
          Yep, I agree. That’s the part that’s clearly illegal. They should purchase the books, but they didn’t.
          • Nursie 2 hours ago
            This is the bit an author friend of mine really hates. They didn’t even buy a copy.

            And now AI has killed his day job writing legal summaries. So they took his words without a license and used them to put him out of a job.

            Really rubs in that “shit on the little guy” vibe.

      • _trampeltier an hour ago
        But to train the models they have to download it first (make a copy)
      • godelski an hour ago
        You need to pay for the books before you memorize them
        • threethirtytwo 38 minutes ago
          Partially true. I can pay for a book then lend it out to people for free.

          The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.

      • Nursie 2 hours ago
        But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.
    • Bombthecat 4 hours ago
      Who cares? Only Disney had the money to fight them.

      Everything else will be slurped up for and with AI and be reused.

    • nancyminusone 3 hours ago
      When you're responsible for 4% of the global GDP, they let you do it.
      • qingcharles an hour ago
        They let you just grab any book you want.
    • tobwen 6 hours ago
      Books are databases, chars their elements. We have copyright for databases in EU :)
    • RGamma 4 hours ago
      The chicken is trying to become the egg.
    • postexitus 2 hours ago
      A quite good explanation of what copyright laws cover and should (and should not) cover is here by Cory Doctorow: https://www.theguardian.com/us-news/ng-interactive/2026/jan/...
    • Elfener 3 hours ago
      It seems so, stealing copyrighted content is only illegal if you do it to read it or allow others to read it. Stealing it to create slop is legal.

      (The difference is that the first use allows ordinary people to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)

  • poulpy123 5 hours ago
    I'm not saying it will change anything, but going after Anna's Archive while most of the big AI players intensely used it is quite something
    • pjc50 3 hours ago
      NVIDIA are "legitimate", so anything they do is fine, while AA are "illegitimate", so it's not.
    • gizajob an hour ago
      Library Genesis worked pretty great and unmolested until news came out about Meta using it, at which point a bunch of the main sites disappeared off the net. So not only do these companies take ALL the pirated material, their act of doing so even borks the pirates, ruining the fun of piracy for everyone else.
    • countWSS 3 hours ago
      Short-term thinking: they don't care about where the data comes from, only how easy it is to get. It's probably decided at the project-manager level.
  • antonmks 7 hours ago
    NVIDIA executives allegedly authorized the use of millions of pirated books from Anna's Archive to fuel its AI training. In an expanded class-action lawsuit that cites internal NVIDIA documents, several book authors claim that the trillion-dollar company directly reached out to Anna's Archive, seeking high-speed access to the shadow library data.
  • flipped 3 hours ago
    Considering AA gave them ~500TB of books, which is astonishing (very expensive to even store for AA), I wonder how much nvidia paid them for it? It has to be at least close to half a million?
    • qingcharles an hour ago
      I have a very large collection of magazines. AI companies were offering straight cash and FTP logins for them about a year or so ago. Then when things all blew up they all went quiet.
  • haritha-j an hour ago
    Just to clarify, the most valuable company in the world refuses to pay for digital media?
    • rpdillon an hour ago
      I see this sentiment posted quite a bit, but have the publishers made any products available that would allow AI training on their works for payment? A naive approach would be to go to an online bookstore and pay $15 for every book, but then you have copyrighted content that is encrypted, that it's a violation of the DMCA to decrypt.

      I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

      I think they're going to the pirate libraries because the product they want doesn't exist.

      • haritha-j 33 minutes ago
        Perhaps because authors don't want their content to be used for this purpose? Because Microsoft refuses to give me a copy of the source code to Windows to 'inspire' my vibe-coded OS, Windowpanes 12, from which I will not give Microsoft a single cent of revenue, it's acceptable for me to pirate it? Someone doesn't want to sell me their work, so I'm justified in stealing it?
    • nexle an hour ago
      they already paid 10x more to their lawyers to ensure that torrenting for LLM training is perfectly legal, why would they want to pay more?
    • 1over137 an hour ago
      Not spending money (vs spending money) helps make one rich!
    • NekkoDroid an hour ago
      Well... you don't want the good guys (Nvidia) giving money to the bad guys (Anna's Archive) right??? /s
  • utopiah 3 hours ago
    People HAVE to somehow notice how hungry for proper data AI companies are when one of the largest companies propping up the fastest-growing market STILL has to go to such lengths, getting actual approval for pirated content while being a hardware manufacturer.

    I keep hearing how it's fine because synthetic data will solve it all, how new techniques, feedback etc. Then why do that?

    The promises are not matching the resources available and this makes it blatantly clear.

  • derelicta an hour ago
    I feel like Nvidia's CEO would be the kind to snatch off sugary sachets from his local deli just to save up some more.
  • 1over137 an hour ago
    A great retaliation to Trump tariffs would be just cancelling copyright for American works in your country.
  • SanjayMehta 4 hours ago
    I'm wondering what Amazon is planning to do with their access to all those Kindle books.
    • quinncom an hour ago
      I was curious:

      • Anna’s Archive: ~61.7 million “books” (plus ~95.7M papers) as of January 2026 https://en.wikipedia.org/wiki/Anna%27s_Archive
      • Amazon Kindle: “over 6 million titles” as of March 2018 https://en.wikipedia.org/wiki/Anna%27s_Archive

      Hard to compare because AA contains duplicates, and the Kindle number is old, but at a glance it seems AA wins.

    • philipwhiuk 3 hours ago
      What do you mean 'planning'? You think they haven't already been sucked up?
      • embedding-shape 3 hours ago
        What do you mean 'sucked up'? It's data on their machines already, people willingly give them the data, so Amazon can process and offer it to readers. No sucking needed, just use the data people uploaded to you already.
        • sib 3 hours ago
          There's definitely a legal & contractual difference between (1) storing the books on your servers in order to provide them to end users who have purchased licenses to read them and (2) using that same data for training a model that might be used to create books that compete with the originals. I'm pretty sure that's what GP means by "sucking up."

          This is analogous to the difference between Gmail using search within your mail content to find messages that you are looking for vs Gmail providing ads inside Gmail based on the content of your email (which they don't do).

          • embedding-shape 2 hours ago
            Yeah, I guess the "err" is on my side, I've always taken "suck up" as a synonym for scraping, not just "using data for stuff".

            And yeah, you're most likely right about the first, and the contract writers have with Amazon most certainly anticipates this, and includes both uses in their contract. But! Never published on Amazon, so don't know, but I'm guessing they already have the rights for doing so with what people have been uploading these last few years.

  • rtbruhan00 6 hours ago
    It's generous of them to ask for permission.
    • gizajob 4 hours ago
      They wanted access to a faster pipe to slurp 500 terabytes, and that access comes at a cost. It wasn’t about permission.

      And yeah they should be sued into the next century for copyright infringement. A $4 trillion company illegally downloading the entire corpus of published literature for reuse is clearly infringement; it's an absurdity to say that it’s fair use just to look for statistical correlations when training LLMs that will be used to render human authors worthless. One or two books is fair use. Every single book published is not.

      • empath75 2 hours ago
        Whatever they get sued for would be pocket change.
    • breakingcups 4 hours ago
      It wasn't about permission, it was about high-speed access. They needed Anna's Archive to facilitate that for them, scraping was too slow. It's incredible that they were allowed to continue even after Anna's Archive themselves explicitly pointed out that the material was acquired illegally.
      • kristofferR 3 hours ago
        That's just normal US modus operandi. The court case against Maduro is allowed to continue even after everyone has acknowledged he was acquired illegally.
    • kristofferR 3 hours ago
      It's not permission, it's a service they offer:

      https://annas-archive.li/llm

  • wosined 3 hours ago
    Sounds like BS. Why would nvidia need the books. Do they even have a chatbot? I doubt the books help with framegen.
    • johndough 24 minutes ago
      From the top of the linked article:

      > NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do.
      
      You can download the models here: https://huggingface.co/nvidia
    • utopiah 3 hours ago
      The same reason Intel worked on OpenCV: they want to sell more hardware by pushing the state of the art of what software can do on THEIR hardware.

      It's basically just a sales demonstrator that, optionally, if incredibly successful and costly, they can still sell as SaaS, or otherwise just offer for free.

      Think of it as a tech ad.

    • voidUpdate 3 hours ago
      I can't see the whole relevant section in the article, but there is a screenshot of part of the legal documents that states "In response, NVIDIA sought to develop and demonstrate cutting edge LLMs at its fall 2023 developer day. In seeking to acquire data for what it internally called "NextLargeLLM", "NextLLMLarge" and-" (cuts off here)