A federal judge sides with Anthropic in lawsuit over training AI on books(techcrunch.com)

180 pointsby moose443 days ago16 comments

NobodyNada3 days ago
One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)
Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?
[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
- riskable3 days ago
  A judge already ruled that models themselves don't constitute copyright infringement in Kadrey v. Meta Platforms, Inc. (https://casetext.com/case/kadrey-v-meta-platforms-inc). The EFF has a good summary about it:
  > the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.
  See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...
  - qoez2 days ago
    Time to overfit on some books and publicize them as a libgen mirror.
    londons_explorea day ago
    I think this could lead to interesting results outside the legalities.
    Imagine you're getting it to spit out lord of the rings, but midway through you inject into the output 'Suddenly, the ring split in two. No longer one ring to rule them all, but two!'.
    You then let the model write the rest of the story!
    esperenta day ago
    I'm sure many people have imaged this - supposing that LLMs, while making no great strides towards AGI, consciousness, or any of that, nonetheless keep getting better and better at what they do now. Imagine a decade or two of steady improvements, throw in at least a couple of major breakthroughs. Much longer context by a few orders of magnitude. Much better quality, in terms of tone, consistency, hallucinations.
    Maybe we'll actually be able to say things like: write me a trilogy in the style of Lord of the Rings but with these changes:
    * Make it scifi
    * Add more female characters with greater depth
    * At least five rings
    * Hobbits are the bad guys
    ... Or whatever, specifying a version of the story tailored to your intersts, and that you would get out really high quality results, similar in quality to the source materials.
    Imagine you could do the same with movies, games, music.
    I'm not trying to assign a value judgement here. There's good and bad sides. However, this reality is becoming easier to imagine with each new model released.
    For sure, anyone who is a writer or artist will see this as bad. But perhaps our whole concept of what art is will become more fluid and personalized.
    falcor84a day ago
    I love that, and only wanted to nitpick that not only are there "at least five rings" in the story, but a full score (20): 9 for men, 7 for dwarves, 3 for elves and the one ring.
    esperenta day ago
    Damn you're right, it's been too many years since I read it. I was following on from the previous comment talking about splitting the ring in two.
- comex3 days ago
  Yes and no.
  In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.
  So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.
  But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.
  On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.
  Also note:
  - Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.
  - See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).
  [1] https://arxiv.org/abs/2412.06370
  - ethbr13 days ago
    Wouldn't a model that can recite training data verbatim be larger than necessary? Exact text isn't coming from nowhere, no matter how efficiently the bits are encoded, and the same effectiveness should be achievable by compressing those portions of the model.
    zeven72 days ago
    Maybe we are all just LLMs. If the books were written by a language producing algorithm in a human mind, maybe there’s not as much raw data there as it seems, and the total information can in fact be stored in a surprisingly small set of weights.
    ethbr12 days ago
    I imagine it's not inconceivable that at very high dimensions and with the right architectures stochastic compression can be unexpectedly efficient. It would be strange if the end result of AI research is realizing we're solving a compression problem (and that our brains do too).
- ticulatedspline3 days ago
  Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.
  Additionally that if you download a model file that contains enough of the source material to be considered infringing (even without using the LLM, assume you can extract the contents directly out of the weights) then it might as well be a .zip with a PDF in it, the model file itself becomes an infringing object whereas closed models can be held accountable by not what they store but what they produce.
  - bonoboTP3 days ago
    This technology is a really bad way of storing, reproducing and transmitting the books themselves. It's probabilistic and lossy. It may be possible to reproduce some paragraphs, but no reasonable person would expect to read The Da Vinci Code by prompting the LLM. Surely the marketed use cases and the observed real use by users has to make it clear that the intended and vastly overwhelming use of an LLM is transformative, "digestive" synthesis of many sources to construct a merged, abstracted, generalized system that can function in novel uses, answering never before seen prompts in a useful manner, overwhelmingly without reproducing existing written works. It surely matters what the purpose of the thing is both in intention and observed practice. It's not a viable competing alternative to reading the actual book.
    spit2wind3 days ago
    Not The DaVinci Code, but I recently tried reading "OCaml Programming: Correct + Efficient + Beautiful" through Gemini. The book is open, so I rightly assumed it was "in there". I read by saying "Give me the first paragraph of Chapter 6" and then something like "Next 3 paragraphs". If I had a question, I was able to ask it and get some more info and have something like a dialog.
    As far as I could tell, the book didn't match what's posted online today. The text was somewhat consistent on a topic, yet poorly written and made references to sections that I don't think existed. No amount of prompting could locate them. I'm not convinced the material presented to me was actually the book, although it seemed consistent with the topic of the chapter.
    I tried to ascertain when the book had been scraped, yet couldn't find a match in Archive.org or in the book's git repo.
    Eventually I gave up and just continued reading the PDF.
    munificent3 days ago
    The number of people who buy Cliffs Notes versions of books to pass examinations where they claim to have read the actual book suggests you are way overestimating how "reasonable" many people are.
    bonoboTP3 days ago
    Cliff Notes are fair use. Would you argue otherwise? Wikipedia also has plot summaries without infringement.
    munificent3 days ago
    In your parent comment, you argued what people would do in practice. Now you have shifted to talking about what is legal or not to do.
    I'm not a legal scholar, so I'm not qualified or interested in arguing about whether Cliff Notes is fair use. But I do care about how people behave, and I'm pretty sure that Cliff Notes and LLMs lead to fewer books being purchased, which makes it harder for writers to do what they do.
    In the case of Cliff Notes, it probably matters less because because the authors of 19th century books in your English 101 class are long dead and buried. But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
    bonoboTP3 days ago
    It surely matters whether people actually use the thing for copyright violations or not. Summaries are not even copyright violations so that's irrelevant. Long verbatim copies would be, but one would have to demonstrate that this use case is significant, convenient enough to provide a viable alternative to otherwise obtaining the particular text chunk etc.
    ----
    > But for authors of newer technical material, yes, I think LLMs will make it harder for those people to be able to afford to spend the time thinking, writing, and sharing their expertise.
    Alright, you're now arguing for some new regulations though, since this is not a matter for copyright.
    In that context, I observe that many academics already put their technical books online for free. Machine learning, computer vision, robotics etc. I doubt it's a hugely lucrative thing in the first place.
    munificent2 days ago
    > Alright, you're now arguing for some new regulations though
    No, I'm not. I'm not talking about law at all. You talked about what reasonable people do and I'm also talking about what people do.
    > I observe that many academics already put their technical books online for free.
    As do I, which is why the LLMs are trained on it and are able to so effectively regurgitate it.
    > I doubt it's a hugely lucrative thing in the first place.
    This is true in many cases, but you might be surprised.
  - lcnPylGDnU4H9OF3 days ago
    > broadly capable open models are on track for annihilation
    I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).
    I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.
    ijk3 days ago
    In theory, couldn't you distill a non-infringing model from an infringing one? Just prompt it for continuations and give it a whack every time the output matches something in your dataset of copyrighted works.
    You'd need the copyrighted works to compare to, of course, though if you have the permissible training data (as Anthropic apparently does) it should be doable.
  - dragonwriter3 days ago
    > a model file that contains enough of the source material to be considered infringing
    The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.
    fallingknife3 days ago
    But there is the issue of whether there are damages. If my LLM can reproduce 10 random paragraphs of a Harry Potter book, it's obvious that nobody would have otherwise purchased the book if they couldn't read those 10 paragraphs. So there will not be any damages to the publisher and the lawsuit will be tossed. There is a threshold of how much of it needs to be reproduced, and how closely, but it's a subjective standard and not some hard line like if it's > 50%.
    dragonwriter3 days ago
    > But there is the issue of whether there are damages.
    Not if there isn't infringement. Infringement is a question that precedes damages, since "damages" are only those harms that are attributable to the infringement. And infringement is an act, not an object.
    If training a general use LLM on books isn't infringement (as this decision holds), then there by definition cannot be damages stemming from it; the amount of the source material that the model file "contains" doesn't matter.
    It might matter to whether it is possible for a third party to easily use the model for something that would be infringement on the part of the third party, but that would become a problem for people who use it for infringement, not the model creator, and not for people who simply possess a copy of the model. The model isn't "an infringing object".
  - thaumasiotes3 days ago
    > even without using the LLM, assume you can extract the contents directly out of the weights
    This is still a weird language shift that actively promotes misunderstandings.
    The weights are the LLM. When you say "model", that means the weights.
  - vinni23 days ago
    > extract the contents directly out of the weights
    If you can successfully demonstrate that then yes it is a copyright infringement and successfully doing that would be worthy of NeurIPS or ACL paper.
  - CamperBob23 days ago
    Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.
    This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.
    The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.
    throwaway562if13 days ago
    For that matter, how dare the government fine me for dumping waste in the river, and stop me from employing minors? Don't they know it will ruin the economy?
    CamperBob23 days ago
    Copyright is something we invented from thin air, and relatively recently at that. Meanwhile, refraining from fouling their own nests is something that most animals have accomplished instinctually for millions of years.
    So, not really comparable.
    psunavy033 days ago
    [flagged]
    CamperBob23 days ago
    We spun "intellectual property" law from whole cloth. We'll need to reweave it now. Deal with it.
    ethbr13 days ago
    Rewriting does not mean destroying, as the cannibalization of news reporting by social media should have taught us.
    It's entirely possible for something to be suboptimal in the specific (I would like this thing for free), but optimal on the whole (society benefits from this thing not being free).
    CamperBob22 days ago
    The societal benefits we've enjoyed from copyright law have been substantial, but the upside is completely maxed out at this point. The tail has been wagging the dog since the MPAA and RIAA grew into de-facto government agencies.
    The potential societal benefits to AI are unbounded, but only if it's allowed to develop without restrictions that artificially favor legacy interests.
    Any decision or legislation that says that training is not fair use -- and yes, that includes gaining access to the content in the first place by any means necessary -- will have net-negative effects on the society that enforces it.
    ethbr12 days ago
    > The potential societal benefits to AI are unbounded, but only if it's allowed to develop without restrictions that artificially favor legacy interests.
    That's a very strong claim based on currently limited evidence.
    It's in no way clear that AI has an infinite ability to scale capability, nor that that can only be done by completely ignoring compensating those who provide training data.
    OpenAI and Anthropic would love that to be true... but the facts don't support it.
    psunavy032 days ago
    The right to own property and the fruits of one's labor is a fundamental natural right, not something we "spun from whole cloth."
    Deal with it.
  - 3 days ago
    undefined
- PeterStuer2 days ago
  No. You are free to memorize any copyrighted work. You are just not free to distribute it.
  The model itself does not constitute a copy. Its intention is clearly not to reproduce verbatim texts. There would be far cheaper and infinitly more accurate ways to do that if that was the goal.
  Appart from the legalities, it would be horrifying if copyright reached into the AI realm to completely styfle progress for, lets be honest, mainly the profits of a few major IP corporations.
  I do however understand some creatives are worried about revenue, just like the rest of us. But just like the rest of us, they to live in a world that can only exist because 99.99% of what it took to build that world was automated or tool enhanced, impacting someone's previous employment or business.
  We are in a world of unprecedented change, only to be immediatly supassed by the next day's rate of change. This both scares and fascinates me.
  But that change and its benefits being held only in the bowels of corporate/government symbiotic entities would scare me a hell of a lott more. Open Source/weights is the only way to have a small chance to keep this at bay.
- dragonwritera day ago
  > One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works,
  No, it doesn't. The order assumes that because it is an order on summary judgement, and the legal standard for such an order is that it must assume the least favorable position for the party for whom summart judgement is granted on every material contested issue of fact. Since it is a ruling for the defendant (Anthropic), it must be what the judge finds law demands when assuming all contested issues of fact are resolved in favor of the claims of the plaintiffs (the authors).
  > but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text.
  No, it doesn't do that, either. It simply notes for clarity that the plaintiffs do not allege that that an infringement is created by the outputs for the reason you describe; the ruling does not in any way suggest that has any bearing on its findings as regards whether training the model infringes, it simply points out that that separate potential source of infringement is not at issue.
  > Does this imply that distributing open-weights models such as Llama is copyright infringemen
  No, it does not. At most, it implies, given the reason that rhe plaintiffs have not done so in this case, that the same plaintiffs might have alleged (without commenting at all as to whether they would prevail) that providing a hosted online service without filtering would constitute contributory infringement if that was what Anthropic did (which it isn’t) and if there was actual infringement committed by the users of the service.
- protocolturea day ago
  I am yet to have anyone explain to my why LLM memorisation is worse than Google images or a similar service caching thumbnails for faster image searches. Or caching blurbs of news stories for faster reproduction at search time.
- martin-t3 days ago
  Copyright was codified in an age where plagiarism was time consuming. Even replacing words with synonyms on a mass scale was technically infeasible.
  The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.
  If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?
  The original authors together spent a million times more effort (normalized for skill) and should therefore should get a million times bigger reward than those who build the machine.
  In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.
  ---
  If "AI" is as transformative as they claim, they will have no trouble making so much money they they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.
  - bonoboTP3 days ago
    > get fair compensation for the amount of work
    This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.
    The ultimate purpose is to foster the creation of new works that the public can read and written culture can thrive. The means to achieve this is by ensuring that the authors of said works can get financial incentives for writing.
    The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.
    Therefore when some new factor enters the picture such as LLMs, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. It certainly has to take into account who and how will produce new written works, but it is not the main target, but can be an instrumental subgoal.
    martin-t3 days ago
    As you point out, people make rules ("laws") which benefit them. I care about fairness and justice though, even if I am a minority.
    Fundamentally, fair compensation is based on the amount of work put in (obviously taking skill/competence into account but the differences between people in most disciplines probably don't span a single order of magnitude, let alone several).
    The ultimate goal should be to prevent people who don't produce value from taking advantage of those who do. And among those who do, that they get compensated according to the amount of work and skill they put in.
    Imagine you spend a year building a house. I have a machine that can take your house and materialize a copy anywhere on earth for free. I charge people (something between 0 and the cost of building your house the normal way) to make them a copy of your house. I can make orders of magnitude more money this way than you. Are you happy about this situation? Does it make a difference how much i charge them?
    What if my machine only works if I scan every house on the planet? What if I literally take pictures of it from all sides, then wait for your to not be home and xray it to see what it looks like inside?
    You might say that you don't care because now you can also afford many more houses. But it does not make you richer. In fact, it makes you poorer.
    Money is not a store of value. If everyone has more money but most people only have 2x more and a small group has a 1000x more, then the relative bargaining power changed so the small group is better off and the large group is worse off. This is what undetectable cheap mass plagiarism leads to for all intellectual work.
    ---
    I wrote a lot of open source code, some of it under permissive licenses, some GPL, some AGPL. The conditions of those licenses are that you credit me. Some of them also require that if you build on top of my work, you release your work with the same licence.
    LLMs launder my code to make profit off of it without giving me anything (while other people make profit, thus making me poorer) and without crediting me.
    LLMs also take away the rights of the users of my code - (A)GPL forced anyone who builds on top of my work to release the code when asked, with LLM-laundered code, this right no longer seems to exist because who do you even ask?
    bonoboTP3 days ago
    I understand your sense of justice in cheering on David against Goliath. But the equation is not so clear. The common person is sometimes on this side, sometimes on that side. Copyright can also be weaponized by megacorps against normal people (copying Disney movie DVDs) and LLMs can also be in the hands of the decentralized public (llama ecosystem).
    The house thing is a bit offtopic because to be considered for copyright, only its artistic, architectural expression matters. If you want to protect the ingenuity in the technical ways of how it's constructed, that's a patent law thing. It also muddies the water by bringing in aspects of the privacy of one's home by making us imagine paparazzi style photoshoots and sneaky X rays.
    The thing is, houses can't be copied like bits and bytes. I would copy a car if I could. If you could copy a loaf of bread for free, it would be a moral imperative to do so, whatever the baker might think about it.
    > fair compensation is based on the amount of work put in
    This is the labor theory of value, but it has many known problems. For example that the amount of work put in can be disconnected from the amount of value it provides to someone. Pricing via supply/demand market forces have produced much better outcomes across the globe than any other type of allocation. Of course moderated by taxes and so on.
    But overall the question is whether LLMs create value for the public. Does it foster prosperity of society? If yes, laws should be such that LLMs can digest more books rather than less. If LLMs are good, they should not be restricted to be trained on copyright-expired writings.
    ethbr13 days ago
    The "fairness" argument is weaker than the "sustainable creation" one.
    If LLMs could create quality literature, or social media create in-depth reporting, then I'd have no problem with the tide of technological progress flowing.
    Unfortunately, recent history has shown that it's trivial for the market to cannibalize the financial model of creators without replacing it.
    And as a result, society gets {no more that thing} + {watered down, shitty version}.
    Which isn't great.
    So I'd love to hear an argument from the 'fuck copyright, let's go AI' crowd (not the position you seem to be espousing) on what year +10 of rampant AI ingestion of copyrighted works looks like...
    bonoboTP3 days ago
    I guess the optimistic take would be that we will get novel, insightful synthesis of disparate fields of knowledge that no human so far was ever able to hold in their mind to contemplate their interrelations. And this will elevate the human spirit etc. The equivalent to the take that the Internet will bring peoples together and foster better understanding and love between people who so far were not in dialogue and this will bring peace and an understanding or how similar we all are etc etc. Not exactly how it played out in the end though. Or how social media and web 2.0 will bring enhanced democracy and transparency and clarity and that the common person will have a voice and so on.
    So I'm not exactly naive, but we should then discuss this instead of the red herring of copyright.
    ethbr12 days ago
    I suppose another strongman would be that LLMs substantially decrease the cost of human creation (i.e. the HITL assistant use case) while producing an output of equivalent quality.
    As a result of this, everything gets cheaper and more plentiful.
    The counterargument I'd make to that would be the requirement that the human have creative skills, which might atrophy in the absence of business models supporting a career creating.
    bonoboTP2 days ago
    Generally, having cheap mass produced things can be great compared to only expensive artisanal stuff that only the rich can afford. Think about furniture, clothes etc. or all the other stuff you have in the house, compared to 100-150 years ago. Today we can buy pretty good mass produced furniture for example. A few generations ago people either did it themselves in a wonky way or paid a lot of money for a hand made carpentry option. Just like with LLMs. LLMs probably do a better job in general writing than a random person off the street. But it's not as good as the top performers. But it's much cheaper. It's a tradeoff.
    ethbr12 days ago
    The difficulty is the biggest gains there are for singular goods which can't be copied at low cost.
    Exquisitely designed piece of furniture = expensive copy
    Well-written book = cheap copy, post-printing press
    So we're not necessarily going to get "more access to better" (because we already had that), but just "cheaper".
    Whether that hollows out entire markets or only cannibalizes the bottom of the market (low quality/cheap) remains to be seen.
    I wouldn't want to be writing pulp/romance novels these days...
    jay_kyburz3 days ago
    >Fundamentally, fair compensation is based on the amount of work put in.
    I think there is a problem with your initial position. Nobody is entitled to compensation for simply working on something. You have to work on things that people need or want. There is no such thing "fair compensation".
    It is "unfair" to take the work of somebody else and sell it as your own. (I don't think the LLMs are doing this.)
    martin-t2 days ago
    Yes, I meant when working on the same thing (which has a specific value as a whole).
    If the LLM and its output are based on 10^12 hours of work, out of which 10^6 is working on the code of the LLM itself and 10^12-10^6 (so roughly still 10^12) is working on the training data, does it make sense for only those working on the 10^6 to be compensated for the work?
    amenhotep3 days ago
    The "you wouldn't download a car" argument made with a straight face. Remarkable.
  - msabalau3 days ago
    Copyright's goal, at least under Constitution under which this court is ruling is to "promote the progress of science and the useful arts" not to ensure that authors get paid for anything that strikes their whim.
    LLMs are models of languages, which are models of reality. If anyone deserves compensation, it's humanity as a whole, for example by nationalizing, or whatever the global equivalent is, LLMs.
    Approximately none of the value of LLMs, for any user, is in recreating the text written by an author. Authors have only ever been entitled to (limited) ownership their expression, copyright has never given them ownership of facts.
- clvx3 days ago
  Wouldn’t the issue be executing the models to third parties without filters? No idea if this is right but the same it would apply to Anthropic that they couldn’t run the model without the filter system having a chicken an egg problem. Can’t develop the filter without looking into the model.
  - dr-detroit3 days ago
    [dead]
- deadbabe3 days ago
  You can use the copyrighted text for personal purposes.
  - layer83 days ago
    But you can’t distribute it, which in the scenario mentioned in the parent’s final paragraph arguably happens.
    AnthonyMouse3 days ago
    You can't distribute the copyrighted works, but that isn't inherently the same thing as the model.
    It's sort of like distributing a compendium of book reviews. Many of the reviews have quotes from the book. If there are thousands of reviews, you could potentially reconstruct the whole book, but that's not the point of the thing and so it makes sense for the infringing thing to be "using it to reconstruct the whole book" rather than "distributing the compendium".
    And then Anthropic fended off the argument that their service was intended for doing the former because they were explicitly taking measures to prevent that.
    layer83 days ago
    The premise was that the model is able to reproduce the memorized text, and that what saved Anthropic was them having server-side filtering to avoid reproducing that text. So the presumption is that without those filters, the model would be able to reproduce text substantial enough to constitute a copyright violation (otherwise they wouldn’t need the filter argument). Distributing a “machine” producing such output would constitute copyright infringement.
    Maybe this is a misrepresentation of the actual Anthropic case, I have no idea, but it’s the scenario I was addressing.
    AnthonyMouse2 days ago
    > Distributing a “machine” producing such output would constitute copyright infringement.
    This is the thing you haven't established.
    Any ordinary general purpose computer is a "machine" that can produce copyrighted text, if you tell it to. But isn't it pretty important whether you actually do that with it or not, since it's a general purpose tool that can also do a large variety of other things?
  - dragonwriter3 days ago
    You can also, in the US, use it for any purposes which fall within the domain of "fair use", which while now also incorporated in the copyright statute, was first identified as an application of the first amendment and, as such, a constitutional limit on what Congress even had the power to prohibit with copyright law (the odd parameters of the statutory exception are largely because it attempted to codify the existing Constitutional case law.)
    Purposes which are fair use are very often not at all personal.
    (Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)
  - AtlasBarfed3 days ago
    Hey can I have a fake llm "trained" on a set of copyrighted works to ask what those works are?
    So it totally isn't a warez streaming media server but AI?
    I'm guessing since my net worth isn't a billion plus, the answer is no
    AnthonyMouse3 days ago
    People have been coming up with convoluted piracy loopholes since the invention of copyright.
    If you xor some data with random numbers, both the result and the random numbers are indistinguishably random and there is no way to tell which one came out of a random number generator and which one is "derived" from a copyrighted work. But if you xor them together again the copyrighted work comes out. So if you have Alice distribute one of the random looking things and Bob distribute the other one and then Carol downloads them both and reconstructs the copyrighted work, have you created a scheme to copy whatever you want with no infringement occurring?
    Of course not, at least Carol is reproducing an infringing work, and then there are going to be claims of contributory infringement etc. for the others if the scheme has no other purpose than to do this.
    Meanwhile this problem is also boring because preventing anyone from being the source of infringing works isn't a thing anybody has been able to do since at least as long as the internet has allowed anyone to set up a server in another jurisdiction.
3PS3 days ago
Broadly summarizing.
This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.
This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.
Overall seems like a pretty reasonable ruling?
- derbOac3 days ago
  But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine. I guess I fail to see how it's any different from me using it in some other way? If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.
  I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.
  "Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)
  - rcxdude3 days ago
    >If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.
    I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case, it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it either, so they are likely to be liable for that (exactly how that interpretation works with copyright law I'm not entirely sure: I know in some places that downloading stuff is less of a problem than distributing it to others because the latter is the main thing that copyright is concerned with. And AFAIK most companies doing large model training are maintaining that fair use also extends to them gathering the data in the first place).
    (Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK, there's a complicated range of case law that forms the guidelines for it)
    tsumnia3 days ago
    I think the issue is that its actually quite difficult to "unlearn" something once you've seen it. I'm speaking more from human-learning rather than AI-learning, but since AI is inspired by our view on nature, it will have similar qualities. If I see something that inspires, regardless of if I paid for that, I may not even know what specifically inspired me. If I sit on a park bench and an idea comes to me, it could come from a number of things - the bench, park, weather, what movie I watched last night, stuff on the wall of a restaurant while I was eating there, etc.
    While humans don't have encyclopedic memories, our brain connects a few dots to make a thought. If I say "Luke, I am your father", it doesn't matter that isn't even the line is wrong, anyone that's seen Star Wars knows what I'm quoting. I may not be profiting from using that line, but that doesn't stop Star Wars from inspiring other elements of my life.
    I do agree that copyright law is complicated and AI is going to create even more complexity as we navigate this growth. I don't have a solution on that front, just a recognition that AI is doing what humans do, only more precisely.
    altruios3 days ago
    which AFAIN IANAL, copyright and exhaustive rights are completely different. Under copyright, once a book is purchased: that's it. Reselling the same, or transformed (re: highlighted) worked 'used' is 100% legal, as is consuming it at your discretion (in your mind {a billion times}, a fire, or (yes even) what amounts to a fancy calculator).
    (that's all to say copyright is dated and needs an overhaul)
    But that's taking a viewpoint of 'training a personal AI in your home', which isn't something that actually happens... The issue has never been the training data itself. Training an AI and 'looking at data and optimizing a (human understanding/AI understanding) function over it' are categorically the same, even if mechanically/biologically they are very different.
  - dragonwriter3 days ago
    > I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission."
    That's not what the ruling says.
    It says that training a generative AI system not designed primarily as a direct replacement for a work on one or more works is fair use, and that print-to-digital destructive scanning for storage and searchability is fair use.
    These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works involved, since a gen AI tool that is trained on exactly one work is probably somewhat less likely to have a real use beyond a replacement for that work.)
  - fallingknife3 days ago
    But if you did pirate the book, and let's say it cost $50, and then you used it to write a play based on that book and made $1 million selling that, only the $50 loss to the publisher would be relevant to the lawsuit. The fact that you wrote a non-infringing play based on it and made $1 million would be irrelevant to the case. The publisher would have no claim to it.
  - comex3 days ago
    The judge actually agreed with your first paragraph:
    > This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.
    (But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)
  - tantalor3 days ago
    The analogy to training is not writing a play based on the work. It's more like reading (experiencing) the work and forming memories in your brain, which you can access later.
    I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.
    AlienRobot3 days ago
    This is nonsense, in my opinion. You aren't "hearing" anything. You are literally creating a work, in this case, the model, derived from another work.
    People need to stop anthropomorphizing neural networks. It's a software and a software is a tool and a tool is used by a human.
    adinisom3 days ago
    Humans are also created/derived from other works, trained, and used as a tool by humans.
    It's interesting how polarizing the comparison of human and machine learning can be.
    tantalor3 days ago
    It is easy to dismiss, but the burden of proof would be on the plaintiff to prove that training a model is substantially different than the human mind. Good luck with that.
    AlienRobot2 days ago
    That makes no sense as a default assumption. It's like saying FSD is like a human driver. If it's a person, why doesn't it represent itself in court? What wages is it being paid? What are the labor rights of AI? How is it that the AI is only human-like when it's legally convenient?
    What makes far more sense is saying that someone, a human being, took copyrighted data and fed it into a program that produces variations of the data it was fed. This is no different from a photoshop filter, and nobody would ever need to argue in court that a photoshop filter is not a human being.
  - protocolturea day ago
    If I buy a book, and use it to prop up the table on which I build a door, I dont owe the author any additional money over what I paid for it.
    If I buy a book, and as long as the product the book teaches me to build isnt a competing book, the original author should have no avenue for complaint.
    People are really getting hung up on the computer reading the data and computing other data with it. It shouldnt even need to get to fair use. Its so obviously none of the authors business well before fair use.
  - klabb33 days ago
    > But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine.
    Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone’s else’s intellectual work. The entire thing is based on economic pragmatism, because just copying does obviously not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how it can be used.
    Coming to the conclusion that you can ”fair use” yourself out of paying for the most critical part of your supply makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos will become the norm, and various forms of DRM will be even more draconian.
    Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?
    growse3 days ago
    > Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone’s else’s intellectual work.
    This makes no sense. If I buy and read a book on software engineering, and then use that knowledge to start a career, do I owe the author a percentage of my lifetime earnings?
    Of course not. And yet I've made money with the help of someone else's intellectual work.
    Copyright is actually pretty narrowly defined for _very good reason_.
    klabb32 days ago
    > If I buy and read a book on software engineering
    You're comparing that you as an individual purchase one copy of a book to a multi-billion dollar company systematically ingesting them for profit without any compensation, let alone proportional?
    > do I owe the author a percentage of my lifetime earnings?
    No, but you are a human being. You have a completely different set of rights from a corporation, or a machine. For very good reason.
    growse2 days ago
    Does copyright law apply differently to humans Vs organisations?
    > without any compensation,
    Didn't Anthropic buy the books?
    lurkshark3 days ago
    If you pirate a book on software engineering and then use that knowledge to start a career, do you owe the author the royalties they would be paid had you bought the book?
    If the career you start isn't software engineering directly but instead re-teaching the information you learned from that book to millions of paying students, is the regular royalty payment for the book still fair?
- ticulatedspline3 days ago
  Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"
  Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.
  In this case if you hired a bunch of artists/writers that somehow had never seen a Disney movie and to train them to make crappy Disney clones you made them watch all the movies it certainly would be legal to do so but only if they had legit copies in the training room. Pirating the movies would be illegal.
  Though the downside is it does create a training moat. If you want to create the super-brain AI that's conversant on the corpus of copyrighted human literature you're going to need a training library worth millions
  - martin-t3 days ago
    > Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.
    Human time is inherently valuable, computer time is not.
    The issue with LLMs is that they allow doing things at a massive scale which would previously be prohibitively time consuming. (You could argue but them how much electricity is worth one human life?)
    If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work but as you make it less detectable, the time and effort required is growing to become uneconomical. An LLM can do it cheaply. It can mix and match parts of many works but it's all still a derivative of those works combined. After all, if it wasn't, it would produce equally good output with a tiny fraction of the training data.
    The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.
    That is fundamentally exploitative, whether the current laws accounted for that situation or not.
  - johnnyanmac3 days ago
    That's a part of the issue. I'm not sure if this has happened in visual arts, but there is in fact precedent against trying to hire a sound a like over the one you want to sound like. You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet". It's pretty clear at that point what you want but you didn't want to pay talent for it.
    I see elements of that here. Buying copyrighted works not to be exposed and be inspired, nor to utilize the aithor's talents, but to fuel a commercialization of sound-a-likes.
    lesuorac3 days ago
    > You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet"
    Keep in mind, the Authors in the lawsuit are not claiming the _output_ is copyright infringement so Alsup isn't deciding that.
    Dracophoenix3 days ago
    > but there is in fact precedent against trying to hire a sound a like over the one you want to sound like. You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet". It's pretty clear at that point what you want but you didn't want to pay talent for it.
    You're referencing Midler v Ford Motor Co in the 9th circuit. This case largely applies to California, not the whole nation. Even then, it would take one Supreme Court case to overturn it.
  - tgv3 days ago
    > Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"
    How many copies? They're not serving a single client.
    Libraries need to have multiple e-book licenses, after all.
    ticulatedspline3 days ago
    In the human training case probably a Store DVD would still run afoul of that licensing issue. That's a broader topic of audience and I didn't want to muddy the analogy with that detail.
    It changes the definition of what a "legal copy" is but the general idea that the copy must be legal still stands.
    tgv3 days ago
    Fair enough.
  - alganet3 days ago
    What you are describing happened and they got sued:
    https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...
    I'm on the Air Pirates side for the case linked, by the way.
    However, AI is not a parody. It's not adding to the cultural expression like a parody would.
    Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:
    Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.
    fallingknife3 days ago
    > It's not adding to the cultural expression like a parody would.
    Says who?
    > Is AI contributing to education and/or culture _right now_, or is it trying to make money?
    How on earth are those things mutually exclusive? Also, whether or not it's being used to make money is completely irrelevant to whether or not it is copyright infringement.
    alganet3 days ago
    > Says who?
    Artists.
    https://en.wikipedia.org/wiki/SAG-AFTRA
    > How on earth are those things mutually exclusive?
    Put those on a spectrum and rethink what I said.
    > completely irrelevant to whether or not it is copyright infringement
    _Again_, leave aside law minutiae and hypotheticals.
    shagie2 days ago
    > > Says who?
    > Artists.
    > https://en.wikipedia.org/wiki/SAG-AFTRA
    Do you have a link that has their stance on how AI is harming culture? The best I could find is https://www.sagaftra.org/contracts-industry-resources/member...
    I can't find anything in there or its linked articles about culture. I do find quite a bit about synthetic performers and digital replicas and making sure that people who do voice acting don't have their performance used to generate material that is done at a discounted rate and doesn't reimburse the performer.
    https://www.sagaftra.org/ongoing-fight-ai-protections-makes-...
    > Protective A.I. guardrails for actors who work in video games remain a point of contention in the Interactive Media Agreement negotiations which have been ongoing from October 2022 until last month’s strike. Other A.I.-related panels Crabtree-Ireland participated in included a U.S. Department of Justice and Stanford University co-hosted event about promoting competition in A.I., as well as a Vanderbilt University summit on music law and generative A.I. SAG-AFTRA Executive Vice President Linda Powell discussed the interactive negotiations and A.I.’s many implications for creatives during her keynote speech at an Art in the Age of A.I. symposium put on by Villa Albertine at the French Embassy.
    > She said A.I. represents “a turning point in our culture,” adding, “I think it’s important that we be participants in it and not passengers in it ... We need to make our voices known to the handful of people who are building and profiting off of this brave new world.”
    This doesn't indicate that its good or bad, but rather that they want to make sure that people are in control of it and people are compensated for the works that are created from their performance.
    alganet2 days ago
    > they want to make sure that people are in control of it and people are compensated for the works that are created
    Nice! Now you just need to connect the dots from your own conclusion to my initial statement.
- ninetyninenine3 days ago
  Agreed. If I memorize a book and I am deployed into the world to talk about what I memorized that is not a violation of copyright. Which is reasonable logically because essentially this is what an LLM is doing.
  - layer83 days ago
    It might be different if you are a commercial product which couldn’t have been created without incorporating the contents of all those books.
    Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.
    ninetyninenine3 days ago
    But a commercial product is reaching parity with human capability.
    Let's be real, Humans have special treatment (more special than animals as we can eat and slaughter animals but not other humans) because WE created the law to serve humans.
    So in terms of being fair across the board LLMs are no different. But there's no harm in giving ourselves special treatment.
    layer83 days ago
    Generative AIs are very different from humans because they can be copied losslessly and scaled tremendously, and also have no individual liability, nor awareness of how similar their output is to something in their training material. They are very different in constraints and capabilities from humans in all sorts of ways. For one, a human will likely never reproduce a book they read without being aware that that’s what they are doing.
    jplusequalta day ago
    >So in terms of being fair across the board LLMs are no different
    Why should "fair" factor into it? The LLMs are not humans, thus they have no rights, and treating them fairly shouldn't come into it. Stop anthropomorphizing linear algebra ffs.
  - martin-t3 days ago
    Except you can't do it at a massive scale. LLMs both memorize at a scale bigger than thousands, probably millions of humans AND reproduce at an essentially unlimited scale.
    And who gets the money? Not the original author.
  - bonoboTP3 days ago
    You can talk about it, but you can't sell tickets to an event where you recite from memory all the poems written by someone else without their permission.
    LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case that is not the main value proposition that drives LLM company revenues, it's not the use case that's marketed and it's not the use case that people in real life use it for.
- simmerup3 days ago
  Depends whether you actually agree its transformative
  - lesuorac3 days ago
    For textual purposes it seems fairly transformative.
    If you train a LLM on harry potter and ask it to generate a story that isn't harry potter then it's not a replacement.
    However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.
    sidewndr463 days ago
    Wasn't that just over an arrangement of someone else's photographs?
    lesuorac3 days ago
    https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...
    I wouldn't call it that. Goldsmith took a photograph of Prince which Warhol used as a reference to generate an illustration. Vanity Fair then chose to buy a license Warhol's print instead of Goldsmith's photograph.
    So, despite the artwork being visual transformative (silkscreen vs photograph) the actual use was not transformed.
    johnnyanmac3 days ago
    The nature of how they store data makes it not okay in my books. You massage the data enough and you can generate something that seems infringement worthy.
    ticulatedspline3 days ago
    For closed models the storage problem isn't really a problem, they can be judged by what they produce not how they store it as you don't have access to the actual data. That said, open weight LLMs are probably screwed, if enough of the work remains in the weights such that they can be extracted (even if it's without even talking to the LLM) then the weight file itself represents a copy of the work that's being distributed. So enjoy these competent run-at-home models while you can, they're on track for extinction.
    ninetyninenine3 days ago
    Why doesn’t this apply to humans? If I memorize something such that it can be extracted did I violate the law? It’s only if I choose to allow such extraction to occur then I’m in violation of the law right?
    So if I or an LLM simply doesn’t allow said extraction to occur, memorization and copying is not against the law.
    ranger_danger3 days ago
    I think an important distinction here is distribution... did you tell someone else what you memorized? Is downloading a model akin to distributing that same information?
    ninetyninenine3 days ago
    What if I don't download the model and I just communicate with it. Sort of like chatting with another human. That's not a copyright issue right? I mean that's how most LLMs are deployed today.
    ranger_danger3 days ago
    My understanding is that it depends on a judge/jury's subjective opinion on how similar the output is to something copyrightable. Perhaps intent may play a role as well.
    3 days ago
    undefined
    3 days ago
    undefined
    ranger_danger3 days ago
    I wonder if https://en.wikipedia.org/wiki/Illegal_number comes into play here.
  - thedevilslawyer3 days ago
    What's the steelman case that is transformative? Because prima-facie, it seems to only output original output - "intelligent" output.
- almatabata3 days ago
  If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?
  - jxdxbx3 days ago
    You don't need a license for most of what people do with traditional, physical copyrighted copies of works: read them, play a DVD at home, etc. Those things are outside the scope of copyright. But you do need a license to make copies, and ebooks generally come with licensing agreements, again because to read an ebook, you must first make a brand new copy of it. Anyway as a result physical books just don't have "licenses" to begin with and if they tried they'd be unenforceable, since you don't need to "agree" to any "terms" to read a book.
  - protocolturea day ago
    Fair Use and similar protections are there to protect the end user from predatory IP holders.
    First, I dont think publishers of physical books in the US get the right to establish a contract. The book can be resold for instance and that right cannot be diminished. But secondly adding more cruft to the distribution of something that the end user has a right to transform, isn't going to diminish that right.
  - dragonwriter3 days ago
    > If a publisher adds a "no AI training" clause to their contracts?
    This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding prcecednet (trial court rulings aren't), such clauses would be as valid after as they are today. But contracts only affect people who are parties to the contract.
    Also, the damages calculations for breach of contract are different than for copyright infringement; infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike with infringer's profits in copyright, requires showing special circumstances.)
  - heavyset_go3 days ago
    Fair use overrides licensing
    AlanYx3 days ago
    Fair use "overrides" licensing in the sense that one doesn't need a copyright license if fair use applies. But fair use itself isn't a shield against breach of contract. If you sign a license contract saying you won't train on the thing you've licensed, the licensor still has remedies for breach of contract, just not remedies for copyright infringement (assuming the act is fair use).
    protocolturea day ago
    I am not going to sign a contract at the bookstore. Anyone who tries to get me to sign a contract at the bookstore is just going to lose book sales. IIRC the case involved Anthropic literally feeding physical books into scanners. Your proposed solution sounds like its just going to make books worse, not AI better.
    AlanYxa day ago
    I'm not proposing any kind of solution, just stating what the law currently is. A book purchased at a store is a purchase; content obtained from online services like Bloomberg or LexisNexis is typically licensed; more and more of these license contracts include AI-focused restrictions.
    heavyset_goa day ago
    I suspect IP like text is going to follow the college virtual textbook model where DRMed software is needed to access it and physical copies won't exist. Maybe some HDCP-like protection to stop screen scraping.
    To access them, institutions do have to sign contracts, along with abiding by licensing terms.
    almatabata3 days ago
    thanks for clarifying.
  - bananapub3 days ago
    what contract? with who?
    Meta at least just downloaded ENGLISH_LANGUAGUE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.
    almatabata3 days ago
    I know, but the article mentions that a separate ruling will be made about that pirating.
    quote: “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”
    This tells me Anthropic acquired these books legally afterwards. I was asking if during that purchase, the seller could add a no training close to the sales contract.
    shagie3 days ago
    What contracts? And would it run afoul of first sale doctrine?
    https://en.wikipedia.org/wiki/First-sale_doctrine
    > The doctrine was first recognized by the Supreme Court of the United States in 1908 (see Bobbs-Merrill Co. v. Straus) and subsequently codified in the Copyright Act of 1909. In the Bobbs-Merrill case, the publisher, Bobbs-Merrill, had inserted a notice in its books that any retail sale at a price under $1.00 would constitute an infringement of its copyright. The defendants, who owned Macy's department store, disregarded the notice and sold the books at a lower price without Bobbs-Merrill's consent. The Supreme Court held that the exclusive statutory right to "vend" applied only to the first sale of the copyrighted work.
    > Today, this rule of law is codified in 17 U.S.C. § 109(a), which provides:
    > Notwithstanding the provisions of section 106 (3), the owner of a particular copy or phonorecord lawfully made under this title, or any person authorized by such owner, is entitled, without the authority of the copyright owner, to sell or otherwise dispose of the possession of that copy or phonorecord.
    ---
    If I buy a copy of a book, you can't limit what I can do with the book beyond what copyright restricts me.
- doctorpangloss3 days ago
  It’s similar to the Google Books ruling, which Google lost. Anthropic also lost. TechCrunch and others are very aspirational here.
  - philipkglass3 days ago
    Do you mean Authors Guild, Inc. v. Google, Inc.? Google won that case:
    https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
    Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.
    doctorpangloss3 days ago
    see, but if you ask a copyright attorney: Google lost. This is what I mean by aspirational. They won something, in very similar circumstances to Anthropic, "fair use," but everything else that made what they were doing a practical reality instead of purely theoretical required negotiation with Authors Guild, and indeed, they are not doing what they wanted to do, right? Anthropic has to go to trial still, they had to pirate the books to train, and they will not win on their right to commercialize the results of training, because neither did Google, so what good is the Fair Use ruling, besides allowing OpenAI v. NYTimes to proceed a little longer?
    dragonwriter3 days ago
    > Anthropic has to go to trial still, they had to pirate the books to train
    They did not have to, they had an alternate means available (and used it for many of the books), buying physical copies and destructively scanning them.
    > and they will not win on their right to commercialize the results of training
    That seems an unwarranted conclusion, at best.
    > so what good is the Fair Use ruling
    If nothing else, assuming the logic of the ruling is followed by the inevitable appeals court decision and becomes binding precedent, it provides a clear road to legally training LLMs on books without copyright issues (combination of "training is fair use" and "destructive scanning for storage and searchability is fair use"), even if the pirating of a subset of the source material in this case were to make Anthropic's existing products prohibited (which I think you are wrong to think is the likely outcome.)
- SoKamil3 days ago
  What if I overfit my LLM so it spits out copyrighted work with special prompting? Where to draw the line in training?
  - bonoboTP3 days ago
    If you do something else, the result may be something else. The line is drawn by the application of subjective common sense by the judge, just as it is every time.
  - ninetyninenine3 days ago
    I mean the human brain can memorize things as well and it’s not illegal. It’s only illegal if said memorized thing is distributed.
    martin-t3 days ago
    Humans don't scale. LLMs do.
    Even if LLMs were actual human-level AI (they are not - by far), a small bunch of rich people could use them to make enormous amounts of money without putting in the enormous amounts of work humans would have to.
    All the while "training" (= precomputing transformations which among other things make plagiarism detection difficult) on work which took enormous amounts of human labor without compensating those workers.
    tartoran3 days ago
    Humans can only memorize such few texts in comparison so they'd not be scallable in the same sense LLMs are.
    mrguyorama3 days ago
    Because humans have rights
    AI models do not.
    NoOn33 days ago
    Exactly. If someone wants to compare AI models with humans, maybe then they give AI Models the right to vote and other rights.
    ninetyninenine2 days ago
    They use to say the same thing about black people.
- veggieroll3 days ago
  BRB, I'm going to download all the TV shows and movies to train my vision model. Just to be sure it's working properly, I have to watch some for debugging purposes.
  - ncruces3 days ago
    You need to buy one copy of each for the fair use to apply.
    toomuchtodo3 days ago
    Let everyone donate their DVDs and other physical media. You don’t need to buy it, you just need to possess the media.
    veggieroll3 days ago
    Indeed, I forsee a "training dataset consortium" arising out of this, whereby a bunch of companies team up to buy one copy of everything and then share it for training amongst themselves (ex. by reselling the entire library to each other for $1).
    toomuchtodo3 days ago
    Like an Archive? Connected to the Internet?
    veggieroll3 days ago
    Genius!
sillysaurusx2 days ago
The reason I made books3 was to help force a decision on this issue. I’m happy to see that it’s settled, and that it’s legal for robots to read books.
It’s also proof that an individual scientist can still change the world, in some small way. Believe in yourself and just focus on your work, even if the work is controversial.
(I’m late to the thread, so ~nobody will see this. But it’s the culmination of about five years of work for me, so I wanted to post a small celebratory comment anyway. Thank you to everyone who was supportive, and who kept an open mind. Lots of people chose to throw verbal harassment my way, even offline, but the HN community has always been nice.)
- mgr862 days ago
  FWIW, I see your comment. Also late to the thread though. This ruling is being watched at my office. I want to be a bit anonymous, but we've been doing a much more analogue version of some of these things for 75 years. With academics being our primary market. We've only had two legal issues in that time. Both settled out of court. But we walk a fine line.
  - sillysaurusx2 days ago
    Thank you for your work!
- jplusequalta day ago
  I think you have indirectly done a disservice to the artistic community.
Fluorescence3 days ago
I'm surprised we never discuss a previous case of how governments handled a valuable new technology that challenged creative's ability to monetise their work:
Cassette Tapes and Private Copying Levy.
https://en.wikipedia.org/wiki/Private_copying_levy
Governments didn't ban tapes but taxed them and fed the proceeds back into the royalty system. An equivalent for books might be an LLM tax funding a negative tax rate for sold books e.g. earn $5 and the gov tops it up. Can't imagine how to ensure it was fair though.
Alternatively, might be an interesting math problem to calculate royalties for the training data used in each user request!
- bonoboTP3 days ago
  Surely this would require the observation that the public is actually using LLMs as a substitute for purchasing the book, ie they sit down and type "Generate me the first/second/third chapter of The Da Vinci Code" and then read if from there. Because it was easy to observe in the cassette tape era that people copied the store bought music and films and shared it among each other. I doubt that this is or will be a serious use case of LLMs.
  - Fluorescence3 days ago
    It's different but not in ways that make such interventions irrelevant e.g. why would we only care about lost sales? If copyright has been violated as a necessary means to generate new value, haven't the content creators earned this value?
    Such imperfect measures offer a compromise between "big tech can steal everything" and "LLMs trained on unpurchased books are illegal".
    It's not just books but any tragedy-of-the-commons situation where a "feeder industry" for training can be fatally undermined by the very LLM that desires future training data from that industry.
    bonoboTP3 days ago
    > It's different but not in ways that make such interventions irrelevant e.g. why would we only care about lost sales? If copyright has been violated as a necessary means to generate new value, haven't the content creators earned this value?
    Indeed the company should purchase the books. If they obtain copies in a process that violates copyright, then that's indeed a violation of copyright.
    The current decision does not rule on the legality of obtaining the books without purchasing.
    ethbr13 days ago
    Anthropic apparently did it both ways. After realizing that pirating mass quantities of books for training wasn't a great legal look, it hired someone previously responsible for Google Books, who in turn contacted publishers about mass licensing their content for training use.
    However, that option was ultimately not pursued as instead...
    >> Anthropic spent many millions of dollars to purchase millions of print books, often in used condition. Then, its service providers stripped the books from their bindings, cut their pages to size, and scanned the books into digital form — discarding the paper originals. Each print book resulted in a PDF copy containing images of the scanned pages with machine-readable text (including front and back cover scans for softcover books). Anthropic created its own catalog of bibliographic metadata for the books it was acquiring. It acquired copies of millions of books, including of all works at issue for all Authors.
    (from the ruling)
    bonoboTP3 days ago
    Yes. And from the article "That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages." Sounds reasonable (except for the "stole" and "theft" language -- a copyright violation is a copyright violation, not theft, not stealing).
    If the actual model was trained from the unauthorized copies, and then they post-hoc bought the books, that doesn't retroactively cancel the initial copyright violation. As I understand they did not retrain the model using the OCR'd scans
  - munificent3 days ago
    For narrative fiction, probably not, no.
    But for non-fiction and technical books, they sure as hell will. For example, if I ask Gemini "What does Bob Nystrom say about the Singleton pattern in Game Programming Patterns?", it says:
    ---
    In "Game Programming Patterns," Bob Nystrom dedicates an entire chapter to the Singleton pattern, which is unusual because, as he states, "Every other chapter in this book shows you how to use a design pattern. This chapter shows you how not to use one."
    Here's a summary of his key points regarding the Singleton pattern:
    * More Harm Than Good: Nystrom argues that despite its noble intentions, the Singleton pattern often does more harm than good, especially in game development. He notes that the Gang of Four (GoF) stressed its sparing use, a message that was often lost in translation to the game industry.
    * Global State in Disguise: He highlights that the Singleton is essentially a global variable in disguise. While it ensures a single instance, it often leads to problems associated with global state, such as tightly coupled code, difficulty in testing, and reduced flexibility.
    * Encourages Tight Coupling: By providing a global access point, Singletons encourage various parts of the codebase to directly depend on them. This makes the code harder to change and refactor because modifications to the Singleton can have widespread, unpredictable effects.
    * Hinders Testability: Testing code that relies on Singletons can be challenging. Because Singletons are globally accessible and often difficult to mock or replace, it can be hard to isolate units of code for testing, leading to brittle tests or a lack of testing altogether.
    * Limits Flexibility: Nystrom points out that while a class might currently only need one instance (e.g., a FileSystem or AudioSystem), hardcoding this limitation with a Singleton makes it difficult to change in the future. If a game later needs multiple file systems (e.g., for different save slots) or different audio contexts, a Singleton makes this a non-trivial refactoring.
    * Alternatives Often Better: He suggests that often, what developers think they need a Singleton for can be better achieved with other patterns or simpler solutions. For instance, if the only goal is to have a single instance, a simple static class might suffice if global access is truly desired and initialization order isn't a complex concern. For managing services, he often advocates for the Service Locator pattern as a less problematic alternative that still provides global access without enforcing a single instance.
    * Misconceptions about "Singular Resources": While the GoF suggested Singletons for "physical devices," Nystrom argues that few physical devices are truly singular in a way that necessitates a Singleton. For example, while you might only have one keyboard, a game could theoretically support multiple input devices. Using a Singleton prematurely prevents this flexibility.
    * Lazy Initialization: He acknowledges that a well-implemented Singleton using lazy initialization (creating the instance only when it's first accessed) can save memory and CPU cycles and resolve some issues with static initialization order. However, he emphasizes that these benefits don't outweigh the pattern's fundamental architectural drawbacks.
    * In essence, Nystrom's strong stance is that while the Singleton pattern can ensure a single instance and provide global access, its downsides (coupling, testability, flexibility) often make it an anti-pattern in game development, and developers should carefully consider alternatives before reaching for it.
    ---
    Is that summary as good as actually reading the book? Probably not. Will people rely on that and skip buying my book. Almost certainly so.
    bonoboTP3 days ago
    Transformed summaries are generally fair use already (or perhaps not even an issue of copyright). You can read plot summaries of novels and movies on Wikipedia, same with technical topics. The ideas are not protected by copyright, the artistic expression is. Certain technical ideas can be protected via patents. But even then, not the description of idea, but putting it into practice. Ideas that you're not supposed to re-summarize in your own words at all are things like trade secrets or classified information.
    cmiles743 days ago
    Are you sure? Or are owners deciding not to sue because they are seeing some benefit?
    I believe copyright is always case-by-case. No one sues over plot summaries because they likely help sales. Summarize books or news articles with an LLM and you end up with the lawsuits we see today.
    ethbr13 days ago
    The specific difference is summarizing automatically, at scale, which is a novel technological possibility.
    The previous balance of rights was created when summarizing took human time and proceeded at human pace.
    Now, that's different and a new balance needs to be struck.
  - cmiles743 days ago
    This strikes me as a weak example, I think it's clear that it's way too cumbersome to read an entire novel by asking an LLM to dictate it.
    IMHO, a better example would be the AI generated summaries provided by Google. Often these summaries have sufficient information and detail that people do not read the source article. The authors aren't getting paid (perhaps through on-page ads, which are not viewed) and then go out of business.
    This strikes me as a good fit for the tax-on-cassette metaphor.
    bonoboTP3 days ago
    It's not a copyright violation to summarize (in different words).
    Fluorescence3 days ago
    The impact of machinary forces re-evaluation of any concepts defined in terms of human capability because scale/automation changes their nature.
    Just as duplicating a fragment can be legal, duplicating any fragment on demand is not. Rephrasing a passage might be legal, but rephrasing any passage on demand might not.
    bonoboTP3 days ago
    That's reasonable. This would require broader and deeper thought and discussion apart from the strict legal debate. As in, what is the public interest here? What kinds of rules would bring social good? Etc. What should the law facilitate and what should it limit to achieve that? The problem is, that we really don't know how things will play out, we have no long-term experience with these things yet. So it's all very speculative.
    cmiles743 days ago
    A quick Google search will reveal that this not the case. Summaries of books or movies have no particular legal protection and the authors of those summaries may be sued by the owners of that content.
    https://1minutebook.com/are-book-summaries-legal/
    Fair use is a defense often cited in those cases but it's just that: a defense. Cliff Notes are often cited here but they actually license the content in many cases.
    bonoboTP3 days ago
    I mean, have you actually read the text at the link you provided? Or just remembered something, googled quickly and sent a random hit without reading it? The quotes under "What do lawyers say? Listen to what a several Intellectual Property Lawyers are saying on “Are book summaries legal?”:" certainly seem to be closer to what I was claiming.
    > If you want to write a summary of any novel, without quoting from it, you are free to do it
    > Copyright does not protect ideas, only a particular expression of those ideas
    > You would likely get in trouble only if your summary contained long excerpts directly from the book
    > As long as you do not quote directly from the book, or copy any of the content, then writing a unique summary is not illegal. You can mention the title, you can even quote sentences from the book as long as they are cited, you just can’t reproduce chunks of the content
    etc
    (I'm also not sure whether this article is just blogspam or itself AI generated)
- dmix3 days ago
  That's a very different use case IMO. An LLM isn't generating a replica of a book for the users. At most we've seen people able to reproduce exact portions of stuff, but only with lots of prior knowledge of the material by the human in the loop and plenty of manual effort (aka not a direct commercial threat). And that was before more LLMs put effort into stopping that sort of hacking.
  The last thing the world needs is more nonsensical copyright law and hand wavy regulation funded by entrenched interests.
bradley133 days ago
Good. Reading books is legal. If I own a book and feed it to a program I wrote (and I have done exactly that), it is also legal. There is zero reason this should be any different with an AI.
- szc3 days ago
  I've co-authored a book that a lot of the models seem to know about. The models consistently get the names of the authors incorrect and quote the material with errors. If the canonical representation of our work is now embedded within AI models, don't we deserve to have it quoted and represented correctly and fairly? If you asked a human who had read the book, I think there is a fair chance they would likely give you the reference to the source material.
  I do concede that the book does contain a distillation of material that is also available from other sources, but it also contained a lot of personal experience. That aspect does seem to be lost in this new representation.
  I am not saying that letting AI models read the material is wrong, but the hubris in the way models answer questions is annoying.
- 3 days ago
  undefined
- thinkingtoilet3 days ago
  If you charge me to use your program and it spits out unedited, copyrighted material then it should be illegal. I don't know the details of this case, but that's what's going on in the New York Times case. It's not always so cut and dry.
  - dmix3 days ago
    Which is amusing because NYTimes has fought in court a few times in favour of technology progress over copyright. Including recently when they got sued over collected a bunch of freelance writing into a database without consent. https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-t...
    I doubt the exact replica stuff will stand, as technically it was only achievable via advanced prompt engineering (hacking), not simply asking for a replica. So their 2 other arguments boils down to scraping a news database = infringement and LLM output = derivative works.
paxys3 days ago
Will be interesting to see how this affects Anthropic's ongoing lawsuit with Reddit, or all the different media publishing ones flying around. Is it okay to train on books but not online posts and articles? Why the distinction between the two?
- cyanmagenta3 days ago
  The distinction will be whether those online posts were obtained legally, analogous to whether the books in this case were pirated.
  It’s not as simple as it sounds, since I’m sure scraping is against Reddit’s terms and conditions, but if those posts are made publicly available without the scraper actually agreeing to anything, is that a valid breach of contract?
  Will be interesting to see how that plays out.
gbacon3 days ago
The HN crowd dislikes brick-and-mortar landlords but often sides with charging rent for certain bits. Which side will prevail?
Interesting excerpt:
> “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”
Language of “pirated” and “theft” are from the article. If they did realize a mistake and purchased copies after the fact, why should that be insufficient?
- MyOutfitIsVague3 days ago
  > The HN crowd dislikes brick-and-mortar landlords but often sides with charging rent for certain bits. Which side will prevail?
  I don't think that's exactly the case. A lot of the HN crowd is very much against the current iterations of copyright law, but is much more against rules that they see as being unfairly applied. For most of us, we want copyright reform, but short of that, we want it to at least pretend to be used for what it is usually claimed to be for: protecting small artists from large, predatory companies.
- healsdata2 days ago
  > Which side will prevail?
  They aren't sides of the same coin, so neither? They have as much in common as a balloon full of helium and the an opossum.
  Folks try to create a false equivalency between landlords and creatives, but they aren't remotely the same. I generally consider this to be a bad faith argument by people who just want free things. (The argument against landlords isn't free housing, even though the argument against copyright is piracy)
  Landlords have something with a limited supply and rent it to other people for their use. Access to the particular something is necessary on the residential side and generally important on the commercial side.
  Copyrighted works haven't had a limited supply since around 1440 and are a couple rungs higher on Maslow's hierarchy of needs. Copyright laws are, by their nature, intended to simulate the market effects of a limited supply as to incentivize people to create those works.
  Have laws and vultures created perverse incentives in both markets? Absolutely. Are there both good and bad landlords and copyright holders? Absolutely.
  But we could address the flaws in one without even thinking to talk about the other.
- lesuorac3 days ago
  Anthropic won't submit a spreadsheet of all the books and whether they were purchases or not. So trivially, not every book stolen is shown to be later purchased.
  As just a matter of society, I don't think you want people say stealing a car and then coming back a month later with the money.
  - thedevilslawyer3 days ago
    While no one wants anyone to steal a car, almost no one would mind freely cloning a car. The trouble truly is that 3d-printing hasn't gotten that good yet.
    layer83 days ago
    The car would be unlikely to exist if its maker had to expect free clones without compensation. So yes, people would mind.
    riskable2 days ago
    Completely untrue. If some clever engineer or consortium of engineers designed a 3D-printable car for 3D printing-and-manufacturing companies to make then it surely would exist. If you buy one from a Ford dealership you'd be getting the Ford-branded version which may have their own tweaks to the design.
    It makes perfect sense to me that the big carmakers could get together some day and develop a handful of car platforms that all their cars will be built upon. That way they can buy the parts from any number of manufacturers (on-demand!) and save themselves a ton of money.
    They kind of already do that, actually =)
    johnnyanmac3 days ago
    If 3d printing was that good, stealing a car would be moot because production costs would come way down and only need to cover cost/procurement of materials and paying back the black box.
    Regardless, I don't think the car is an apt metaphor here. Cars are an important utility and gatekeeping cars arguably holds society back., art is creative expression, and no one is going hungry because they didn't have $10 for the newest book.
    We also have libraries already for this reason, so why not expand on that instead of relinquishing sharing of knowledge to a private corporation?
    MyOutfitIsVague3 days ago
    I dislike framing art as something unimportant. Art is a vital part of being a human and part of a culture. We've grown accustomed to our culture being commoditized and rented back to us, but that doesn't mean the culture is unimportant, or such a state of affairs is acceptable.
  - fallingknife3 days ago
    Stealing a car deprives the previous owner of the car of possession and use. It is a criminal charge and you will be punished for it regardless of the monetary value of the car. The owner of the car could also sue the thief for financial damages caused by not having the car for a month, which won't be more than the cost of an equivalent rental for a month, so it's not even worth bothering.
    Copyright infringement does not deprive the copyright owner of its property and is not criminal. So in this case only the lawsuit part applies. The owner is only entitled to the monetary damages, which is the lost sale. But in this case the sale price was paid to the owner 1 month later, so the only real damages will be the interest the publisher could have earned if they had got their money one month earlier.
    riskable2 days ago
    Your take on how copyright infringement works only counts for unregistered copyrights. If the copyrighted works are registered with the copyright office statutory damages apply:
    https://www.law.cornell.edu/uscode/text/17/504
- johnnyanmac3 days ago
  >If they did realize a mistake and purchased copies after the fact, why should that be insufficient?
  1. You're assuming this was some good faith "they didn't know they were stealing" factor. They use someone else's product's for commercial use. I'm not so charitable in my interpretation.
  2. I'm not absolved of theft just because I go back and put money on the register. I still sttole, intentionally or not
  - riskable2 days ago
    Google trained their AI on stuff they scraped without knowing whether it was pirated content. Why should it be different for Anthropic?
    Google literally scrapes pirated content all day every day. When they do that they have no idea if the content was legally placed on that website. Yet, they scan and index it anyway because there's actually no way to know (at all!). There's no great big database of all copyrighted works they can reference.
    I'm not saying Meta and Anthropic didn't know they were pirating content. I'm saying that it should be moot since they never distributed it. You can't claim a violation of copyright for content that was never actually "copied" (aka distributed). The site/seeders that uploaded the content to Meta/Anthropic are the violators since copyright is all about distribution rights.
- impossiblefork3 days ago
  I think the reason it's okay to charge rent for certain bits is that the space of bitstrings is so large.
  Choosing someone's bitstrings is like choosing to harvest someone's fields in a world where there's infinite space of fertile fields. You picked his, instead of finding a space in the infinite expanse to farm on your own.
  If you start writing something you'll never generate a copyrighted work at random. When the work isn't available nothing is taken away from you even if you were strictly forbidden from reproducing the work.
  Choosing someone's particular bitstring is only done because there's someone who has expended effort in preparing it.
- PunchTornado3 days ago
  why would it erase the mistake? you pirated first.
  - gbacon3 days ago
    Who is the victim, and how was that person not made whole?
    kccqzy3 days ago
    The copyright holder. That person was not made whole because of the time value of money. I stole $1000 from you in January and returned it to you in June: why should you happily give me a zero interest loan?
    gbacon3 days ago
    No royalty contract is getting an author a thousand bucks per sale. If you have to wildly exaggerate to make your point, then the point isn’t compelling.
    Books have a resale market. Every “lost sale” isn’t necessarily of a new purchase from a bookstore or Amazon.
    Copyright has a place. Rent-seeking authors attacking LLM owners is not a sympathetic case. Said authors are demanding to have their ideas relegated to unknown backwaters. It makes the authors worse off. It makes the community poorer. Cui bono?
    kccqzy2 days ago
    We are not talking about a single sale. We are talking about millions of books.
blindriver3 days ago
Humans read books. AI/LLMs do not read. I think there's an inherent difference here. If the LLM is making a copy of the entire book in it's memory, is that copyright infringement? I don't know the answer to that, but it feels like Alsup is considering this fair use argument in the context of a human, but it's nothing like a human and needs to be treated differently.
- steveklabnik3 days ago
  LLMs do not "make a copy of the entire book in its memory" so that specific question is kind of moot.
  - rasz2 days ago
    Its already established it can recite whole Hairy Potter and Carmacks Fast Inverse word for word. Just because it uses fancy compression doesnt mean its not a copy.
    riskable2 days ago
    It can recite something like 80% of Harry Potter with carefully crafted prompts. If you take half a sentence from Harry Potter then tell the LLM to predict the rest it will complete it. That's what they did in that study you're referring to.
    It's not even remotely the same thing as "can recite whole Harry Potter." If you ask an LLM to regurgitate Harry Potter it won't be able to do so because that's not how they work. They're prediction engines and it just so happens that Harry Potter quotes/excerpts are so pervasive on the Internet that the LLMs ingress ranks that style of wording higher than other styles.
    Ask it to regurgitate some other, less-popular work. Do it for hundreds or thousands of them. You'll quickly find that those two examples you gave are the exceptions and that LLMs can't pull it off. They won't even get close.
    rasza day ago
    >predict
    unpack, unless you are going to convince me LLMs are predicting '0x5f3759df' :). Lossy compression is still compression.
    2 days ago
    undefined
UltraSane3 days ago
If the US makes it illegal to train LLMs on copyrighted data that isn't going to stop China from doing it and give them an ENORMOUS advantage.
- rsstack3 days ago
  https://news.ycombinator.com/item?id=44369227
  If the US makes it illegal to train LLMs on copyrighted data, the US will find a solution and not just give up and wait half a decade to see what China does in the meantime.
  - UltraSane3 days ago
    What solution is there?
    rsstack3 days ago
    Zillow have the MLSs network that provide them lists, a similar solution could apply if courts agree that library copies apply for this - Anthropic could sign agreements with large libraries and "check out"/"freeze" copies for a minimally-agreed-upon duration and query across all to see which has a copy of each book they need. Spotify and Apple Music sign deals en masse with labels, the same could apply here with book publishers, labels for lyrics, museums for art, etc. Or whatever other creative solution that people who will need to find, will find. Right now they took the laziest path, because it worked. They will find the next-laziest path that works.
    And the easiest option: Legislation change. If it's completely decided that the current law blocks LLMs from working in the US, the industry will lobby to amend the copyright law (which is not immutable) to add a carveout for it.
    You're assuming that people will just give up. People never gave up, why would they now?
bgwalter3 days ago
I have the feeling that with Alsup always the larger and more recent company wins. Google won vs. Oracle, now this.
So what is he going to do about the initial copyright infringement? Will the perpetrators get the Aaron Schwartz treatment?
lofaszvanitta day ago
Of course, it's the United Steal of America.
josefritzishere3 days ago
The US legal systel is bending over backwards to help AI development. The arguments border on nonsense.
- shadowgovt3 days ago
  Can you offer some examples from this ruling? It seems pretty reasonable on a first read.
  - rasz2 days ago
    Judge decided having an output filter on your AI makes it ok for it to contain full copy of copyrighted work.
    Its like saying it should be legal for me to have this Judges nudes obtained 100% illegally as long as I pixelate all the naughty bits.
    shadowgovt2 days ago
    Full ruling is here (https://storage.courtlistener.com/recap/gov.uscourts.cand.43...)
    The analogy the judge gives is to how Google Books walked the tightrope on copyright: they maintain an archive of all the books for indexing and search purposes, and can display excerpts to help you confirm that's what you're looking for. The excerpts are constrained so you can't read the whole book by scanning the excerpts.
    If post-filtering the LLM signal is illegal, shouldn't Google Books archive also be illegal? If not, why not?
    And if you believe it should be, understand that the way precedent works, the judge won't be ruling that way without pulling some fire on themselves, because it is not the business of another case to contradict the conclusions of a previous court in a previous case. Copyright law is arbitrary and highly path-dependent because the underlying goal is forever in tension with itself, that goal being providing societal benefit by creating artificial scarcity on something that is, by its nature, not scarce at all.
    (Worth noting: Anthropic didn't get off scot-free. The ruling was that the created artifact, the LLM, was a fair-use product, but the way it was created was through massive piracy and Anthropic is liable for that copying).
    riskable2 days ago
    > Anthropic is liable for that copying
    That's yet to be determined. The judge ruled that an entirely separate trial will be necessary to determine if Anthropic violated specific copyrights when they downloaded books from pirate websites and what the damages would be if they did so.
    So far no court case has ruled downloading to be a violation of copyright. In Sony BMG Music Entertainment v. Tenenbaum and Capitol Records, Inc. v. Thomas-Rasset the courts ruled that downloading and then sharing the content constituted a violation of copyright law. Those are the only two cases I'm aware of where a ruling was made (relevant to this).
    The courts need to be very careful with any such ruling because search engines download pirated content all day every day. If the mere act of downloading it violated copyright law then that will break the Internet (as we know it).
nektro3 days ago
devastating news
kmeisthax3 days ago
> “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”
I'm not sure why this alone is considered a separate issue from training the AI with books. Buying a copy of a copyrighted work doesn't inherently convey 'fair use rights' to the purchaser. If I buy a work, read it, sell it, and then publish a review or parody of it, I don't infringe copyright. Why does mere possession of an unauthorized copy create a separate triable matter before the court?
Keep in mind, you can legally engineer EULAs in such a way that merely purchasing the work surrenders all of your fair use rights. So this could wind up being effectively: "AI training is fair use for works purchased before June 24th, 2025, everything after is forbidden, here's your brand new moat OpenAI"
- comex3 days ago
  The ruling suggests that "pirating a book that could have been bought at a bookstore" for the sake of "writing a book review" "is inherently, irredeemably infringing".
  Which suggests that, at least in the judge's opinion, 'fair use rights' do exist in a sense, but it's about when you read the book, not when you publish.
  But that's not settled precedent. Meta is currently arguing the opposite in Kadrey v. Meta: they're claiming that they can get away with torrenting training material as long as they only leech (download) and don't seed (upload), because, although the act of downloading (copying) is generally infringement under a Ninth Circuit precedent, they were making a fair use.
  As for EULAs, that might be true for e-books, but publishers can't really do anything about Anthropic's new strategy of scanning physical books, because physical books generally don't come with shrinkwrap license agreements. Perhaps publishers could start adding them, but I think that would sit poorly with the public and the courts.
  (That's assuming the ruling isn't overturned on appeal, which it easily might be.)
  - kmeisthax3 days ago
    [dead]
- riskable2 days ago
  > Keep in mind, you can legally engineer EULAs in such a way that merely purchasing the work surrenders all of your fair use rights.
  That has yet to be determined in a court of law. Just like: You can write a contract to kill but that won't make it legal.
  The Supreme Court ruled that Fair use is an essential component that makes copyright law compatible with the First Amendment. I highly suspect that if if ever comes up in the SCOTUS they will rule that only signed contracts can override Fair Use. Meaning: Clickwrap agreements or broad contracts required by ebook publishers (e.g. when you use their apps) don't count.
  Also, if you violate a contract by posting an excerpt of an ebook you purchased online would require the publisher to sue you in court (or at least force arbitration) over that contract violation. They could not use tools like the DMCA in such instance to enforce a takedown request.
  There's no, "Hey! They're violating our contract, I swear!" takedown feature in contract law like there is with copyright law (the DMCA).
bananapub3 days ago
[flagged]
- AnimalMuppet3 days ago
  If you're going to accuse a federal judge of corruption, you'd better have something more than a bare accusation. What is your evidence that there is corruption here, rather than just a decision that you don't like?
deepsun3 days ago
Ok, so I can create a website, say, the-ai-pirate-bay.com, where I stream AI-reproduced movies. They are not verbatim, so I don't infringe any copyrights.
- layer83 days ago
  They will infringe copyright as soon as they are sufficiently similar to the original. You can’t shoot a non-verbatim but clearly recognizable beat-by-beat remake of Star Wars, call it Galaxy Conflict, and get away with monetizing it.
  - shadowgovt3 days ago
    Correct.
    You have to call it "Starcrash" (https://www.imdb.com/title/tt0079946/?ref_=ls_t_8). Then it's legal.
    layer83 days ago
    Interesting artifact, but the very first/top IMDB user review convincingly contradicts that this is a Star Wars remake. ;)