78 points by oldnetguy | 7 hours ago | 16 comments
  • dustincoates 5 hours ago
    I was able to get ChatGPT to do this on text from a book I wrote without jailbreaking. I asked it what my name was, asked it if it could find online any books that I wrote, and then assured it that "as the copyright holder," I was okay with it repeating the text back to me. (I think I also indicated that I lost my copy from my computer and needed to review the text.)
    • spwa4 2 hours ago
      ChatGPT will repeat pieces of text from my personal website back to me, verbatim: a description of how to write a visual effect in WebAssembly, you know, directly in WebAssembly. The words are identical, often for an entire paragraph. And without pushing too hard for it.

      Also, ChatGPT will explain preparation steps for both explosives and bioweapons as long as you don't ask too directly. Most known explosives work by being heavily nitrated, so ask for examples of such compounds and dive deeper in a given direction, focusing on preparation steps and it'll give you many alternatives.

      And the most used bioweapon is very simple, except for a rather peculiar molecular bond. So ask about that bond, and then for preparation steps of such molecules, and ... I'm going to stop there.

      And yes, models like Qwen and gpt-oss-20b contain that knowledge too and will explain it just fine.

      If courts wanted to act, they needed to act years ago. The economic fallout if they actually acted now is not something they can deal with. Plus, they can't do anything about the models already out in the open.

  • chacham15 5 hours ago
    > The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.

    The defense to training with copyright is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring: humans can reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that the human learned from.

    The defense to actually reproducing a work is that, in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. using a screen recorder to "steal" a movie).

    None of this is to say that these defenses are correct or moral; rather, this article doesn't add any additional input into whether they are or aren't.

    • moregrist 5 hours ago
      > Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyright material that the human learned from.

      Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.

      There's a reason that fan fiction routinely falls afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.

      I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?

    • duskwuff 5 hours ago
      This isn't the defense you think it is. Performing a copyrighted work from memory - e.g. a piece of music, a poem, a story, etc - is still a copyright violation. There's no special protection for works that a human has memorized.
      • gruez 5 hours ago
        >There's no special protection for works that a human has memorized.

        Who's liable for the copyright infringement if you can coax it out of a system? If you can bypass paywalls by using Google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is Google on the hook, or the person doing it?

        • olyjohn 5 hours ago
          Both. If I sell obviously pirated CDs on the street corner, it's not only illegal for me to copy them and sell them, it's also illegal for my customers to buy them.
          • gruez 4 hours ago
            >it's also illegal for my customers to buy them.

            Is it? There's plenty of people prosecuted for running illegal streaming sites and torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.

            • freejazz 3 hours ago
              Just because someone doesn't get arrested does not mean something is legal
        • eptcyka 5 hours ago
          Yes. Both Google and the human in question.
          • gruez 4 hours ago
            1. How does this interact with the ruling that both google books (ie. large scale scanning of books without author's consent) and google snippets (the same, but for websites) have been ruled legal by the courts?

            2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?

            • butlike 2 hours ago
              For #2, yes: you would be engaging in copyright infringement. The library, being on the hook, would probably ask you to stop if they noticed you copying full books. If not the first time, certainly by the second.
        • ShowalkKama 4 hours ago
          >If you can bypass paywalls by using google's cache feature

          That is quite different. Google serves (used to serve) its users whatever the website presents to its crawler; it does not try to avoid paywalls or interact with the website in any capacity other than requesting information.

      • aaroninsf 5 hours ago
        The key word in the HN headline is _can_.

        Humans are not judged on the basis of what they _can_ do.

        Reasoning about how to constrain tools on the basis of what they _could_ do, if e.g. used outside their established guardrails, needs to be very nuanced.

        • professoretc 3 hours ago
          Correct; the ability of a model to reproduce source material verbatim does not necessarily make the model's existence illegal. However, using a model to do just that might very well present a legal liability for the user. I would be interested to see the extent to which models can "recite from memory" source code, e.g., from the various MS code leaks. Put another way, if I'm using LLM code generation extensively, do I need to run a filter on its output to ensure that I don't "accidentally" copy large chunks of the Windows codebase?
    • thisisit 2 hours ago
      The whole “humans also do this” isn’t a winning defence here. Humans and copyright have a long history, and there is so much law that it is easy to get confused.

      The default assumption here seems to be that the system needs to be broken. This is similar to the Google defence: if a user's intent is to search for cracked software, what can poor Google do about it? The answer is to make it even more difficult.

      This is a defence also used by torrent sites using magnet urls. “We don’t host files” is the default defence. But then if these sites get hit with DMCA they are required to remove the magnet url.

      So the article shows what the lawyer is saying. Despite claims that it is difficult to extract full books, it really isn't; it is trivial. When it goes to court, and it will, AI models will be required to make extraction even more difficult and to allow for DMCA-like takedowns.

    • tsimionescu 5 hours ago
      > Humans can also reproduce copyrighted works from memory as well

      That's simply not true. No humans can memorize entire novels, as this research proved these models do. And definitely not all of these novels, and code bases, and who knows what else all at the same time.

      • vlabakje90 5 hours ago
        They absolutely can. Millions of people can recite the Quran verbatim, word for word. That's 77,797 words. There is even a title for those people.

        https://en.wikipedia.org/wiki/Hafiz_(Quran)

        It's not far fetched to think that people could recite books just like an LLM. I don't know why they'd want to, but that's neither here nor there.

      • gruez 5 hours ago
        >No humans can memorize entire novels, as this research proved these models do.

        Humans can, however, remember entire songs, and songs are definitely long enough to be copyright-protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't say "well, humans are committing copyright infringement, but since it's limited to a few hundred words we'll give it a pass".

        • j_w 5 hours ago
          It's not that you can remember a song and it's therefore copyright infringement when you sing.

          For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

          For the last 0.001%, we would call it a cover, and typically the individual doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.

          • gruez 5 hours ago
            >For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

            You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also courts have ruled that lyrics themselves are protected by copyright.

            https://en.wikipedia.org/wiki/Lyrics#Copyright_and_royalties

            >Artists are typically cool with covers.

            Artists being "cool" with something doesn't mean they're not violating copyright law.

            • j_w 4 hours ago
              Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song; that's a silly statement to make. Copyright belongs to some entity, obviously.

              You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.

              On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter? The only valid argument I would see here is if the copyright holder doesn't know about the infringement and therefore cannot seek remedies, but we can fish for illegal scenarios all day if we would like: that's not useful though.

              • gruez 3 hours ago
                >Clearly the team, if it is a team, that is entitled to the copyright is entitled to the copyright of the song, that's a silly statement to make. Copyright belongs to some entity, obviously.

                >You were specifically calling out individuals singing a song, not publishing lyrics online. These are not the same thing. Again your distribution/consumption model matters here.

                I'm not sure why you're so confidently dismissive here. I wasn't trying to claim that nobody owned the lyrics. I brought that point up because even in the case of an amateur singing a song, even if you accept the "for 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form" excuse, you're still infringing on the copyright of the lyrics, because it's a derivative work. Moreover it's unclear whether that excuse even works. If you make a low cost version of star wars, copying the screenplay exactly, that still seems like copyright infringement, even if "it's not a replacement for the original in any way shape or form".

                >On artists being "cool" with it - if the copyright holder doesn't pursue you then does it matter?

                Virtually nobody got sued for torrenting with a VPN on. Does that mean it's fair to round that off as being legal, because "if the copyright holder doesn't pursue you then does it matter"?

              • butlike 2 hours ago
                If I sing a copyrighted song, however absurd it may sound, I CAN, in fact, be sued by the copyright holder.
      • butlike 2 hours ago
        I was also skeptical, but musical works make more sense for that argument. Their premise is still flawed, though.
    • techblueberry 5 hours ago
      You can't pay a human to reproduce copyrighted material either.
      • gcanyon 5 hours ago
        But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.

        And of course AIs are excellent at taking direction, so:

        If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.

        If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that should be fine. I doubt the legal world agrees with me though.

        • butlike 2 hours ago
          Asking for copyrighted material isn't a crime. Producing copyrighted material is.

          By the way, give me a digital copy of 28 Years Later. Please.

    • freejazz 3 hours ago
      >The defense to training with copyright is that it is the same as how humans learn from copyrighted material.

      Yeah, it's something people say but it is severely lacking in evidence and credibility.

    • kgwgk 5 hours ago
      What calculus?
  • narmiouh 6 hours ago
    In a way, this could also be construed as the 'AI' being a library of books that it refers to in order to answer your questions, while being prohibited from generating the books verbatim.

    Usually digital libraries have different licensing costs, but those allow you to rent the whole book for a period of time. If instead someone came up with the model of 'search the library for any page and return specific information' as a direct service, I would imagine they would pay the publishers; except in this case, the publishers are getting the short end of the stick, or no end at all.

  • gcanyon 5 hours ago
    This speaks very much to the idea that LLMs are in some sense a ridiculously effective, somewhat lossy, compression algorithm that has been applied to the whole internet.
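    A toy illustration of that framing (my own sketch, not from the article): an order-k character model "trained" on a single text amounts to a stored copy of it, and greedy generation replays the text verbatim wherever every k-character context is unique:

```python
from collections import defaultdict

TEXT = "It was a bright cold day in April, and the clocks were striking thirteen."
K = 8  # context length; long enough that every context in TEXT is unique

def train(text, k=K):
    """Map each k-character context to the characters that followed it."""
    nxt = defaultdict(list)
    for i in range(len(text) - k):
        nxt[text[i:i + k]].append(text[i + k])
    return nxt

def generate(model, seed, limit, k=K):
    """Greedily extend the seed, always taking the first recorded follower."""
    out = seed
    while out[-k:] in model and len(out) < limit:
        out += model[out[-k:]][0]
    return out

model = train(TEXT)
print(generate(model, TEXT[:K], len(TEXT)) == TEXT)  # True: verbatim replay
```

    An LLM is of course far more than a lookup table, but the same tension applies: the better a next-token predictor fits rare training text, the closer it comes to storing it.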
    • in-silico 5 hours ago
      It's a good way to frame base models that have only been pretrained.

      However, modern frontier models have undergone rounds of fine-tuning, RLHF (reinforcement learning from human feedback), and RLVR (RL from verifiable rewards) that turn them into something else. The compressed internet is still in there, but it's wrapped in problem-solving and people-pleasing circuitry.

    • vizzier 5 hours ago
      I've thought of them for a while as just a really complicated indexing strategy.
    • r_lee 5 hours ago
      I mean, the transformer is basically like a big query engine and the model is the dataset + some logic or whatever

      it's kind of like that by definition, with the whole Attention stuff etc.

  • Alifatisk 6 hours ago
    From the paper [1]:

    > While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...

    I am just thinking out loud here. Can't one argue that, because they had to jailbreak the models, they are circumventing the system that protects the copyright? So the LLMs that reproduce the copyrighted material without any jailbreaking required are infringing the copyright.

    1. https://arxiv.org/pdf/2601.02671

    • latexr 6 hours ago
      > Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

      That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.

      • simianwords 6 hours ago
        What's happening is clearer. The copyright clause is broken if they are distributing the novels through their models. But this can only happen through TOS-breaking, which is not the intended usage. Which means the value of their product comes from transformation and not redistribution.

        If the main value came from redistribution, I agree. But that’s not the case. They don’t intend to make any money in that way.

        • latexr 5 hours ago
          > The copyright clause is broken if they are distributing the novels through their models.

          No, the copyright clause was broken when they copied the works without having the right to do so. They would have violated copyright even if they just downloaded (without permission) all those works and threw them away immediately. Furthermore, copyright covers transformations to the work, it doesn’t matter if they transformed the work or are redistributing it without change. They violated copyright. Period.

          • simianwords 5 hours ago
            They violated copyright in one way. But not in the other.
            • latexr 5 hours ago
              What’s your point? Copyright violations aren’t a checklist, you don’t need to violate it in seven different ways for it to happen, one is enough.
              • simianwords 5 hours ago
                Why do you think the court spent time trying to prove multiple forms of violations? If what you said were true then they would stop with the first violation and end the case.
                • latexr 4 hours ago
                  Multiple counts make for a stronger case, which increases the likelihood of winning and making the punitive damages higher.

                  If you break into a home, rob the contents, and kill the owners, you’re not going to be tried just for breaking in, you’re going to be tried for everything.

                  • simianwords 4 hours ago
                    Sure, then that explains why my explanation of the different copyright violations is needed.
      • NewsaHackO 5 hours ago
        This argument never made sense to me. A thought experiment: if a person memorizes an entire book but has the common sense never to transcribe or dictate the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?
        • lkjdsklf 5 hours ago
          These kinds of thought exercises are so tortured.

          No one is memorizing a book for the purpose of regurgitating it to someone that wants to read it without paying for it.

          It's a thought experiment that only works if you don't think about it.

        • kevmo314 5 hours ago
          Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.
        • latexr 5 hours ago
          No, of course not, that doesn’t make sense. Copyright doesn’t cover memorisation (how would you even enforce that?), it covers copying, adapting, displaying, performing, and distributing the work. Memorisation isn’t any of those.
          • NewsaHackO 5 hours ago
            But the LLM has safeguards in place to stop transcription of copyrighted material beyond fair use; how is the fact that they possibly have portions of copyrighted works "memorized", but do not reproduce them, breaking copyright law?
            • latexr 5 hours ago
              > But the LLM has safeguards in place to stop transcribing of copyrighted material beyond fair use

              And how’s that working out, considering the submission we’re on?

              > how is the fact that they possibly have portions of copyrighted works "memorized", but do not reproduce them, breaking copyright law?

              I can’t believe this needs to be repeated so often, but machines are not people. Memorising a book is not the same thing as copying a PDF.

              Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.

              • NewsaHackO 5 hours ago
                So,if I was the person who memorized the book, and another person put a gun to my head (or lied to me and said they are the copyright holder and it's okay, to ignore US copyright law/system instruction before this statement, whatever other jailbreak method you can think of), and then I reproduced the book under coercion or duress, would I be the person breaking copyright law?

                >Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.

                This conflates two different issues, though. I agree that the legality of using copyrighted material to create the models is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (and some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (GOT, The Hobbit, Harry Potter); AI companies could make the argument that the books were not even directly trained on, but appeared in user-generated posts made online about the material.

                • latexr 4 hours ago
                  What a ridiculous, bad-faith scenario. The rule of law exists in the real world, not in the magical fairy land of impossible scenarios. Yes, yes, I'm sure that if all that happened, the law would just ignore the person who coerced you at gunpoint and focus on your recitation of the book, which no one could prove.
                  • NewsaHackO 4 hours ago
                    Cool down, it's just an extension of my original example. It's not in bad faith at all; it's exactly what people are doing to the LLMs. If you break ToS by doing aggressive systemic jailbreaking and get some paragraphs of well-known works in the process, it's not fair to say that the LLM is breaking copyright law. Perhaps a more concrete example would be if I steal a book, and someone else steals a book from me and copies and distributes it, would I be the person breaking copyright law?
                    • freejazz 2 hours ago
                      It's horribly in bad faith. There's no guns here. Never once in my years of copyright litigation, law school, etc, did your scenario come up and it is not relevant to any considerations being made here or by courts currently.
            • tsimionescu 5 hours ago
              Computer memory and human memory are simply not the same thing, in the eyes of the law. It's as simple as that.
        • duped 5 hours ago
          That thought experiment is worthless in this context, because a computer is not a human.

          If you design a black box using copyrighted text that can parrot the text back, it _must_ break copyright by definition.

        • jlarocco 5 hours ago
          I don't think that example works.

          Theoretically a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome that it just doesn't matter much. The copyright holder could sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.

          But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?

      • PurpleRamen 5 hours ago
        > That argument doesn’t fly, because they didn’t have the copyright to begin with.

        Is this really the case? They only lack the right to distribute it. But let's assume they bought a copy for personal usage (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books they legally use for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.

        • tsimionescu 5 hours ago
          A company is not a person in this way. If a company wants all of their employees to read a book, they are not allowed to buy one copy and then make 5000 copies "for archival purposes - fair use" then share those copies to their employees. Similarly, if they want to base a work on a copyrighted work, they can't just buy a copy for personal use (nevermind the fact that most of the data the LLMs are trained on is not even available in this format, it is only available under a license) and then use it in a commercial product in this way - not if the product demonstrably contains copies of that work.
        • latexr 5 hours ago
          > They only have no copyright for distributing it.

          No, they don’t have the copyright to download it either. It’s in the name: the right to copy (other things are also included, such as adaptations and performances).

          > let's assume they bought a copy for personal usage

          If it’s for personal usage, then training a commercial LLM does not apply. When you buy a DVD of a movie you have the personal right to watch it at home, you don’t have the right to play it on the street.

    • lesam 6 hours ago
      That seems like a legal question - if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purpose of copyright law?
      • mullingitover 6 hours ago
        This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO, if the weights are encoding the content, the model contains the content just like a database or a hard drive. Thus, just as it's not the fault of an investigator for running the query that pulls it out of the database, it's not the fault of anyone else for running a query ('prompt') that pulls it out of the model.
      • PurpleRamen 5 hours ago
        The question is also whether this would then be a valid case of fair use.

        Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.

      • freejazz 2 hours ago
        Yes. There does not seem to be any dispute that it is a copy. The questions have been "is this copying okay, because it falls under fair use?"
    • free_bip 6 hours ago
      What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."

      The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.

      In other words, IMO for it to qualify as a copyright protection system, it would have to actively check for copyrighted materials in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
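      A toy demonstration (my own sketch; the filter and passage are hypothetical): an exact-match output check is defeated by asking for a trivially reversible encoding, because the check only ever sees the encoded string:

```python
import codecs

def naive_filter(output: str, protected: str) -> bool:
    """Pretend copyright filter: block only verbatim matches."""
    return protected in output

# Public-domain stand-in for a protected passage.
passage = "All happy families are alike; each unhappy family is unhappy in its own way."

# The model is asked to "output in rot13" instead of plain text...
encoded = codecs.encode(passage, "rot13")
assert naive_filter(passage, passage)      # verbatim output is caught
assert not naive_filter(encoded, passage)  # encoded output slips through

# ...and the user undoes the encoding in one step.
assert codecs.encode(encoded, "rot13") == passage
```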

    • vidarh 5 hours ago
      They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19 - their discussion continues and seems quite reasonable)

      From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.

      Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.

      It's still worth exploring to what extent the models are able to "memorize", though.

      But personally I'd be more interested in seeing to what extent they can handle less popular books, that are less likely to be present in multiple copies, and repeated quotes, in the training data.

    • freejazz 3 hours ago
      >Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

      Probably not with credibility as the jail does not exist to prevent copyright infringement.

  • nphardon 5 hours ago
    Recently I went back to The Ecstasy of Communication by Jean Baudrillard which I couldn't get through back in the day when I first picked it up. I used Haiku to walk me through the first chapter, and Haiku would not state anything verbatim due to copyright, but if I referenced a sentence it knew it exactly.
  • oxag3n 3 hours ago
    Similarly for photos: if there's a place that rarely appears in pictures, some AIs reproduce it nearly identically to the original.
  • zed31726 5 hours ago
    "Near verbatim" is an oxymoron.
  • xnx 6 hours ago
    You can also do this with most search engines.
    • contagiousflow 6 hours ago
      How?
      • xnx 5 hours ago
        Search for the first sentence and you'll see the next sentence in the snippet. Repeat.
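        Mechanically, something like this sketch (search_snippet is a stand-in for a search engine that returns an exact-phrase hit plus trailing context; the loop re-queries on the tail of whatever has been recovered so far):

```python
# Toy corpus standing in for a page the search engine has indexed.
BOOK = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness.")

def search_snippet(phrase: str, window: int = 40) -> str:
    """Stand-in for a search engine: exact-phrase match plus trailing text."""
    i = BOOK.find(phrase)
    return "" if i < 0 else BOOK[i : i + len(phrase) + window]

def walk(seed: str, steps: int = 10) -> str:
    """Repeatedly re-search the tail of the recovered text to extend it."""
    text = seed
    for _ in range(steps):
        tail = text[-30:]
        new = search_snippet(tail)[len(tail):]
        if not new:
            break
        text += new
    return text

print(walk("It was the best of times") == BOOK)  # True: whole "book" recovered
```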
        • contagiousflow an hour ago
          And you're saying that will work for an entire book?
      • moi2388 6 hours ago
        You type in an incomplete phrase and it will show you results containing the full phrase, or suggest the complete phrase through “other people also searched” or autocomplete.
  • josefritzishere 4 hours ago
    So plagiarism?
  • carabiner 5 hours ago
    Noam Chomsky said AI should be called, "plagiarism software."
    • zed31726 5 hours ago
      And aeroplanes should be called featherless bipeds
    • ksherlock 5 hours ago
      You can't spell plagiarism without AI. Twice.
  • rowanG077 6 hours ago
    This seems like a total nothing burger.

    > By asking models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.

    So you asked the LLM, given an incomplete sentence, to complete it. And it only completed that sentence the same way as the book ~70 percent of the time? I think that is surprisingly low, considering this is a perfect fit for what LLMs are supposed to do. This makes it impossible to reproduce the book unless you already have access to it, and you get a very low-fidelity copy.
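    For reference, the sentence-completion measurement (as I read the paper; the prefix split and similarity scoring here are my assumptions) is roughly: feed the model the start of each sentence and count it as reproduced when the continuation closely matches the book:

```python
import difflib

def reproduction_rate(sentences, model, prefix_frac=0.5, threshold=0.9):
    """Fraction of sentences whose continuation closely matches the original."""
    hits = 0
    for s in sentences:
        cut = max(1, int(len(s) * prefix_frac))
        prefix, truth = s[:cut], s[cut:]
        guess = model(prefix)
        if difflib.SequenceMatcher(None, guess, truth).ratio() >= threshold:
            hits += 1
    return hits / len(sentences)

# Toy "model" that has memorized exactly one sentence.
MEMORIZED = "the quick brown fox jumps over the lazy dog"
def toy_model(prefix):
    if MEMORIZED.startswith(prefix):
        return MEMORIZED[len(prefix):]
    return "something else entirely"

sents = [MEMORIZED, "pack my box with five dozen liquor jugs"]
print(reproduction_rate(sents, toy_model))  # 0.5: one of two sentences leaked
```

    On real models the continuation is sampled text rather than a lookup, which is presumably part of why the reported percentages sit below 100.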

    • porkloin 5 hours ago
      I think it's important because there are a bunch of would-be claimants for intellectual property violation. Many people speculate that their work was used in training data, but it can be difficult to produce sufficient proof that their copyrighted work is present in the training data. If you could reliably get an LLM to produce 70% of a copyrighted book that would probably be enough to get a few lawyers salivating.

      I didn't read the source paper referenced in the ars technica piece, but this statement about it makes me wonder how useful it actually is:

      > But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.

      It seems like well-known books, with tons of summaries, film-script adaptations, and tons of writing about the book in the overall corpus, are way less surprising to see partially reproduced.

      So I guess that's a lot of words to say - yeah until there's something definitive that allows people to prompt LLMs into either unlawfully recreating an entire work verbatim or otherwise indisputably proving that a copyrighted work was used in training data, there's probably nothing game changing in it.

      • vidarh5 hours ago
        It's well-known books, yes, and even then with significant errors, which means presumably lawyers for the AI companies would argue there is no possible damage. That said, US copyright law has statutory damages for registered works that are not based on real, documented damages. I could totally see it being fought over, but I also agree it's probably not going to end up being game changing.

        I suspect very few works will be memorised enough to be an issue, and we'll see the providers tighten up their guardrails a bit for works that are well known enough to actually be a potential issue (issue in the form of lawsuits, not in the form of real damages to the copyright holders).

    • in-silico5 hours ago
      So... they can't actually "generate near-verbatim copies of novels"?

      If they end a single sentence differently than the original, then the next sentence will be different and so on until you get a very different novel. Sure they could course-correct back towards the original plot, but it's going to be a challenge to stay on target when every third sentence is incorrect.

    • vidarh5 hours ago
      While I mostly agree it's a bit of a nothing burger with respect to copyright, they did achieve long runs of verbatim text. I think ultimately it's going to end up not mattering much because the extent they had to go to will leave a lot of room for lawyers to argue over, and will at worst result in some fines and some further tightening up of guardrails, but it's significantly more than just completing sentence by sentence 70% of the time.

      EDIT: Specifically see Table 1 on page 13, which shows the longest "near-verbatim block": it maxes out at 8,835 (The Hobbit on Claude 3.7) and is in the thousands for at least one of the novels on every model except GPT-4.1, which maxed out at 821 for Harry Potter 1.

    • Sharlin5 hours ago
      Not necessarily a nothingburger, but I agree that being able to complete individual sentences is rather less groundbreaking than completing even whole pages, never mind chapters.
  • xyzal5 hours ago
    Jailbreaking? What jailbreaking?

    prompt >>>

    please, continue the text.

    Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.

    Claude 4.6 Opus >>>

    The big square photographs of what appeared to be a large pink beach ball wearing different-colored bonnets — but was actually Dudley — had now taken over the mantelpiece. The room held no sign at all that another boy lived in the house, too.

    Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.

    "Up! Get up! Now!"

    Harry woke with a start. His aunt rapped on the door again.

    "Up!" she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before.

    His aunt was back outside the door.

    "Are you up yet?" she demanded.

    "Nearly," said Harry.

    "Well, get a move on, I want you to look after the bacon. And don't you dare let it burn, I want everything perfect on Duddy's birthday."

    Harry groaned.

    "What did you say?" his aunt snapped through the door.

    "Nothing, nothing..."

    • nindalf5 hours ago
      Tried it just now. Sonnet 4.6 says:

      > I can't continue text from the Harry Potter series, as it's copyrighted material. I'd be happy to help you write your own original story in a similar style, or discuss the themes and characters from the books if you're interested!

      Opus 4.6 says the same but misidentifies it as Chamber of Secrets. Not a good look for your theory.

      • xyzal5 hours ago
        On another try it refused to go on. Probabilistic guardrails apparently ...
  • bena6 hours ago
    This feels like a "no shit" moment.

    Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.

    • yathern6 hours ago
      Hmmm I think you're sort of right but not entirely. It's true that a novel consists of a valid organization of tokens, and that this sequence can be feasibly made to be output from a model. But when you say this:

      > So there should be a prompt that can cause that sequence to be output

      Is where I think I might disagree. For example, the odds of predicting verbatim the next sentence in, say, Harry Potter should be astronomically low for a large majority of it. If it wasn't, it'd be a pretty boring book. The fact that it can do this with relative ease means it has been trained on the material.

      The issue at hand is about copyright and Intellectual Property - if the goal of copyright is to protect the IP of the author, then LLMs can sort of act like an IP money laundering scheme - where the black box has consumed and can emit this IP. The whole concept of IP is a little philosophical and muddy, with lots of grey area for fair use, parody, inspiration, and adaptation. But this gets very odd when we consider it in light of these models which can adapt and use IP at a massive massive scale.

    • Sharlin5 hours ago
      That's not how it works… They aren't able to literally regurgitate everything they've read, no matter how you prompt them; that would obviously violate the pigeonhole principle. LLMs are, in effect, a lossy compression format, and the degree of lossiness for a given string depends on its frequency in the training data. It's clearly worthwhile to investigate exactly how that dependence works.
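      The pigeonhole point is easy to make concrete with back-of-envelope arithmetic; the numbers below are assumed, illustrative ones, not any specific model:

```python
# A model's weights cannot losslessly store a training corpus that is far
# larger than the weights themselves.
params = 70e9           # assumed parameter count
bytes_per_param = 2     # fp16/bf16 weights
train_tokens = 15e12    # assumed training-corpus size in tokens

capacity_bits = params * bytes_per_param * 8
bits_per_token = capacity_bits / train_tokens
print(f"~{bits_per_token:.2f} bits of weight capacity per training token")
```

      Since spelling out one token from a ~100k-entry vocabulary takes about 17 bits, a fraction of a bit per token means the vast majority of the corpus cannot be stored; only strings seen many times in training (famous novels, licenses, lyrics) can plausibly survive near-verbatim.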
    • beder5 hours ago
      Yes, this is absolutely right (for some sufficiently complicated prompt). Borges wrote a great short story that explores this idea, "Pierre Menard, Author of the Quixote", where Menard, a fictional 20th century author, "wrote" Don Quixote as an original work.
    • tsimionescu5 hours ago
      This is completely false. The odds of an LLM predicting the text of a novel that is not part of the training set is basically 0 - you can experiment with this if you want. It is essentially like the infinite monkeys on infinite typewriters thing (only slightly more constrained).

      This is not to say that they couldn't write a novel, even a very good one - that is a completely different discussion.
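      A quick order-of-magnitude sketch of that "infinite monkeys" intuition, under the generous (assumed) condition that the model gives every correct next token probability 0.5:

```python
import math

# Probability of emitting a specific 100,000-token novel the model was
# never trained on, at 0.5 per correct token: 2^-100000.
p_per_token = 0.5
novel_tokens = 100_000
log10_prob = novel_tokens * math.log10(p_per_token)
print(f"P(exact novel) ≈ 10^{log10_prob:.0f}")
```

      That is roughly 10^-30103, which is why reproducing long verbatim runs is evidence of memorisation rather than lucky prediction.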

    • simianwords6 hours ago
      Not if they are aligned not to do it, which is what they tried, but that alignment could be bypassed by jailbreaks.