Relicensing with AI-Assisted Rewrite (tuananh.net)

133 points by tuananh 6 hours ago | 30 comments

dspillett 5 minutes ago
> Accepting AI-rewriting as relicensing could spell the end of Copyleft
The more restrictive licences perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might result in either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.
abrookewood 2 hours ago
This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177
- shevy-java 42 minutes ago
  That's another project though, right? In this case I think it is different because that project just seems stolen. The courts can probably verify this too.
  I think the main question is when a rewrite is a clean rewrite, via AI. If it is a clean rewrite they can choose any licence.
  - littlestymaar 25 minutes ago
    No, TFA is about chardet:
    > chardet, a Python character encoding detector used by requests and many others, has sat in that tension for years: as a port of Mozilla’s C++ code it was bound to the LGPL, making it a gray area for corporate users and a headache for its most famous consumer.
kshri24 5 hours ago
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.
How would that work? We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not. IANAL but IMHO it is totally illegal as no permission was sought from authors of source code the models were trained on. So there is no way to just release the code created by a machine into public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering" and that is not specific only to humans. You can extend it to machines as well.
EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model should be set up so that the original authors (all Github users who contributed code in some form) are reimbursed by AI companies. In other words, a % of profits must flow back to the community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. This should also extend to artists whose art was used for training.
- kouteiheika 4 hours ago
  > I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.
  That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
  There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to state-of-art.
  But I guess the funniest consequence of the "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at very least force a revert to a pre-AI era commit) every open source project which may have included any AI-generated or AI-assisted code, which currently pretty much includes every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
  - progval 4 hours ago
    > There are research models out there which are trained on only permissively licensed data
    Models whose authors tried to train only on permissively licensed data.
    For example https://huggingface.co/bigcode/starcoder2-15b tried to be a permissively licensed dataset, but it filtered only on repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.
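A minimal sketch of the difference between repository-level and file-level filtering, assuming a corpus held as a hypothetical `path -> source text` mapping. The marker list and the `has_copyleft_header` / `filter_files` names are made up for illustration; this is not how any real dataset was built, and a serious filter would need far more patterns than the two shown.

```python
import re

# Hypothetical marker list: a real filter would need many more patterns
# (other licence families, unusual header wording, licence files, ...).
COPYLEFT_MARKERS = [
    re.compile(r"under the terms of the GNU (?:Lesser |Affero )?General Public License", re.I),
    re.compile(r"SPDX-License-Identifier:\s*(?:L|A)?GPL", re.I),
]


def has_copyleft_header(text: str, head_lines: int = 40) -> bool:
    """Return True if the first `head_lines` lines contain a copyleft marker."""
    head = "\n".join(text.splitlines()[:head_lines])
    # Collapse whitespace and comment characters so that a header wrapped
    # across several comment lines still matches as one sentence.
    head = re.sub(r"[\s#*/]+", " ", head)
    return any(pattern.search(head) for pattern in COPYLEFT_MARKERS)


def filter_files(files: dict[str, str]) -> dict[str, str]:
    """Keep only the files whose own header carries no copyleft marker."""
    return {path: src for path, src in files.items() if not has_copyleft_header(src)}
```

The point of checking each file is that a repository tagged as permissive can still contain individual files carrying a GPL header, which a repository-level check never looks at.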
  - kshri24 4 hours ago
    I agree with your assessment. Which is why I was proposing a middle-ground where an agreement is set up between the model training company and the collective of developers/artists et al., with a license agreement under which they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair not only because companies are using AI generated output but developers themselves are also paying and using AI generated output that is trained on other developers' input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and they are being paid for it.
    carlob 35 minutes ago
    Why settle on some private agreement between creators and ai companies where a tiny percentage is shared, let's just tax the hell out of AI companies and redistribute.
    kshri24 6 minutes ago
    > let's just tax the hell out of AI companies and redistribute.
    That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?
    kouteiheika an hour ago
    > Which is why I was proposing a middle-ground where an agreement is set up between the model training company and the collective of developers/artists et al., with a license agreement under which they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair
    That wouldn't be fair because these models are not only trained on code. A huge chunk of the training data is just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?
    I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.
    Those big LLM companies harvest everyone's data en-masse without permission, train their models on it, and then not only do they not release jack squat, but they also have the gall to put up malicious explicit roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.
    [1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...
  - foota 3 hours ago
    I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.
    msdz an hour ago
    They’d probably get the farthest, but they won’t pursue that because they don’t want to end up leaking the original data from training. It is possible in regular language/text subsets of models to reconstruct massive consecutive parts of the training data [1], so it ought to be possible for their internal code, too.
    [1] https://arxiv.org/abs/2601.02671
- shevy-java 40 minutes ago
  "We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."
  I think it will depend on HOW the AI arrived at the new code.
  If it was using the original source code then it probably is guilty-by-association. But in theory an AI model could also generate a rewrite if being fed intermediary data not based on that project.
- d1sxeyes an hour ago
  > We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not.
  That horse has bolted. No one knows where all the AI code is any more, and it would no longer be possible to comply with a ruling that no one can use AI generated code.
  There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.
  - conartist6 4 minutes ago
    I hate that this may be true, but I also don't think the law will fix this for us.
    I think it is down to the community and the culture to draw our red lines and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects slowly flows to the most profit-motivated entities.
- adrianN 4 hours ago
  We‘ll have to wait until the technology progresses sufficiently that AI cuts into Disney’s profit.
- thedevilslawyer 4 hours ago
  That's impractical enough that you might as well wish for UBI and world peace rather than this.
  - kshri24 4 hours ago
    Why is it impractical? Github already has a sponsor system. Also this can be a form of UBI.
nairboon 5 hours ago
That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.
Also the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright due to this transformation? Imagine the consequences to the US copyright industry if that were actually possible.
- pavlov 37 minutes ago
  If anything, the SCOTUS decision would seem to imply that generative AI transformations produce no additional creative contribution and therefore the original copyright holder has all rights to any derived AI works.
  (IANAL)
shevy-java 44 minutes ago
> In traditional software law, a “clean room” rewrite requires two teams
So, I dislike AI and wish it would disappear, BUT!
The argument is strange here, because ... how can a2mark ensure that AI did NOT do a clean-room conforming rewrite? Because I think in theory AI can do precisely this; you just need to make sure that the model used does that too. And this can be verified, in theory. So I don't fully understand a2mark here. Yes, AI may make use of the original source code, but it could "implement" things on its own. Ultimately this is finite complexity, not infinite complexity. I think a2mark's argument is in theory weak here. And I say this as someone who dislikes AI. The main question is: can computers do a clean rewrite, in principle? And I think the answer is yes. That is not saying that claude did this here, mind you; I really don't know the particulars. But the underlying principle? I don't see why AI could not do this. a2mark may need to reconsider the statement here.
- dspillett 14 minutes ago
  > how can a2mark ensure that AI did NOT do a clean-room conforming rewrite?
  In cases like this it is usually incumbent on the entity claiming the clean-room situation was pure to show their working. For instance how Compaq clean-room cloned the IBM BIOS chip¹ was well documented (the procedures used, records of comms by the teams involved) where some other manufacturers did face costly legal troubles from IBM.
  --------
  [1] the one part of their PCs that was not essentially off-the-shelf, so once it could be reliably legally mimicked this created an open IBM PC clone market
- titanomachy 41 minutes ago
  The foundation model probably includes the original project in its training set, which might be enough for a court to consider it “contaminated”. Training a new foundation model without it is technically possible, but would take months and cost millions of dollars.
- __alexs 38 minutes ago
  I think the problem here is that an AI is not a legal entity. It doesn't matter if you as an individual run an AI that takes the source, dumps out a spec that you then feed into another AI. The legal liability lies with the operator of the AI; the original copyleft license was granted to a person, not to a robot.
  Now if you had 2 entirely distinct humans involved in the process that might work though.
emsign 2 hours ago
By design you can't know if the LLM doing the rewrite was exposed to the original code base. Unless the AI company is disclosing their training material, which they won't because they don't want to admit breaking the law.
- shevy-java 33 minutes ago
  > By design you can't know if the LLM doing the rewrite was exposed to the original code base.
  I agree, in theory. In practice courts will request that the decision-making process be made public. The "we don't know" excuse won't hold; real people also need to tell the truth in court. LLMs may not lie to the court or use the Chewbacca defence.
  Also, I am pretty certain you CAN have AI models that explain how they arrived at their decisions. And they can generate valid code too, so anything can be autogenerated here - in theory.
- gostsamo an hour ago
  it was exposed when it was shown the thing to rewrite.
  - shevy-java 32 minutes ago
    In this context here I think that is a correct statement. But I think you can have LLMs that can generate the same or similar code, without having been exposed to the other code.
- soulofmischief an hour ago
  Seeing the source for a project doesn't prevent me from ever creating a similar project, just because I've seen the code. The devil is in the details.
  - shevy-java 33 minutes ago
    Agreed, but the courts can conclude that all LLMs that are not open about their decisions have stolen things. So LLMs would auto-lose in court.
- skeledrew an hour ago
  It doesn't even matter if the LLM was exposed during training. A clean-room rewrite can be done by having one LLM create a highly detailed analysis of the target (reverse engineering if it's in binary form), and providing that analysis to another LLM to base an implementation on.
  - k__ an hour ago
    It doesn't matter for the LLM writing the analysis.
    It does matter for the one who implements it.
    Finding an LLM that's good enough to do the rewrite while being able to prove it wasn't exposed to the original GPL code is probably impossible.
  - xyzsparetimexyz an hour ago
    Why does it need 2 LLMs? LLMs aren't people. I'm not even sure that it needs to be done in 2 separate contexts
    shevy-java 32 minutes ago
    Agreed. But even then I don't see the problem. Multiple LLMs could work on the same project.
- d1sxeyes an hour ago
  Is it against the law for an LLM to read LGPL-licensed code?
  That’s a complex question that isn’t solved yet. Clearly, regurgitating verbatim LGPL code in large chunks would be unlawful. What’s much less clear is a) how large do those chunks need to be to trigger LGPL violations? A single line? Two? A function? What if it’s trivial? And b) are all outputs of a system which has received LGPL code as an input necessarily derivative?
  If I learn how to code in Python exclusively from reading LGPL code, and then go away and write something new, it’s clear that I haven’t committed any violation of copyright under existing law, even if all I’m doing as a human is rearranging tokens I understand from reading LGPL code semantically to achieve a new result.
  It’s a trying time for software and the legal system. I don’t have the answers, but whether you like them or not, these systems are here to stay, and we need to learn how to live with them.
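To make the "how large do those chunks need to be" question concrete, here is a rough sketch of one way to measure verbatim overlap between two pieces of source code, using token n-grams. The function names and the default `n = 8` are arbitrary assumptions for illustration; the number it produces says nothing about where a legal threshold would sit.

```python
import re


def token_ngrams(source: str, n: int) -> set[tuple[str, ...]]:
    """Split source into crude tokens and return the set of n-token windows."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, original: str, n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also occur in the original."""
    cand = token_ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n tokens: nothing to compare.
        return 0.0
    return len(cand & token_ngrams(original, n)) / len(cand)
```

A small `n` flags trivial idioms that any two programs share, while a large `n` only catches long copied runs, which is the same tension as the "a single line? a function?" question above.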
samrus 4 hours ago
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.
I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, and so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside of the original code's license. What's the paradox here?
- Sharlin 2 hours ago
  The point is that even a work written by an AI trained exclusively on liberally licensed or public domain material cannot have copyright (isn’t a "work" in the legal sense) and thus nobody has standing to put it under a license or claim any rights to it.
  If I train a limerick generator on the contents of Project Gutenberg, no matter how creative its outputs, they’re not copyrightable under this interpretation. And it’s by far the most reasonable interpretation of the law as both intended and written. Entities that are not legal persons cannot have copyright, but legal persons also cannot claim copyright of something made by a nonperson, unless they are the "creative force" behind the work.
- NitpickLawyer 3 hours ago
  > To me it sounds like the AI-written work cannot be copyrighted
  I think we haven't even begun to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be way more litigation before we can confidently say it's settled.
  If "generated" code is not copyrightable, where do we draw the line on what generated means? Do macros count? Does code that generates other code count? Protobuf?
  If it's the tool that generates the code, again where do we draw the line? Is it just using 3rd party tools? Would training your own count? Would a "random" code gen and pick the winners (by whatever means) count? Bruteforce all the space (silly example but hey we're in silly space here) counts?
  Is it just "AI" adjacent that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?
  Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications..."
  - lelanthran an hour ago
    All of your questions have seemingly trivial answers. Maybe I am missing something, but...
    > If "generated" code is not copyrightable, where do we draw the line on what generated means? Do macros count?
    Does the output of the macro depend on ingesting someone else's code?
    > Does code that generates other code count?
    Does the output of the code depend on ingesting someone else's code?
    > Protobuf?
    Does your protobuf implementation depend on ingesting someone else's code?
    > If it's the tool that generates the code, again where do we draw the line?
    Does the tool depend on ingestion of someone else's code?
    > Is it just using 3rd party tools?
    Does the 3rd party tool depend on ingestion of someone else's code?
    > Would training your own count?
    Does the training ingest someone else's code?
    > Would a "random" code gen and pick the winners (by whatever means) count?
    Does the random codegen depend on ingesting someone else's code?
    > Bruteforce all the space (silly example but hey we're in silly space here) counts?
    Does the bruteforce algo depend on ingesting someone else's code?
    > Is it just "AI" adjacent that isn't copyrightable?
    No, it's the "depends on ingesting someone else's code" that makes it not copyrightable.
    > If so how do you define AI?
    Doesn't matter whether it is AI or not, the question is are you ingesting someone else's code.
    > Does autocomplete count?
    Does the specific autocomplete in question depend on ingesting someone else's code?
    > Intellisense?
    Does the specific Intellisense in question depend on ingesting someone else's code?
    > Smarter intellisense?
    Does the specific Smarter Intellisense in question depend on ingesting someone else's code?
    ...
    Look, I see where you're going with this - reductio ad absurdum and all - but it seems to me that you're trying to muddy the waters by claiming that either all code generation is allowed or no code generation is allowed.
    Let me clear the waters for all the readers - the complaint is not about code generation, it's about ingesting someone else's code, frequently for profit.
    All these questions you are asking seem to me to be irrelevant and designed to shift the focus from the ingestion of other people's work to something that no one is arguing against.
    NitpickLawyer an hour ago
    Interesting.
    > the complaint is not about code generation, it's about ingesting someone else's code, frequently for profit.
    Why do you think that is, and what complaint specifically? I was talking about this:
    > The Copyright Office reviewed the decision in 2022 and determined that the image doesn't include “human authorship,” disqualifying it from copyright protection
    There seems to be zero mention of training there. In fact if you read the appeals court decision [1] they don't mention training either:
    > We affirm the denial of Dr. Thaler’s copyright application. The Creativity Machine cannot be the recognized author of a copyrighted work because the Copyright Act of 1976 requires all eligible work to be authored in the first instance by a human being. Given that holding, we need not address the Copyright Office’s argument that the Constitution itself requires human authorship of all copyrighted material. Nor do we reach Dr. Thaler’s argument that he is the work’s author by virtue of making and using the Creativity Machine because that argument was waived before the agency.
    I have no idea where you got the idea that this was about training data. Neither the copyright office nor the appeals court even mention this.
    But anyway, since we're here, let's entertain this. So you're saying that training data is the differentiator. OK. So in that case, would training on "your own data" make this ok with you? Would training on "synthetic" data be ok? Would a model that sees no "proprietary" code be ok? Would a hypothetical model trained just on RL with nothing but a compiler and endless compute be ok?
    The courts seem to hint that "human authorship" is still required. I see no end to the "... but what about x", as I stated in my first comment. I was honestly asking those questions, because the crux of the case here rests on "human authorship of the piece to be copyrighted", not on anything prior.
    [1] - https://fingfx.thomsonreuters.com/gfx/legaldocs/egpblokwqpq/...
    lelanthran 22 minutes ago
    > There seems to be zero mention of training there. In fact if you read the appeals court decision [1] they don't mention training either:
    > ...
    > I have no idea where you got the idea that this was about training data. Neither the copyright office nor the appeals court even mention this.
    In both the story and the comments, that's the prevailing complaint. FTFA:
    > Their claim that it is a “complete rewrite” is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a “clean room” implementation). Adding a fancy code generator into the mix does not somehow grant them any additional rights.
    I mean, I know it's passe to read the story, but I still do it so my comments are on the story, not just the title taken out of context.
    > But anyway, since we're here, let's entertain this. So you're saying that training data is the differentiator.
    Well, that's the complaint in the story and in the comment section, so it makes sense to address that and that alone.
    > OK. So in that case, would training on "your own data" make this ok with you?
    Yes.
    > Would training on "synthetic" data be ok?
    If provenance of "synthetic data" does not depend on some upstream ingesting someone else's work, then yes.
    > Would a model that sees no "proprietary" code be ok?
    If the model does not depend on someone else's work, then Yes.
    > Would a hypothetical model trained just on RL with nothing but a compiler and endless compute be ok?
    Yes.
    *Note: Let me clarify that "someone else's work" means someone who has not consented or licensed their work for ingestion and subsequent reproduction under the terms that AI/LLM training does it. If someone licensed you their work to train a model, then have at it.
    NitpickLawyer 14 minutes ago
    Ah! I think I get where the confusion was. I was quoting something from another comment, and specifically commenting on that.
    > > To me it sounds like the AI-written work cannot be copyrighted
    I was only commenting on that.
    user34283 37 minutes ago
    I'm thinking that the relevant question would be whether the part whose copyrightability is in question is an intellectual invention of a human mind.
    "Ingesting someone else's code" does not seem very useful here - it's hardly quantifiable, nor is "ingestion" the key question I believe.
- __alexs 31 minutes ago
  AI-written code absolutely is copyrightable. There are just some unresolved tensions around where the lines are and how much and what kind of involvement humans need to have in the process.
- laksjhdlka 4 hours ago
  They say "if" it's a new work, then it might not be copyrightable, I guess. You suppose that it's still the original work, and hence it's still got that copyright.
  I think they are rhetorically asking if your position is correct.
dessimus 16 minutes ago
Interesting to see how this plays out. Conceivably if running an LLM over text defeats copyright, it will destroy the book publishing industry, as I could run any ebook thru an LLM to make a new text, like the ~95% regurgitated Harry Potter.
- timschmidt 11 minutes ago
  This has already been done via brute force for melodies: https://www.vice.com/en/article/musicians-algorithmically-ge...
  - amelius 8 minutes ago
    Did they listen to their own creation?
    If not, maybe it should not constitute a valid case in court.
    Also, I'm wondering if they are not themselves liable considering they have every copyrighted work in there too.
- kingstnap 8 minutes ago
  You could already do that before LLMs?
  Presumably there is already a law around why I can't just go borrow a book from my library, type out some 95% regurgitated variant on my laptop, and then try to publish it somewhere?
  Edit: I looked it up and the thing that stops you from publishing a bootleg "Harold Potter and the Wizards Rock" is this legal framework around "The Abstractions Test".
- amelius 11 minutes ago
  If enough people do this, then it may speed up the lawmaking process.
amelius 22 minutes ago
I think you should interpret it like this:
You cannot copyright the alphabet, but you can copyright the way letters are put together.
Now, with AI the abstraction level goes from individual letters to functions, classes, and maybe even entire files.
You can't copyright those (when written using AI), but you __can__ copyright the way they are put together.
- josephg 3 minutes ago
  > You can't copyright those anymore (when written using AI), but you __can__ copyright the way they are put together.
  Sort of, but not really. Copyright usually applies to a specific work. You can copyright Harry Potter. But you can't copyright the general class of "Wizard boy goes to wizard school". Copyrights generally can't be applied to classes of works. Only one specific work. (Direct copies - eg made with a photocopier - are still considered the same work.)
  Patterns (of all sorts) usually fall under patent law, not copyright law. Patents have some additional requirements - notably including that a patent must be novel and non-obvious. I broadly think software patents are a bad idea. Software is usually obvious. Patents stifle innovation.
  Is an AI "copy" a copy like a photocopier would make? Or is it a novel work? It seems more like the latter to me. An AI copy of a program (via a spec) won't be a copy of the original code. It'll be programmed differently. That's why "clean room reimplementations" are a thing - because doing that process means you can't just copy the code itself. But what do I know, I'm not a lawyer or a judge. I think we'll have to wait for this stuff to shake out before anyone really knows what the rules will end up being.
  Weird variants of a lot of this stuff have been tested in court. Eg the Google v Oracle case from a few years ago.
mfabbri77 5 hours ago
This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.
- abrookewood 2 hours ago
  It's not just open source, it is literally anything source-available, whether intentional or not.
- _dwt 4 hours ago
  Yes, this is the reason I've completely stopped releasing any open-source projects. I'm discovering that newer models are somewhat capable of reverse-engineering even compiled WebAssembly, etc. too, so I can feel a sort of "dark forest theory" taking hold. Why publish anything - open or closed - to be ripped off at negligible marginal cost?
  - Tiberium 4 hours ago
    People are just not realizing this now because it's mostly hobby projects and companies doing it in private, but eventually everyone will realize that LLMs allow almost any software to be reverse engineered for cheap.
    See e.g. https://banteg.xyz/posts/crimsonland/ , a single human with the help of LLMs reverse engineered a non-trivial game and rewrote it in another language + graphics lib in 2 weeks.
  - abrookewood 2 hours ago
    Why does it matter if it is 'ripped off' if you released it as open source anyway? I get that you might want to impose a particular licence, but is that the only reason?
  - seddonm1 4 hours ago
    It’s a real problem. I threw it at an old MUD game just to see how hard it is [0] then used differential testing and LLMs to rewrite it [1]. Just seems to be time and money.
    [0] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
    [1] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...
- user34283 2 hours ago
  I find the wording "protect from unwanted use" interesting.
  It is my understanding that what a GPL license requires is releasing the source code of modifications.
  So if we assume that a rewrite using AI retains the GPL license, it only means the rewrite needs to be open source under the GPL too.
  It doesn't prevent any unwanted use, or at least that is my understanding. I guess unwanted use in this case could mean not releasing the modifications.
  - mfabbri77 29 minutes ago
    If the AI product is recognised as "derivative work" of a GPL-compliant project, then it must itself be licensed under the GPL. Otherwise, it can be licensed under any other license (including closed source/proprietary binary licenses). This last option is what threatens to kill open source: an author no longer has control over their project. This might work for permissive licenses, but for GPL/AGPL and similar licenses, it's precisely the main reason they exist: to prevent the code from being taken, modified, and treated as closed source (including possible use as part of commercial products or SaaS).
zozbot2344 hours ago
If you ask an LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual with all copyrightable expression having been distilled out.
- gf0004 hours ago
  I guess it depends on if the source data set is part of the training data or not (if it's open source it is likely part of it).
  A lawyer could easily argue that the model itself stores a representation of the original, and thus it can never do a "fresh context".
  And to be perfectly honest, LLMs can quote a lot of text verbatim.
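  One rough way to look for verbatim carry-over is to find the longest token run shared by the original and the rewrite. The sketch below uses Python's standard `difflib`; the tokenizer and the two sample snippets are assumptions made for illustration, and a long shared run is only a signal to inspect by hand, not a legal test.

```python
import re
from difflib import SequenceMatcher
from typing import List


def tokenize(source: str) -> List[str]:
    """Split source into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def longest_shared_run(original: str, rewrite: str) -> List[str]:
    """Return the longest contiguous token sequence present in both texts."""
    a, b = tokenize(original), tokenize(rewrite)
    # autojunk=False so frequent tokens such as "(" are not silently ignored.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


if __name__ == "__main__":
    # Hypothetical snippets, not taken from any real project.
    original = "def detect(buf):\n    return probe(buf, table)\n"
    rewrite = "def guess(data):\n    return probe(data, table)\n"
    run = longest_shared_run(original, rewrite)
    print(len(run), " ".join(run))
```

  Renamed identifiers break up shared runs, so a short result does not show the rewrite is independent.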
- k__an hour ago
  How do you prove the training data didn't contain the code?
  I'd assume an LLM trained on the original would also be contaminated.
- miroljub4 hours ago
  The new agent who writes code has probably at least parts of the original code as training data.
  We can't speak about clean room implementation from LLM since they are technically capable only of spitting their training data in different ways, not of any original creation.
  - dizhn2 hours ago
    The conclusion of this would be that you can never license AI generated code since you can't get a release from the original authors.
    Of course in practice it would work exactly in the opposite fashion and AI generated code would be immune even if it copied code verbatim.
    jesterswilde2 hours ago
    I don't see what's wrong with that personally. If I pirated someone's software, and then sold it as my own and got caught, just because I sold a bunch of it doesn't mean those people who bought it now are in the clear. They are still using bootleg software in their business.
  - nubg3 hours ago
    Only in the case of open source code
Retr0id5 hours ago
> In traditional software law, a “clean room” rewrite requires two teams
Is the "clean room" process meaningfully backed by legal precedent?
- karlding5 hours ago
  I am not a lawyer, but from my understanding the legal precedent is NEC v. Intel which established that clean-room software development is not infringing, even if it performs the same functionality as the original.
  As an aside, this clean room engineering is one of the plot points of Season 1 of the TV show Halt and Catch Fire where the fictional characters do this with the BIOS image they dumped.
- Firehawke5 hours ago
  Sure. The reimplementation of the IBM PC BIOS that gave birth to IBM Compatibles is the canonical example.
- estimator72925 hours ago
  Yes. Compaq's reverse engineering of the IBM PC BIOS set the precedent.
- devmor4 hours ago
  It is the reason AMD exists.
Tomte4 hours ago
> The original author, a2mark , saw this as a potential GPL violation
Mark Pilgrim! Now that's a name I haven't read in a long time.
anilgulecha5 hours ago
This is precedent setting. In this case the rewrite was in the same language, but if there's a Python GPL project, and its tests (spec) were used to rewrite specs in Rust, and then an implementation in Rust, can the second project be legally MIT, or any other?
If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out in the next 1-2 years.
- yjftsjthsd-h3 hours ago
  > but if there's a Python GPL project, and its tests (spec) were used to rewrite specs in Rust, and then an implementation in Rust, can the second project be legally MIT, or any other?
  Isn't that what https://github.com/uutils/coreutils is? GNU coreutils spec and test suite, used to produce a rust MIT implementation. (Granted, by humans AFAIK)
- mlaretallack5 hours ago
  It's very important to understand "how" it was done. The GPL handles the "compile" step, and the result is still GPL. The clean room process uses 2 teams, separated by a specification. So you would have to:
  1. Generate a specification of what the system does.
  2. Pass it to another "clean" system.
  3. Have that second clean system implement based just on the specification, without any information on the original.
  That 3rd step is the hardest, especially for well known projects.
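  The separation described in these steps can be sketched as a pipeline where the implementer is only ever handed the specification. Everything here is hypothetical: `Spec`, `leaked_lines`, and `clean_room_rewrite` are illustrative names, the two stages are passed in as plain callables, and the leak check only catches verbatim copying, not paraphrased expression or anything the second system already knows.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Spec:
    """The only artifact allowed to cross from the first team to the second."""
    statements: List[str]


def leaked_lines(original_source: str, spec: Spec, min_length: int = 12) -> List[str]:
    """Return non-trivial source lines that appear verbatim inside the spec."""
    spec_text = "\n".join(spec.statements)
    leaks = []
    for line in original_source.splitlines():
        stripped = line.strip()
        if len(stripped) >= min_length and stripped in spec_text:
            leaks.append(stripped)
    return leaks


def clean_room_rewrite(
    original_source: str,
    write_spec: Callable[[str], Spec],
    implement: Callable[[Spec], str],
) -> str:
    """Run spec writing and implementation with the spec as the only handoff."""
    spec = write_spec(original_source)  # step 1: this stage sees the original
    leaks = leaked_lines(original_source, spec)
    if leaks:
        raise ValueError(f"spec copies source text: {leaks}")
    return implement(spec)  # steps 2-3: this stage sees only the spec


if __name__ == "__main__":
    source = "def checksum(data):\n    return sum(data) % 65536\n"
    result = clean_room_rewrite(
        source,
        lambda src: Spec(["Return the sum of all byte values modulo 65536."]),
        lambda spec: "def checksum(b):\n    return sum(b) & 0xFFFF\n",
    )
    print(result)
```

  The function signature enforces the handoff, but it cannot show that the implementer had no prior exposure to the original, which is the part called hardest above.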
  - microtonal4 hours ago
    So what if a frontier model company trains two models, one including 50% of the world's open source projects and the second model the other 50% (or ten models with 90-10)?
    Then the model that is familiar with the code can write specs. The model that does not have knowledge of the project can implement them.
    Would that be a proper clean room implementation?
    Seems like a pretty evil, profitable product "rewrite any code base with an inconvenient license to your proprietary version, legally".
    anilgulecha4 hours ago
    LLM training is unnecessary in what we're discussing. Merely LLM using: original code -> specs as facts -> specs to tests -> tests to new code.
  - anilgulecha4 hours ago
    1 is claude-code1, outputs tests as text.
    2. Dumped into a file.
    3. claude-code that converts this to tests in the target language, and implements the app that passes the tests.
    3 is no longer hard - look at all the reimplementations from ccc, to rewrites popping up. They all have a well defined test suite as common theme. So much so that tldraw author raised a (joke) issue to remove tests from the project.
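    The "tests as text" handoff could look something like this: behavior recorded as plain input/output data, then checked against whatever implementation is written from it. The facts, the encoding labels, and `detect_encoding` are all hypothetical and are not taken from chardet or its test suite.

```python
import json
from typing import Callable, Dict, List

# Hypothetical behavioral facts for an encoding detector, stored as plain data:
# hex-encoded input bytes and the label the detector is expected to return.
FACTS_JSON = """
[
  {"input_hex": "efbbbf68656c6c6f", "expected": "utf-8-sig"},
  {"input_hex": "68656c6c6f", "expected": "ascii"},
  {"input_hex": "fffe68006900", "expected": "utf-16-le"}
]
"""


def detect_encoding(data: bytes) -> str:
    """Toy implementation written only against the facts above."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if all(byte < 0x80 for byte in data):
        return "ascii"
    return "unknown"


def check_facts(facts_json: str, implementation: Callable[[bytes], str]) -> List[Dict[str, str]]:
    """Return every fact the implementation fails, with the actual result added."""
    failures = []
    for fact in json.loads(facts_json):
        actual = implementation(bytes.fromhex(fact["input_hex"]))
        if actual != fact["expected"]:
            failures.append({**fact, "actual": actual})
    return failures


if __name__ == "__main__":
    print(check_facts(FACTS_JSON, detect_encoding))
```

    Because the facts are data rather than code, the same file can drive a test runner in any target language; whether such a file counts as uncopyrightable fact is the open question in this subthread.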
- nairboon5 hours ago
  No, GPL still holds even if you transform the source code from one language to another language.
  - anilgulecha5 hours ago
    That's why I carved it out to just the specs. If they can be read as "facts", then the new code is not derived but arrived at with TDD.
    The thesis I propose is that tests are more akin to facts, or can be stated as facts, and facts are not copyright-able. That's what makes this case interesting.
    nairboon4 hours ago
    I assumed that "tests" refers to a program too, which in this example is likely GPL. Thus GPL would stick already on the AI-rewrite of GPL test code.
    If "tests" should mean a proper specification let's say some IETF RFC of a protocol, then that would be different.
    anilgulecha4 hours ago
    Yes, I had not specified that in my original comment. But in the SOTA LLM world, the code/text boundary is so blurry as to be non-existent.
gbuk20132 hours ago
In my mind, if you feed code into an AI model then the output is clearly a derivative work, with all the licensing implications. This seems objectively reasonable?
- quotemstr10 minutes ago
  Nobody in this discussion knows what the words "derivative" and "work" mean individually, much less together
tgma33 minutes ago
Isn't the AFC (abstraction-filtration-comparison) test applicable here?
DrammBA5 hours ago
I like the idea of AI-generated ~code~ anything being public domain. Public data in, public domain out.
- lejalv5 hours ago
  This could be read as a reformulation of the old adage - "what's mine is mine, and what is yours, is mine too".
  So, you can pilfer the commons ("public") but not stuff unavailable in source form.
  If we expand your thought experiment to other forms of expression, say videos on YT or Netflix, then yes.
- kshri245 hours ago
  I don't think you can classify "public data in" as public domain. Public data could also include commercial licenses which forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is OSL.
  That's the core issue here. All models are trained on ALL source code that is publicly available irrespective of how it was licensed. It is illegal but every company training LLMs is doing it anyways.
  - thedevilslawyer5 hours ago
    Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is free game. LLM ingestion comes under fair use so no worries. If someone can get their hand on it, nothing in law stops it from training ingestion.
    We can debate if this law is moral. Like the GP, I too agree that public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.
    kshri244 hours ago
    > LLM ingestion comes under fair use
    I don't think so. It is nowhere near "limited use". The entirety of the source code is ingested for training the model. In other words, it meets the bar of the "heart of the work" being used for training. There are other factors as well, such as not harming the owner's ability to profit from the original work.
    thedevilslawyer4 hours ago
    https://www.skadden.com/insights/publications/2025/07/fair-u...
    Both Meta and Anthropic were vindicated for their use. Only Anthropic was fined, and that was for not buying upfront.
    shakna2 hours ago
    Alsup absolutely did not vindicate Anthropic as "fair use".
    > Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. [0]
    It was only fair use, where they already had a license to the information at hand.
    [0] https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
    kshri244 hours ago
    This hasn't gone to the Supreme Court yet. And this is just the USA. Courts in the rest of the world will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the world, with the majority living outside the USA. Jurisdiction matters in these things.
    thedevilslawyer3 hours ago
    Copyright's ambit has been pretty much defined and run by US for over a century.
    You're holding out for some grace on this from the wrong venue. The right avenue would be lobbying for new laws to regulate and use LLMs, not try to find shelter in an archaic and increasingly irrelevant bit of legalese.
    kshri242 hours ago
    I don't disagree. However, your assertion that copyright was initially defined by the US (which is not the case: it was England that came up with it, and it was adopted by the Commonwealth, which the US was also a part of until its independence) does not mean the jurisdiction is the US. Even if the US Supreme Court rules one way or the other, it doesn't matter, as the rest of the world has its own definitions and legalese that need to be scrutinized and modernized.
    gf0004 hours ago
    There are hardly any rulings/laws about the topic, and it quite obviously changes the picture of licenses.
- benob5 hours ago
  What about doing that with movies and music?
  - zodmaner4 hours ago
    The results would be the same: AI generated music and movies will be public domain.
    nkmnz2 hours ago
    So you’d lose all rights on pictures of yourselves if they were generated by AI? Would this be true even for nudes?
pu_pe4 hours ago
Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.
Cantinflasan hour ago
"If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. "
Software in the AI era is not that important.
Copyleft has already won, you can have new code in 40 seconds for $0.70 worth of tokens.
- p0w3n3dan hour ago
  Just take the code and let it AI rewrite. But... AI was taught on all the OpenSource Code available. Lot of them were GPL I think... So...
- WesolyKubeczek19 minutes ago
  Let’s then abolish all copyright on all software, what ever could go wrong?
foota5 hours ago
I think the more interesting question here would be if someone could fine tune an open weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe possible?) and then try to get it to produce a clean room implementation.
- benob4 hours ago
  I don't think this would qualify as clean room (the Library was involved in learning to generate programs as a whole). However, it should be possible to remove the library from the OLMO training data and retrain it from scratch.
  But what about training without having seen any human written program? Could a model learn from randomly generated programs?
  - foota3 hours ago
    > I don't think this would qualify as clean room (the Library was involved in learning to generate programs as a whole)
    Hm... I mean this is really one for the lawyers, but IMO you would likely successfully be able to argue that the marginal knowledge of general coding from a particular library is likely close to nil.
    The hard part here imo would be convincingly arguing that you can wipe out knowledge of the library from the training set, whether through fine tuning or trying to exclude it from the dataset.
    > But what about training without having seen any human written program? Could a model learn from randomly generated programs?
    I think the answer at this point is definitely no, but maybe someday. I think it's a more interesting question for art since it's more subjective, if we eventually get to a point where a machine can self-teach itself art from nothing... first of all how, but second of all it would be interesting to see the reaction from people opposed to AI art on the basis of it training off of artists.
    Honestly given all I've seen models do, I wouldn't be too surprised if you could somehow distill a (very bad) image generation model off of just an LLM. In a sense this is the end goal of the pelican riding a bicycle (somewhat tongue in cheek), if the LLM can learn to draw anything with SVGs without ever getting visual inputs then it would be very interesting :)
skeledrewan hour ago
Looks like copyright just died.
- cedwsan hour ago
  *for ordinary people. If you use AI to steal from rich and powerful people, expect the law to come down on you like a tonne of bricks. If you steal from authors, artists, and developers no worries.
gspr2 hours ago
> If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.
This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell even proprietary software?
blamestross2 hours ago
Intellectual property laundering is the core and primary value of LLMs. Everything else is "bonus".
duskdozer36 minutes ago
This is such scummy behavior.
verdverm5 hours ago
Interesting questions raised by the recent SCOTUS refusal to hear appeals related to AI and copyrightability, and how that may affect licensing in open source.
Hoping the HN community can bring more color to this, there are some members who know about these subjects.
est4 hours ago
Uh, patricide?
The key leap from gpt3 to gpt-3.5 (aka ChatGPT) was code-davinci-002, which was trained on GitHub source code after the OpenAI-Microsoft partnership.
Open source code contributed much to LLMs' amazing CoT consistency. Without the open source movement, LLMs would have been developed much later.
himata41135 hours ago
I mean in my opinion GPL licensed code should just infect models forcing them to follow the license.
You can do this a lot by saying things like: complete the code "<snippet from gpl licensed code>".
And if now the models are GPL licensed the problem of relicensing is gone since the code produced by these models should in theory be also GPL licensed.
Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.
- kshri244 hours ago
  > Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.
  Can you point to the clause? I have never seen it in any GPL license.
  - himata41133 hours ago
    it's the general copyright protection 'law' fair use and all that, varies by country tho.
spwa43 hours ago
Can we do the same with universal music? Because that's easy and already possible. Or Microsoft Windows? Because we all know the answer: if it works, essentially any government will immediately call it illegal.
Because if this isn't allowed, that makes all of the AI models themselves illegal. They are very much the product of using others' copyrighted stuff and rewriting it.
But of course this will be allowed because copyright was never meant to protect anyone small. And that it's in direct contradiction with what applies to large companies? Courts won't care.
- gspr2 hours ago
  The dark future possibility here is that the big guy is allowed to launder the intellectual property of the little guy, but not vice versa.
  - vetroman hour ago
    That dark future is now, look at case law as applied to the AI operators vs the 'little guys'.
    spwa4an hour ago
    Even big copyright firms. Disney especially is known for rehashing existing material and then not allowing anyone else to do the same with their stuff. Disney does not have a lot of original stories.