One of their engineers was able to recreate their platform by letting Claude Code reverse engineer their apps and the web frontend, creating an API-compatible backend that is functionally identical.
Took him a week after work. It's not as stable, the unit tests need more work, the code has some unnecessary duplication, and hosting isn't fully figured out, but the end-to-end test harness is even more stable than their own.
"How do we protect ourselves against a competitor doing this?"
Noodling on this at the moment.
As engineers, we often think only about code, but code has never been what makes a business succeed. If your client thinks that their business's primary value is in the mobile app code they wrote, then 1) why is it even open source, and 2) the business is doomed.
Realistically, though, this is inconsequential, and any time spent worrying about this is wasted time. You don't protect yourself from your competitor by worrying about them copying your mobile app.
If the platform is so trivial that it can be reverse engineered by an AI agent from a dumb frontend, what's there to protect against? One has to assume that their moat is not that part of the backend but something else entirely about how the service is being provided.
It's also US-only. Other countries will differ, which means you can only rely on this ruling for something you distribute solely in the US. That might be OK for art, but definitely not for most software, and very definitely not for a software library.
For example UK law specifically says "In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken."
AI can't be the author of the work. Human driving the AI can, unless they zero-shotted the solution with no creative input.
I think we haven't even begun to consider all the implications of this. While people ran with that one case where someone couldn't copyright a generated image, it's not that simple for code. There needs to be much more litigation before we can confidently say it's settled.
If "generated" code is not copyrightable, where do we draw the line on what "generated" means? Do macros count? Does code that generates other code count? Protobuf?
If it's the tool that generates the code, again, where do we draw the line? Is it just using 3rd-party tools? Would training your own count? Would a "random" code generator where you pick the winners (by whatever means) count? Would brute-forcing the entire space count (silly example, but we're in silly territory here)?
Is it just "AI" adjacent that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?
Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications..."
But for this type of copyright laundering, it doesn't really matter. The goal isn't really about licensing it, it's about avoiding the existing licence. The idea that the code ends up as public domain isn't really an issue for them.
They do something very similar for some of their work. It's hard to use external services, so they replicate them, and the cost of doing so has come down from "don't be daft, we can't reimplement Slack and Google Drive this sprint just to make testing faster" to realistic. They run the SDKs against the live services and against their own implementations until they don't see behaviour differences. Now they have a fast Slack and Drive and more (that do everything they need for their testing), accelerating other work.

I'm dramatically shifting my concept of what's expensive and what isn't in development. What you're describing could have been done by someone before, but the difficulty of building that backend has dropped enormously. Even if the application were closed, you could probably (now or soon) do the same thing: work back to the core user stories and rebuild both the backend and the app.
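The "run both against the same inputs until the behaviour matches" approach described above is essentially differential testing. A minimal sketch of the idea, with all names and functions invented for illustration (a real setup would call the live SDK on one side):

```python
import random

def live_service(key: str) -> str:
    """Stand-in for a call to the real, external service."""
    return key.strip().lower()

def local_reimplementation(key: str) -> str:
    """Stand-in for the in-house replica used to speed up tests."""
    return key.strip().lower()

def find_behaviour_differences(inputs):
    """Run both implementations on the same inputs and collect mismatches."""
    mismatches = []
    for key in inputs:
        expected = live_service(key)
        actual = local_reimplementation(key)
        if expected != actual:
            mismatches.append((key, expected, actual))
    return mismatches

# Hand-picked edge cases plus a pile of random inputs.
random.seed(0)
samples = ["  Hello  ", "WORLD", "MiXeD case", ""] + [
    "".join(random.choice("aA bB") for _ in range(8)) for _ in range(100)
]
diffs = find_behaviour_differences(samples)
print(f"{len(diffs)} behaviour differences found")
```

Iterating until the mismatch list stays empty over a large, varied input set is what lets the replica stand in for the live service during tests.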
You can view some of this as having things like the application as a very precise specification.
Really fascinating moment of change.
I know it's a provocative question, but that answers why a competitor is not really a competitor.
DMCA. The EULA likely prohibits reverse engineering. If a competitor does that, hit 'em with lawyers.
Or, if you want to be able to sleep at night, recognize this as an opportunity instead of a threat.
How do our competitors protect themselves against us doing this?
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
Question: if they had built one using AI teams in both "rooms", one writing a spec and the other implementing it, would that be fine? You'd need to verify the spec doesn't include source code, but that's easy enough.
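The "verify the spec doesn't include source code" step could be approximated mechanically by flagging long token sequences that appear verbatim in both the source and the spec. A toy sketch, with an arbitrary n-gram length and whitespace tokenisation chosen purely for illustration:

```python
# Flag any sufficiently long token sequence shared verbatim between the
# original source and the generated spec. Threshold and tokenisation are
# arbitrary choices for this sketch.

def ngrams(tokens, n):
    """All contiguous n-token windows, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(source: str, spec: str, n: int = 6) -> set:
    """Token n-grams the spec shares verbatim with the source."""
    return ngrams(source.split(), n) & ngrams(spec.split(), n)

source_code = "def detect(data): state = reset(); return feed(state, data)"
clean_spec = "The function accepts raw bytes and returns the best encoding guess."
leaky_spec = "Implement it as: def detect(data): state = reset(); return feed(state, data)"

print(verbatim_overlap(source_code, clean_spec))          # empty set: no leakage
print(bool(verbatim_overlap(source_code, leaky_spec)))    # True: spec leaks source
```

Of course this only catches verbatim copying; a spec can still smuggle structure through paraphrase, which is exactly the concern raised further down the thread.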
It seems to mostly follow the IBM-era precedent. However, since the model probably had the original code in its training data, maybe not? Maybe valid for a closed-source project but not an open-source one? Interesting question.
It doesn't matter how they structure the agents. Since chardet is in the LLM training set, you can't claim any AI implementation thereof is clean room.
Might still be valid for closed source projects (probably is).
I think courts would need to weigh in on the open source side. There's legal precedent that you can use a derived work to generate a new, unique work (and the spec derived from the copyrighted code is very much a derived work). There are rulings that LLMs are transformative works, not just copies of their training data.
LLMs can’t reproduce their entire training set. But this thinking is also ripe for misuse. I could always train or fine-tune a model on the original work so that it can reproduce the original. We quickly get into statistical arguments here.
It’s a really interesting question.
No, I don't think so. I hate comparing LLMs with humans, but for a human, being familiar with the original code might disqualify them from writing a differently-licensed version.
Anyway, LLMs are not human, so as many courts confirmed, their output is not copyrightable at all, under any license.
If true, it would mean most commercial code being developed today, since it's increasingly AI-generated, would actually be copyright-free. I don't think most Western courts would uphold that position.
While it feels unlikely that a simple "write this spec from this code" + "write this code from this spec" loop would actually trigger this kind of hiding behaviour, an LLM trained to accurately reproduce code from such a loop would definitely be capable of hiding code details within the spec, and you can't reasonably prove that the frontier LLMs haven't been trained to do so.
Edit: this is wrong
Also, it's weird that it's okay apparently to use pirated materials to teach an LLM, but maybe not to disseminate what the LLM then tells you.
All AI-generated code is tainted with GPL/LGPL, because the LLMs might have been trained on it.
That is, however, stricter than what's actually legally necessary. It's just that the actual legal standard would require a court ruling to determine whether you passed it, and everyone wants to avoid that. As a consequence, there also aren't a lot of court cases to draw similarities to.
This is actually a harder standard than some people think.
The absolute clean-room approaches in the USA exist because they help short-circuit a long lawsuit, where a bigger corp can drag things out forever until you're broke.
However, the copyright system has always been a sham to protect US capital interests, so I would be very surprised if this is actually ruled on and enforced that way. And in any case, American legislators can just change the law.
If the code is different but API-compatible, the Google v. Oracle Java case shows that if the implementation is different enough, it can be considered a new implementation, clean room or not.
I don't think this is a precedent either; plenty of projects have changed licenses, lol.
I keep kind of mixing them up, but the GPL licenses keep popping up in occasional horror stories. Maybe the license is just poorly written by today's standards?
They usually did that with approval from existing license holders (except when they didn't, those were the bad cases for sure).
I rewrite it, my head full of my own, original, new ideas. The results turn out great. There's a few if and while loops that look the same, and some public interfaces stayed the same. But all the guts are brand new, shiny, my own.
Do I have no rights to this code?
But code that is any kind of derivative of earlier code contains a complex mix of other people's rights. It can be relicensed, but only if all authors, large and small, agree to the terms.
What if I decide to make a JS or Rust implementation of this project and use it as inspiration? Does that mean I'm no longer doing a "clean room" implementation and my project is contaminated by LGPL too?
Generally relicensing is done in good faith for a good reason, so pretty much everyone ok's it.
Trickiness can turn up when code contributors aren't contactable (i.e. dead, missing, etc.), and I'm unsure of the legally sound approach to that.
I understand you need to publish the source code of your modifications, if you distribute them outside of your company.
Why did this new project need to replace the original in this dishonourable way? The proper way would have been to create a genuinely new project.
Note: even Python's own pip seems to drag this in as a dependency (hopefully they'll stick to a proper version).
Be really careful who you give your projects keys to, folks!
2,305 files changed
+0 -546871 lines changed
https://github.com/chardet/chardet/commit/7e25bf40bb4ae68848...

That is just the easiest way to disambiguate the legal situation (i.e. the most reliable approach to prevent it from being considered a derivative work by a court).
I'm curious how this is gonna go.
I don't think that the second sentence is a valid claim per se; it depends on what this "rewritten code" actually looks like (IANAL).
Edit: my understanding of "clean room implementation" is that it is a good defence against a copyright infringement claim, because there cannot be infringement if you don't know the original work. However, it does not mean that NOT doing a clean-room implementation implies infringement; it's just that it is potentially harder to defend against a claim if the original work was known.
As the LGPL says:
> A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
Is v7.0.0 a [derivative work](https://en.wikipedia.org/wiki/Derivative_work)? It seems to depend on the details of the source code (implementing the same API is not copyright infringement).
Especially now that AI can do this for any kind of intellectual property: images, books, or source code. If judges were to allow an AI rewrite to count as an original creation, copyright as we know it would completely end, worldwide.
Instead, what's more likely is that no one is gonna buy that shit.
The change log says the implementation is completely different, not a copy paste. Is that wrong?
>Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
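The changelog's "probers replaced by pipeline stages" can be pictured as a chain of filters that each narrows a set of candidate encodings. The sketch below is NOT the actual chardet 7 code; the stages and heuristics are invented to illustrate the architectural idea:

```python
# Toy stage-pipeline for encoding detection. Every stage takes the raw
# bytes and the surviving candidate set, and returns a narrowed set.

def bom_stage(data, candidates):
    """Stage 1: a UTF-8 byte-order mark settles the question immediately."""
    if data.startswith(b"\xef\xbb\xbf"):
        return {"utf-8"}
    return candidates

def ascii_stage(data, candidates):
    """Stage 2: pure 7-bit input keeps ASCII; high bytes rule it out."""
    if all(b < 0x80 for b in data):
        return candidates & {"ascii", "utf-8", "latin-1"}
    return candidates - {"ascii"}

def utf8_stage(data, candidates):
    """Stage 3: drop utf-8 if the bytes don't decode as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return candidates - {"utf-8"}
    return candidates

PIPELINE = [bom_stage, ascii_stage, utf8_stage]

def detect(data: bytes) -> str:
    candidates = {"ascii", "utf-8", "latin-1"}
    for stage in PIPELINE:
        candidates = stage(data, candidates)
    # Arbitrary preference order for the toy example.
    for enc in ("ascii", "utf-8", "latin-1"):
        if enc in candidates:
            return enc
    return "unknown"

print(detect(b"hello"))                  # -> "ascii"
print(detect("héllo".encode("utf-8")))   # -> "utf-8"
```

The contrast with a "prober" design is that here each stage sees the output of the previous one, rather than every prober independently scoring the input and a coordinator picking the highest score.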
Only after that would the burden be on the defendants, such as to give a defense that their usage is sufficiently transformative to qualify as fair use.
I’m not sure that “a total rewrite” wouldn’t, in fact, pass muster - depending on how much of a rewrite it was of course. The ‘clean room’ approach was just invented as a plausible-sounding story to head off gratuitous lawsuits. This doesn’t look as defensible against the threat of a lawsuit, but it doesn’t mean it wouldn’t win that lawsuit (I’m not saying it would, I haven’t read or compared the code vs its original). Google copied the entire API of the Java language, and got away with it when Oracle sued. Things in a courtroom can often go in surprising ways…
[edit: negative votes, huh, that’s a first for a while… looks like Reddit/Slashdot-style “downvote if you don’t like what is being said” is alive and well on HN]
This is not a good analogy.
A "rewrite" in context here is not a reproduction of the original work but a different work that is functionally equivalent, or at least that is the claim.
Like Tanya Grotter? https://en.wikipedia.org/wiki/Tanya_Grotter
“chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x”
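One cheap way to sanity-check a "same public API, drop-in replacement" claim is to diff the public names the old and new modules expose. The sketch below uses two hypothetical stub modules standing in for the 6.x and 7.x packages; the names are invented for illustration:

```python
import types

# Stub modules standing in for the old and new package versions.
old = types.ModuleType("chardet_old")
old.detect = lambda data: {"encoding": "utf-8", "confidence": 1.0}
old.detect_all = lambda data: [old.detect(data)]

new = types.ModuleType("chardet_new")
new.detect = lambda data: {"encoding": "utf-8", "confidence": 1.0}
new.detect_all = lambda data: [new.detect(data)]

def public_api(module) -> set:
    """Public (non-underscore) names exposed by a module."""
    return {name for name in dir(module) if not name.startswith("_")}

missing = public_api(old) - public_api(new)
added = public_api(new) - public_api(old)
print("missing from new:", missing)   # must be empty for a true drop-in
print("new additions:", added)
```

Name-level parity is only the weakest form of the claim, of course; the behavioural equivalence that actually matters is what the differential-testing idea elsewhere in the thread checks.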
Do people not write anymore?
What is this recent (clanker-fueled?) obsession to give everything fancy computer-y names with high numbers?
It's not a '12 stage pipeline', it's just an algorithm.
Do you know this area, and are you commenting on the code?
Like, "we don't like copyright, but since you insist on enforcing it and we can't do anything against it, we will invent a clever way to use your own rules against you".
They are literally stealing from open source, but it's the original license that is the issue?