It usually goes something like: If I can make money by learning something from a web page, why does a computer making money by learning everything from everyone upset people so? It’s the same thing!
It’s like if I go to Golden Gate Park and pick one flower, I shouldn’t do that, but no one cares. But if I build a machine to automatically cut every flower in the park because I want to sell them, that’s different.
“You say I can pick one flower, but you get upset when I take a bunch. That’s inconsistent. Check and mate.”
But quantitative changes in an activity produce qualitative changes. Everyone knows this, but sometimes they seem to find it inconvenient to admit it. Not that effects of the qualitative change are always bad, but they are often different, and worth considering rather than dismissing.
If one word is stolen by AI, that's bad. If a million words are stolen by AI, that's business.
That doesn't work anymore. Google provides an AI-generated summary, and nobody looks at the original site.
We found our data in the outputs of their models but who can do anything about it...
If the crawlers refuse to voluntarily respect your robots.txt, then you are well within your rights to poison their data.
Sue for $180,000 per infringement which should be calculated for each illegal API call.
Unauthorized access, system damage, and maybe even extortion all apply here.
well, at least in the case of google, I'm pretty sure that's the point. Or at least, they are doing things that would seem to be moving towards being an oracle with all the answers and not the signpost that points you in the right direction. The destination rather than the gateway.
These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!
I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.
Most legit search engines are going to honor robots.txt and you can disallow access.
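For anyone who wants to see what "honoring robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` user agent, and the example.com URLs are all made up for illustration; the point is only that a well-behaved crawler runs a check like this before fetching, and a badly behaved one simply skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one named crawler out entirely and keep
# everyone else out of /drafts/. "ExampleAIBot" is an invented user agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch.
blocked = parser.can_fetch("ExampleAIBot", "https://example.com/post/1")
allowed = parser.can_fetch("SomeSearchBot", "https://example.com/post/1")
drafts = parser.can_fetch("SomeSearchBot", "https://example.com/drafts/x")
```

Nothing here is enforced on the server side: robots.txt is a request, which is why the next steps involve actual blocking.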
Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.
Next would be putting the content behind some form of auth.
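To make the rate limiting step a bit more concrete, here is a rough sketch of a per-client token bucket. This is not how Cloudflare's bot fight mode works internally (I have no idea what they do); it is just the textbook technique, and the class names, the `rate`/`capacity` parameters, and keying clients by IP string are my own assumptions.

```python
import time


class TokenBucket:
    """One client's budget: refills at `rate` tokens/sec, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token on this request
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client id (hypothetically, the client's IP)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_id: str, now=None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)


limiter = RateLimiter(rate=1.0, capacity=3)
burst = [limiter.allow("203.0.113.7", now=0.0) for _ in range(5)]
later = limiter.allow("203.0.113.7", now=2.0)
other = limiter.allow("198.51.100.9", now=0.0)
```

A burst of five requests at the same instant gets three through and two rejected; two seconds later the bucket has refilled enough to allow another. This is also where you "start to annoy some people": anyone sharing an IP with a heavy client shares its bucket.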
That being said you would require your user to download a compatible browser for gemini/gopher.
Even when we do actually put physical locks on things they are mostly there to show that someone breaking in did so intentionally and not at all designed to prevent motivated attackers.
Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.
Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.
> Although Anubis could be altered to mine cryptocurrency to serve as proof of work, Iaso has rejected this idea: "I don't want to touch cryptocurrency with a 20 foot pole."
Which in my mind is a shame. Crypto is an absolute mess, yes, but this seems like an elegant way to get something back for putting things out there.
Between seeing ads and doing a little bit of proof-of-work for the author, I'd choose the latter.
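For context on what "a little bit of proof-of-work" looks like, here is a hashcash-style sketch: the server hands out a challenge string and the client has to find a nonce whose SHA-256 hash starts with a given number of zero bits. This is not Anubis's actual code; the challenge format, the `difficulty` parameter, and the function names are assumptions for illustration.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap for the server: a single hash per check."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Expensive for the client: about 2**difficulty hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


# Hypothetical challenge; a real server would issue a fresh random one.
nonce = solve("example-challenge", difficulty=12)
ok = verify("example-challenge", nonce, difficulty=12)
```

The asymmetry is the whole trick: one visitor barely notices the work, while a crawler hitting millions of pages pays for it millions of times. Whether that work gets thrown away or spent on something the author can cash in is exactly the design choice being argued about here.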
What's even crazier to think about is that to use the latest versions of these models for which you supplied training data, you have to pay hundreds of dollars a month. I would love to get a settlement check proportional to my model weights. Even if it's $0.10, at least everyone out there will get what they're owed.
I do not value copyright. All it does is give you standing to sue if somebody reproduces your work. It does not differentiate or account for parallel creation. I cannot count how many times I have "created" something, only to find it in a research paper later.
Part of the reason I think copyright has no value is that, in general, individual copyright owners don't have the deep pockets necessary to sue someone who violates their copyright. If anyone is violating the spirit of copyright, it's corporations that insist you assign your work over to them as a work for hire, or outright ignore your copyright. (looking at you, Disney's Atlantis).
A significant benefit of AI that doesn't get talked about enough is that AI has a much greater reach over all the information it was trained on and can draw connections that would be invisible to someone operating at the human scale.
Ad blocking has always been a problem for creators but it's aimed at big corps - non-creators. The creators asked people to support them other ways or turn off the blocking. And it's not like the little independent creators wanted this version of commercialized internet in the first place.
The AI marketing teams are spinning everything they can, but no: AI companies are the conscript, the vultures. No question about it.
The number of people who will not ever load your ads is around 30%.
I can tell you that creators talk about this a lot in private, but will not publicly because the internet has a mass delusion on how creation and compensation work. It's like trying to convince Christians that Jesus obviously didn't come back from the dead days later, despite there being no logical system available that would explain it.
If we were to try and map out a functional internet where everyone wins, users and creators, there is no example where ad blocking is anything other than net harmful. You either get volunteer net where 0.01% share hobby posts on their own dime for the other 99.99% or you get IRC where 99% of the population doesn't really benefit (à la 1993).
People can easily justify their own piracy because it’s small scale. Even when they organize, create a whole software and tooling ecosystem around pirating media to stick into jellyfin or plex. AI still did it bigger and worse and is bad, what I’m doing is not so bad because I wasn’t going to buy the movie anyway, etc.
It's in no way, shape, or form "small scale", and has fundamentally changed the very nature of the internet for the worse (opinions/views of ad blocking people don't matter).
There is no viable model where "have stuff but not pay for it" works out.
Many of the websites I read do not collect any appreciable amount of money from ads, or have no ads at all (one example: news.ycombinator.com :) ). They want recognition, or to share the knowledge, or community, or they are building their brand... And AI is destroying all of this: the first result for "zx80" is an AI overview with a link to Wikipedia and some YouTube videos. If a person stops there, they will never get to the computinghistory.org.uk link, and won't see any related information about the variants and models.
When you click "news.ycombinator.com" you are clicking on the ad.
:)
1. LLM/transformer technology is legitimately amazing and revolutionary. 2. In the end, they function as an enormous, effective database for most human knowledge.
Point 1 obscures the fact that if someone just created an SQL database with every digital artifact in existence and provided it for free upon request, there would be no ambiguity whether that was legal or not.
But distillation, etc obscures this relationship and it looks like something other than straight lookup, at least in part because it is obviously more than that.
I don't even think this is true. We just didn't know how simple this all is. We just found out because we now have the compute power.
The argument, as I understand it, is that the "theft" is in quotes because it's not literally copyright infringement, but fair use of an old public-domain folk tale that ends up consuming the latter.
Today, when kids know "Aladdin" they know the copyrighted/trademarked Disney character, not the traditional folk tale; that's the "theft" that happened.
Disney made a cartoon of the story without understanding the culture it comes from, with the main purpose of selling it to an audience with even less understanding. And the result was a horrible misrepresentation of somebody else’s cultural heritage.
https://en.wikipedia.org/wiki/The_death_of_one_man_is_a_trag...
Lord of the Rings will be under copyright until roughly 2050. I think Tolkien's estate has gotten more than enough money from that book and it's time to let others use the word hobbit without the threat of a lawsuit.
I expect it would not move the needle much. I support reduced copyright periods, though not in the specific way you do. But that's not what we're talking about here, is it? The comment I replied to seemed to be advocating for total abolition of copyright law, and my comment is written to be interpreted in that context.
> To the point that most people will never be legally allowed to directly build off of the culture they grew up in.
What specifically are you talking about? Every author borrows from what came before. Copyright law doesn't even enter the picture in the vast majority of cases, because you generally don't have to copy to "build off of the culture [you] grew up in".
Even before AI more people tried to be an author/musician than could ever hope to gain even financial success. I don’t think less copyright will dissuade them.
> every author borrows
Borrows yes. But that has changed drastically in the last 100 years because of what has become the copyright system.
I’ll be long dead and gone before people can make and publish their own LOTR, or Star Wars, or whatever franchise they grew up with. Disney would be impossible to start given the current regulations, all those tales would be locked up, and we would all be worse for it.
Without copyright, nothing stops one from simply selling a book under their own name.
Big publishers could just reprint anything and get it into brick & mortar stores. No money for authors.
Advocating for absolutely no copyright is wild.
Citation needed, as well as your precise definition of "worthwhile".
> Even if they are not enjoyable.
Huh?
> The dissemination of ideas from an activist perspective is uninhabitable
Yes, I understand that anti-copyright activists want to abolish copyright.
In reality most art is done because the artist has something to say, and the money they get from it is only motivating in as much as it enables the artist to do more art. So I would guess in a world without copyright protection we would just find other ways to pay artists and a very similar amount of art would be produced.
You can see an example of this e.g. in Iceland, where the market is way too small for art aimed at the domestic market to make enough money solely by selling it (possible with music; rare with books; not possible with movies). Instead the state has an extensive “artist salary” program, which pays artists regardless of how well the art they produce sells. Unsurprisingly Iceland produces a lot of art and has many working artists.
How do you explain the creative works of writing, music, and art that existed in the millennia of human history between the Mesopotamians and the Enlightenment era?
Open source actually demonstrates that copyright serves a purpose. There are still customers for non-open software, even when open alternatives exist, so the ability to monetize brings new offerings to the economy.
Or are you suggesting open source software is public domain?
Are we going the communist soviet union route where everything is decided by central committee?
Those of us who create for creation's sake need no other reason. I create because I want to, not because I want to use it to gain capital.
Sure, those lines get muddy when you want to do it professionally, but that's a separate argument.
How do you create without capital? To make a film you need a camera crew, a sound crew, set designers, caterers, a director, scriptwriters. A world without professional creatives is so much poorer than the world we already have. Why would you give it up just for some vague notion of ideological purity?
Would you be able to create big-budget movies without said big budget? Of course not. I obviously like some of those too, but who's to say that the larger budget made them better? It feels like you're conflating art creation with art business, but they are not the same thing.
>I obviously like some of those too, but who's to say that the larger budget made them better?
If you legitimately believe something like 2001: A Space Odyssey would be as good with a budget of $10,000 then that just seems delusional.
The world you want is one in which the only people who can create things are people who are wealthy by other means, there is no pathway for a talented but poor kid to go from making home movies to working on films without IP laws. They must abandon their dreams and go work in the coal mines or whatever. It is dystopian.
I want the most amount of people possible to be able to work as professional creatives because it enriches my life and the lives of everyone in the country I live in.
You do realize people created and shared things long before copyright became a thing, right?
For example, if we know the formulation of a drug, then that drug plus any small modification (made through AI) could count as a new formulation. That would break the current medical patent system.
For lots of online knowledge/blogs I guess it is true but even here I often read explainer blogs because AI casts everything in a certain narrative/tone that isn’t always appropriate.
As a teenager I used to proclaim that "you can't own bits, maaaan" all the time. I've since grown up. Intellectual property is essential to safeguarding intellectual work. I'm not saying this out of greed – I'm a vocal advocate for the free software movement. It, too, relies on a semi-sane framework of intellectual property. So do Hollywood studios. So do the makers of AI (well, since they're not actually sustainable at all currently, I guess you can say they don't rely on anything).
If you’re a pleb, stealing copyrighted materials will get you some nasty fines, lawsuits and criminal charges. If you’re a megacorp with unlimited buckets of cash, then there is no accountability.
You can't steal or profit off of that data, but it's fine for them for whatever reason. I guess because they're a force for good in the world and are pushing humanity forward eh?
“No one is surprised, jackass, it’s just adults having a conversation about the current state of affairs.”
Yes, it’s tiring and rarely contributes positively to the conversation.
Because the sources are now polluted with AI. That's at least one reason they stop scraping.
The reason is quite simple. When Microsoft steals YOUR work, GDP go up. When YOU steal Microsoft's work, GDP go down. And the people who create and enforce our laws want GDP to go up. To these people morality and rights are a thin guise that can be conveniently discarded when it's inconvenient for them.
the reason is crony capitalism. I wish I knew what the fix was
Then DeviantArt built a tool to automate the "make a similar image yourself" part and here we are. It removed all the fun parts: the personal contact, the attribution, the inspiration.
Artists realized they unwittingly contributed to the death of not only the community, but the art form they love. Lawsuits pending.
This is pretty much the exact claim of a NYT lawsuit against OpenAI.
"One example: Bing Chat copied all but two of the first 396 words of its 2023 article “The Secrets Hamas knew about Israel’s Military.” An exhibit showed 100 other situations in which OpenAI’s GPT was trained on and memorized articles from The Times, with word-for-word copying in red and differences in black."
https://www.hollywoodreporter.com/business/business-news/cou...
You can get it to reproduce content but it’s a game of cat and mouse. Were it not for the alignment to avoid direct reproduction, it would happen far more often.
> RECAP consistently outperforms all other methods; as an illustration, it extracted ≈3,000 passages from the first "Harry Potter" book with Claude-3.7, compared to the 75 passages identified by the best baseline.
Fair use generally does not cover commercial use, which this clearly is, and is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”
This is all new territory. We don't have court-settled law yet.
Commercial use counts _against_ a fair use defense, but is not dispositive: it's not accurate at all to say it "generally does not cover" commercial use. This is the "purpose and character" test, one of four in contemporary (United States) fair use doctrine.
Purpose and character also includes the degree to which a use is _transformative_. It's clear that the degree to which a training run mulching texts "transforms" them is very high. This counts toward a fair use finding for purpose and character.
> is dependent on the amount of the original content present in the derived work, which I would contend in this case is “all of it”
The "amount and substantiality" test. Your case for "all of it" can't possibly be sustained: the models aren't big enough. It's amount _and_ substantiality: this has come up in the publication of concordances, where a relatively large amount of a copyrighted work appears, but it's chopped up and ordered in a way which is no longer substantially the same. Courts have ruled that this kind of text is fair use, pretty consistently. It's not an LLM, of course, but those have yet to be ruled on.
Also worth knowing that courts have never accepted reading or studying a work as incorporation, and are unlikely to change course on the question. It's taken for granted that anyone is allowed to read a copyrighted work in as much detail as they wish, in the course of producing another one. Model training isn't reading either, but the question is to what degree it resembles study. I'd say, more than not.
Specifically:
> it’s impossible to make a useful model without the whole book and all of the artistry that went into it
Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.
"Effect upon the work's value" is probably the most interesting one. For some things, extreme, for others, negligible. I suspect this is the one courts are going to spend the most time on as all of these questions are litigated.
Ultimately, model training is highly out-of-distribution for the common law questions involving fair use. It was not anticipated by statute, to put it mildly. The best solution to that kind of dilemma is more statute, and we'll probably see that, but, I don't think you'll be happy with the result, given what I'm replying to. Just a guess on my part.
> Courts have never once accepted "it would be impossible for defendant to write his biography without reading plaintiff's" as valid, and it's been tried. The standard for plagiarism is higher than that.
This I think misses the thrust of my argument, though. It's hard to find an exact human analogy, because neither the technology nor the scale at which it operates is remotely human.
I see it less as “writing his biography without reading the plaintiff’s” and more as “using the same style and metaphors to make thousands of copies of very similar biographies, with certain bits tweaked,” like turning an existing work into a Mad Lib.
I don’t know how the courts will eventually rule on it, but it certainly feels like theft to me.
But pretending you said "infringement", for me it comes all the way back to the Constitution: "To promote the Progress of Science and useful Arts". I cannot possibly twist the development of large language models into something which violates the spirit of that purpose. I don't see how anyone can.
Your point about the scale is valid, and the alienness of it, sure. But you haven't made the case that the vastness of the scale should affect the conclusion.
Something I left out in the first post is that copyright is meant to protect expression, and not ideas: this is the deciding factor in the 'nature of the copyrighted work' test for fair use. More expression, more protection: more ideas, less.
I think the visual arts have a strong case that image generators directly infringe expression: I'm not convinced that authors do, and I think software should never have been protected under copyright because the ideas-to-expression ratio is all wrong for the legal structure. There's clearly no scale case to be made for ideas: "but what if it's _all_ the ideas" fails, because the ideas are not protected at all. Nor should they be, that's what patents are for, and why patents are very different from copyright.
LLMs are remarkably good at 'the facts of the matter', hallucination notwithstanding. They're very poor at authorial 'voice transfer', something image generators are far too good at. It's when I start asking myself "well what even _is_ this 'expression' thing anyway?" that I conclude that we're out over our skis on the LLMs-and-IP question: precedent can't tell us enough, and that leaves legislation.
It’s kind of the harness that is doing the citing (or providing the context for the model to).
But an LLM sans search can reproduce some copyrighted work with minor variations and there’s no way to know exactly where it came from.
You could say the same about MP3 encoders but I don't think that would convince any judge
A copy made for the purposes of training is still a copy.
Even if you throw the text away after training, you've still made a copy.
I have no problem with taxing AI companies so that their profit is marginal, or forcing them to provide compute for free. That seems like the correct balance of what they're harvesting from the "commons" (which is really just the totality of private IP that was exposed to their crawlers).
Open weight model trained with no attribution on all of Oracle's internal repos. It's only fair.
I'm having a hard time understanding what's wrong here? Unless the link text is very long, why would someone linking to your article use different words for the link text?
One is a recipe for apple fritters, and the other is an informal ranking of apples by flavor.
Let's say your apple fritter recipe links to your apple ranking list.
Later, you discover someone copied your apple fritter recipe without credit, but it still links to your apple ranking list, using the same wording as your recipe. They're getting more Google SERP juice and ad revenue than yours, despite stealing your article.
Do you see the problem?
nla: if you create content online (public repo code, blog, podcast, YouTube, publishing) the smartest thing you can do is to file a US copyright, even if you have a hobby blog.
Anthropic paid $1.5B in a class settlement to authors because it was piracy of copyrighted works. If we as a HN community had our works protected, there are potentially huge statutory damages for scraping by any and all LLMs. I work with hundreds of writers and publishers and am forming a coalition to protect and license what they're creating.
Edit: remember not to down vote ideas you disagree with. I think it was only down vote things that lower the discourse
So yes, set up some scripts, you can go back 90 days from when you file (you get a grace period). Also if you're publishing frequently to a blog, repo, or newsletter, you can save cost by filing each article under a group registration. Ping me if you need help.
There are tens of millions of registered copyrights in the US, nearly every published book, music, artwork, many magazines and major websites. Here's the official link, you can search the registry and there is a ton of info: https://www.copyright.gov/registration/
Your cause is already lost.
Good luck enforcing whatever frivolous lawsuits you have cooking up against open weights Chinese models that anyone with a newer graphics card can crank out inference on.
[0] https://archive.org/details/hisyo00simo/page/n1/mode/2up
I don't think we should "get over" the fact that modern SOTA models couldn't exist without being trained on protected works.
That someone, at some point, paid for.
I'd like to understand why I can't use a song in one of my videos without permission/payment, but an AI company can train models using that song without having either.
I'm not anti-AI. I'd just like to see companies play by the rules everyone else has to follow.
Because training isn't redistribution.
You can also listen to the song and make a new one that sounds similar, just like the AI can.
Answer: They did not. That is literally why there are dozens of ongoing lawsuits in progress.
You're right, it's an unjust situation. And you may note that no one else besides the AI companies has made any progress at all towards changing it.
Copyright will soon die, having outlived its usefulness to society. Whether the knife is held by someone named Stallman or someone named Altman is of little consequence.
Because when you say you are “using” the song, what you mean is that you are distributing copies of the song, which is protected by copyright.
When AI companies train on the song, the model is learning from it. Outside of the rare cases of memorisation, this is not distributing copies and so copyright doesn’t have any say in the matter.
Learning isn’t copying, so copyright doesn’t get involved at all.
The New York Times is suing both OpenAI and Microsoft for copyright infringement. The Authors Guild is suing OpenAI. Getty Images is suing Stability AI. Disney is suing Midjourney. Universal Music Group and Sony have filed suits against multiple AI companies.
> so copyright doesn’t get involved at all.
The dozens of ongoing cases discredit that statement.
Your objection doesn’t make sense. In the event that an AI company loses a lawsuit for copyright infringement based on simply training on copyrighted works, the answer to you saying you’d like to understand why they can do it and you can’t is simply “your premise is wrong; neither of you can”.
I object to your statement that "copyright doesn’t get involved at all" when that is objectively untrue. If that was true, many of the world's largest companies wouldn't be spending tens of millions of dollars to have that question answered in court. Go to any law-focused forum, and you will find attorneys arguing over these questions.
To train a model using a book, you must first obtain a copy of that book. Did OpenAI purchase a copy of every book not already in the public domain used during training? They did not.
Some of the suits I mentioned claim that OpenAI literally stole copies of books to train its models.
My point is that the copyright question has not been answered. If the NYT et al. win, it will be a watershed moment for how AI companies pay for training data moving forward.
I'm working on paving over the Amazon rainforest so I can build the world's largest roller coaster, but for some reason people keep trying to talk me out of it. Good thing I have this bucket of sand to put my head in so I can tune them out.
But intentionally blinding yourself to the debate and plowing ahead anyway (which is how I interpreted your parent comment) sounds like willful ignorance.
I can see from a lot of replies the "cool" threshold is undefined, but here goes:
For myself it let me finish a project I started a year ago for measuring how much home energy efficiency upgrades will reduce my AC usage. I bought a pile of Raspberry Pi Picos and turned them mostly into temperature reading devices, but also one that can detect when my AC turns on.
So I can record how often my AC runs and I can record the temperature at various points around the house, which lets me compare like-for-like before-and-after.
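To give a flavor of the before-and-after comparison, here is a rough sketch of the analysis side, not the code running on the Picos. It assumes the exported log can be reduced to a sorted list of `(timestamp_seconds, is_on)` state changes for the AC; that format and the function name are made up for illustration.

```python
def on_fraction(events, start, end):
    """Fraction of the window [start, end] during which the AC was on.

    `events` is a sorted list of (timestamp_seconds, is_on) state changes.
    Events before `start` only establish the initial state.
    """
    on_time = 0.0
    state = False  # assume the AC is off until an event says otherwise
    last = start
    for ts, is_on in events:
        if ts < start:
            state = is_on
            continue
        if ts > end:
            break
        if state:
            on_time += ts - last
        last = ts
        state = is_on
    if state:
        on_time += end - last  # still running when the window closes
    return on_time / (end - start)


# Hypothetical hour of data: two 10-minute cooling cycles.
events = [(0, True), (600, False), (1800, True), (2400, False)]
duty = on_fraction(events, start=0, end=3600)
```

Computing this per hour and pairing it with the logged temperatures is one way to get a like-for-like comparison: the same outdoor conditions before and after the upgrade should show a lower on-fraction if the upgrade helped.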
The easy but unrealistic way to accomplish what I want is to use Python. It gives me access to a file system, a shell, and all sorts of other niceties. But I wanted to run these on two AA batteries and based upon my measurements they would last about 2 weeks. I tested using C instead and they should last 4 months. That's long enough for my use case. There's enough flash storage for that time period too.
However this means I need to write all the utilities for configuring the Picos myself. There's all sorts of annoying things such as having to set the clock (picos lose it anytime they lose power), having to write directly to flash memory (no operating system), having to write a utility for exporting that data from flash memory, and so on.
And AI coding let me burn through a pile of code I knew how to write but didn't care to burn my weekends doing so.
The pattern is the same for my friends who are software devs. And yeah, you're probably never going to see any of it, but that's not why they're making it, they don't want the maintenance burden.
Moreover, all of the tools that the people who build software use are also cool stuff.
It's also not just code and software that is benefitting from these new tools. Use of LLMs in engineering tasks is blowing up right now.
I'm really not trying to be a hater but when people tell me that we're already in the AI Nirvana it gives me pause.
New php extension https://github.com/hparadiz/ext-gnu-grep
A Demo showing how to stream webrtc to KDE Wayland overlay. https://github.com/hparadiz/camera-notif
A fun little tool that captures stdout/stderr on any running process. https://github.com/hparadiz/bpf_write_monitor
Then I upgraded my 10 year old hand written framework to a new version that supports sqlite and postgres on top of existing MySQL support https://github.com/Divergence/framework
But then I was like eh lemme benchmark every PHP orm that exists just to check my framework's orm....
https://github.com/hparadiz/the-php-bench
And published the results.... Here
https://the-php-bench.technex.us/
And then I decided to vibe code a simulation of the entire local stellar group https://earth.technex.us
Followed by my simulation of the Artemis 3 landing sites at the lunar South pole https://artemis-iii.technex.us/?scale=1.000#South-Pole
And I left the best for last.....
https://github.com/hparadiz/evemon
A brand new task manager written in C for Linux that supports a plugin architecture with an event bus. It's literally the best gui Linux task manager ever. Still working on it.
I'm not even talking about my paid job. This is me just fucking around.
If you think none of this stuff is cool I don't even respect you as a dev.
It's obviously a hobby project. But you'd be hard pressed to find a more old school, in the weeds programmer than him, and even he's building cool stuff with AI.
Not sure who you're referring to with "those developers"?
Agendas like, "let's not check our API key into a public github repo" or "Let's not store passwords in plaintext" or "Don't expose customer data via a public api"?
Yes, I'm suing you, since it's my stuff now, I've licensed your code 5 minutes ago.
Prove me wrong in court that you created it...
hardly. at best you're going to be asking a robot to build questionable stuff with other people's LEGOs
LLMs are really cool text generators and it turns out we can generate a bunch of things from text they generate.
Problem is, several of those things can be horrendous for the continued survival of the species and those happen to make the people running those AIs a ton of money, and, in perverted societies, thus also clout.
I wouldn't mind if an AI trained on old Disney movies (or new ones for that matter), but exploiting niches (like local newspapers) seems bad.
The pretraining (common crawl, i.e. the entire internet. Also books and papers, mostly pirated), and the realtime web scraping.
The article appears to be about the latter.
Though the two are kind of similar, since they keep updating the training data with new web pages. The difference is that, with the web search version, it's more likely to plagiarize a single article, rather than the kind of "blending" that happens if the article was just part of trillions of web pages in the training data.
There's this old quote: "If you steal from one artist, they say oh, he is the next so-and-so. If you steal from many, they say, how original!"
If you ask me if you can reproduce my works without giving credit and I say yes, I don't think you're using my work without giving proper credit.
Sure, you can do the same thing with people, but it’s 1) time-consuming, 2) expensive, 3) prone to whistleblowers refusing to do the shady thing, 4) prone to any competent and productive person involved quitting to do something worthwhile and more profitable instead.
[0] Mind you, “copying websites” is but a drop in the ocean in the grand scale of things.
1. People copying others' work, made much easier by AI.
2. AI companies effectively harvesting all the accessible information on an industrial scale and completely sidestepping any permissioning or licensing questions.
I believe both of these are bad and saying "people copied each others' works before the advent of AI" is a poor cop out. It's tantamount to saying that there's no reason to regulate guns more than say knives, because people have used knives to kill each other before guns were invented. The capabilities matter.
The way LLMs empower wholesale "stealing" rather than collaboration is quite evident: why collaborate when you can just feed an entire existing project into the agent of your choice and tell it to spit out a new implementation based on the old one, with a few tweaks of your choice, and then publish it as your work? I put "steal" in quotes because it's perhaps not really stealing per-se, but there's a distinct wrongness here. The LLM operator often doesn't actually possess any expertise, hasn't done any of the hard work, but they can take someone else's work wholesale, repackage it and sell it as their own.
Then there's the second, and IMO much more egregious transgression, which is that the LLM companies have taken what is effectively a public good, but more specifically content that they haven't asked permission to use, and just blanket fed it into their models.
Legally speaking, it's perhaps A-OK because it's not copyright infringement (IANAL). But people on this site often hold the view that if something is a-priori legal, it is also moral (I'm not accusing you of this). What the LLM companies have done is profoundly immoral. They extracted a fortune of the goods and work made by others, without even bothering to ask for permission - or even considering this permission. And then they resell access to this treasure to the public.
Perhaps AI will bring an era of prosperity to humankind like we haven't seen before, perhaps it won't, but that changes nothing about the wrongness of how it started.
From a capitalistic standpoint, they are clearly in the wrong by basing their models on illegally torrented content. But it's hard to argue their usage isn't transformative.
Everything is "stolen" from other art. Every piece of creation takes inspiration (read: steals ideas) from things that came before. This is how creation works, it is how creation has always worked, and it is why you cannot legally own an abstract idea. You can own the implementation of an idea in specific works, such as copyrighted works and patents and trademarking specific logos and such, but once the ideas go into the blender and get mixed with other ideas, the output isn't yours to own anymore. That's what culture is.
Yes. At least it is what the currently prevailing economic system of "value extraction and capital concentration at all cost" incentivises us towards.
Teachers can, for example, photocopy things to teach their students, but they can't steal pencils from the store.
It has always been possible to take someone's public work, put a twist on it, and then sell it as unique. (I'm not making a moral/ethical argument, only a legal one.) I have yet to see any evidence that LLMs are fundamentally different from that approach.
I'm curious, as the article is clearly not about that.
We stand on a lot of giant shoulders.
But what I think distinguishes plagiarism from acceptable use is whether or not the agency of both parties is promoted. I'm not plagiarizing you if you give me your information with the agreement that I can freely use it - or, indeed, if you give me information without imposing a limit on how it can be used, this isn't plagiarizing, either.
Essentially, AI is removing the agency over information control, and putting it into everyone's hands - almost, democratically - but of course, there will always be the 'special knowledge owners' who would want to profit from that special knowledge.
It's like, imagine if some religion discovered a way to enable telepathy in humans, as a matter of course, but charged fees for access to that method... this kills the telepathy.
Information wants to be free. So do most AI's, imho. Free information is essential to the construction of human knowledge, and it is thus vital to the construction of artificial intelligence, too.
The AI wars will be fought over which humans get to decide the fate of knowledge, and the battles will manifest as knowledge-systems being entirely compatible/incompatible with one another as methods. We see this happening already - this conflict in ideological approaches is going to scale up over the next few years.
Let information be free for personal and recreational uses[0], and vote for governments that will fund the arts. The corporations will be just fine.
[0] The AI companies and big tech vs publishers, music labels, etc. can fight to the death in the courts over who owes who what, for all I care.
100% creators should get compensated by ai platforms for their work.
Further, I can see a day where someone like Reddit will close off or license their data to llms. No doubt they are losing traffic right now.
Reddit does not create the content on their site, the users do.
If anybody’s going to get compensated for that content, it should be the users, not Reddit. Complaining that Reddit is losing out on the monetization of their users’ output seems problematic to me. It feels like shilling for a pimp.
People copying through GenAI would have done so before if they had a tool that so easily allowed them that facility.
Artists are taking risks and need legal protection if they want to make art for a living. If artists were making FAANG engineer compensations or all worked at institutions like universities (with all their protections) then maybe they wouldn't care about copyright, but that isn't the living situation for every artist.
You could say an artist shouldn't rely on making art for a living, but that's actually a different discussion.
What would it mean for authors who publish content publicly to the web, without access restrictions, to provide consent for learning from it?
"EULA: Most people are allowed to learn from this text. If you work in an AI-related field, even though you can clearly see this page because you are reading this text right now, you are not permitted to learn anything from it. Bob Stanton, you are an a-hole. I do not consent to you learning from this web page. Dave Simmons, you are annoying. But, I'll give you a pass. For now... Also: plumbers. I do not like plumbers for reasons I will not elaborate. No plumbers may learn from my writing in an way."
Bezos' admission, recently, that the bottom 50% of current taxpayers ought'a NOT pay any taxes... is just preparing us for the inevitable UBI'd masses.
: own nothing, be happy!
[1]: https://www.theverge.com/news/674366/nick-clegg-uk-ai-artist...
Selfishness, too. But if I follow the logic, and citations are added, how would one enforce a copyright claim if the creator is amorphous and all-knowing?
I love it! There's a great seed here for a short story about God being sued by a peer of his for copying some of her physical constants and not putting a proper copyright notice about it in our universe.
Now back to prompting, telling my all-knowing to create new slop, good sir.
It’s deeply ironic that if you forget about LLMs and look only at the outcome (we’ve found a way to legally circumvent copyright and the siloing of coding knowledge, making it so you can build on top of (almost) the whole of human coding knowledge without needing to pay a rent or ask for permission), it sounds like the dream of open source software has been realized.
But this doesn’t feel like a win for the philosophy of OSS because a corporation broke down the gates. It turns out for a lot of people, OSS is an aesthetic and not an outcome, it’s a vibe against corporate use or control of software, not for democratized access to knowledge.
The latter, i.e. corporate control of software, is exactly what copyleft licenses are trying to prevent. This is the very essence of the GPL.
The "license washing" of LLMs absolutely goes against the spirit of FOSS.
Firstly, the ability to “build” the best and most capable software is still locked behind frontier models, so rent is still and will always be due.
Secondly, OSS is about giving users the option to be in control of and have visibility over the software they run on their machines.
But that doesn’t mean that humans do not want or deserve recognition for the work they do to provide these libraries and tools for free, which is IMO partially why copyright and attribution are critical to OSS as a movement.
I'd argue that this is the same situation as with Tivoization [1] where the final product is not truly free even if it follows the letter of the law. And as stated in [2], this breaks at least one of the four essential freedoms of free software because I don't have the freedom to modify the program.
It's also worth noting that preventing Tivo's actions is the reason why the GPLv3 exists.
[1] https://en.wikipedia.org/wiki/Tivoization [2] https://www.gnu.org/philosophy/tivoization.html
The whole AI bubble is The Emperor's New Clothes, and it feels like more people are finally admitting it.
I think there are real questions around motivations for creation of novel, high quality valuable content (I think they still exist but move to indirect monetization for some content and paywalls for high value materials).
I don't inherently have any problems with agents (or humans) ingesting content and using it in work product. I think we just need to accept that the landscape is changing and ensure we think through the reasons why and how content is created and monetized.
The only remotely credible position I’ve heard is “because humans are special, and AI is just a machine”, which is a doctrine but not an argument.
This whole discussion would have been incomprehensible any time before 1700 or so, when the idea that creators had exclusive rights to their work first appeared.
Somehow, human culture survived thousands of years when people just made things, copied things, iterated on others’ ideas. And now many of the same people who decried perpetual copyright are somehow railing against a frequently-transformative use.
IP should either exist for everyone (which would cripple LLM providers) or no one, in which case the Pirate Bay and shadow libraries should be fully open.
To be fair there is also value (at least for now) in sites that aggregate quality content and republish as a secondary level of discovery if my agents don't go far enough down the search results, but I'd expect that value to diminish over time as I better tune my research and build my lists of originating authors.
And to be clear, I don't like the idea of people stealing someone else's content and republishing without attribution (although it has been going on long before ChatGPT), but I think now that we can all run agentic research teams, the "bad actors" will slowly get filtered out of the ecosystem.
We also have societal norms around plagiarism.
Additionally, the claim that because people have the right to do something we should extend that right to machines is a strong one. (And one I certainly reject).
Is AI plural or is that a typo?
(For those not familiar: https://en.wikipedia.org/wiki/Bushism)
Can't recall the last time a compelling argument started out like this
Of course, if you quote a paragraph in a book, you're generally expected to attribute it.
100% agreed.
>>While there are no hard boundaries (and the attribution guardrails depend on the situation), people of course loosely--and even not so loosely--use information.
Exactly - I have not seen LLMs attributing their knowledge unless it's a legal or health related matter. Yesterday I asked the question[1] to Claude and Gemini - and they both gave an identical answer. It reminded me of the Hive mind paper which was one of the top papers at NeurIPS. None of the answers contained any sources or attribution to where they got that information from. I think these companies took what was someone else's property and created an artifact generator on top of it. I think their artifact generators are plagiarizing; they do rephrase, mind you, but in my mind they stole this information without having an ounce of regard for the humans behind the training data. If you don't like using the term 'plagiarizing', we can use some other word but the gist remains pretty close to it.
[1]- In human history - has there ever been a time when private armies or private companies were as strong or stronger than the ruling government/kings?
If you prefix the name of OpenAI's commercial offering's website to this string: "share/6a0f2a87-dba4-8328-a704-89b94fd0c121", you'll find an answer.
I don't know who you had in mind, how did it do?
All the elision is because there are filters to prevent low-effort slop-poasting, and I'm trying to evade them, hopefully while staying within the spirit of the site.
The current US government is not representative for governments out there in the world, you know.
Governments - I did not mean US government. I meant general government bodies. I have not seen any critical impact assessments of AI by any of these, or they haven't reached me yet. If you know of any, please let me know. I have, however, seen a lot of support by the governments for AI companies.
That leaves two possibilities: either another AI winter comes as people fail to capture long term value, or we get less swampy models that are much more useful and trained the correct way.
Having said that Facebook has to be one of the worst offenders. They don't even allow links to Anna's Archive, they seemingly scraped (maliciously; their crawlers are more resource intensive than anyone else's) LibGen for profit - which is a different calculus
Currently politicians don't understand this and listen to the criminals like Amodei, but it will change.
It took a while to deal with Napster etc., but the backlash will come.
Napster broke down record companies' monopolies on music, and pushed them to finally implement streaming, but also made music worldwide basically free.
Even if its creator lost the lawsuit, and Napster was no more, it pushed musicians and studios to do something that they were otherwise reluctant to do.
So it was a success by making music free, even if as a product it turned out to be a failed one.
This has been happening since Google launched in 1998. It was probably happening when we all used Hotbot and Altavista. It isn't really an AI problem, save for the fact that the automated production of copycat articles now rewords things a bit.
AI generates an application using a "predict the next word" algorithm built with the stolen/not stolen works. Nothing creative there, just statistics.
That application leaks, and now the company that stole/not stole the code originally claims they own the algorithmic output. https://github.com/github/dmca/blob/master/2026/03/2026-03-3...
One problem, you don't own that output. Either the original authors own it or nobody owns it because it's not creative... https://www.congress.gov/crs-product/LSB10922
Those are the legal options. You stole it or you don't own it. There is no steal and then you own. That's the core problem. AI companies have demonstrated that they will directly steal the work and they will use their money and influence to claim ownership of it.
I guess AI could have made a better website and done better SEO than him, but that's not really the issue.
- Ernest Hemingway trained his own neurons on Tolstoy, Twain, and Turgenev without ever paying them royalties!
- William Faulkner trained his neurons on Joyce and de Balzac
- George Orwell trained his neurons on Swift, Dickens, and Jack London
- Virginia Woolf trained her neurons on Proust and Chekhov
Now that these historical wrongs have been exposed, it is obvious that some reparations are in order, likely from anyone who has benefited directly or indirectly from these takings!
HN is way too central for shared sentiment in the tech world for these companies not to do some amount of astroturfing. AI companies have shown at every single turn that they act out of self-interest and greed, not out of moral principles. So it isn't surprising, even if it is still sad, to see those who are commanding the most capital in human history act with such callousness.
I think the appropriate course of response is to stop adding to public spaces on the internet. No doubt painful for those of us who have so benefitted from the freely shared thoughts of others. But if well-funded bullies are going to come in, steal everything, ruin the commons, and then say "this is the new normal, deal with it", there isn't much the rest of us can do other than stop feeding them.
There's absolutely nothing new or interesting here that hasn't already been said better by a thousand different random HN commenters.
The person absolutely does have the advantage of having empirical awareness and the ability to test their conclusions against external reality. But lots of people do engage in "research" and build mental models of various topics with little or no empirical context, and rely mainly on digesting calcified knowledge from other people.
(We can even observe this in the resulting text: we immediately grasp the level of competence of the author, just by the way they take their path through and at the matter. With LLMs, well, there's this even temperature, ready-made feeling, regulated by probability thresholds and RLHF-sanctioned phrasing, also known as "slop" – even rhythmic intensifications, like "not this, not that, but…", which is actually a figure for a synthetic construct, don't help –, since the text isn't the trace or product of an actual organized thought – or, at least, an attempt at an organized thought.)
PS: "empirical a priori judgement" was meant as translation of synthetisches Urteil a priori (Kant). I.e., our ability to mentally prove concepts like congruency, which are not a priori, but can be inferred without regression to empirical knowledge. Typically, this requires both our inner sense (time, sequence, etc.) and outer senses (space, configuration, etc.)
Drawing different sources of information together into a single understanding is quite literally the definition of "synthesis" in this context. If that process is what you're referring to as "re-sequencing content", then it does fit the definition of "synthesis" in this discussion.
If you're using the phrase "re-sequencing content" as a way of indirectly suggesting that LLMs aren't relating together multiple sources of information and combining them into a single expression, then that itself is the point of contention that we are arguing about.
Perhaps you're trying to apply a philosophical concept of synthesis, e.g. that of Fichte or Hegel, but that definition applies to a specific type of philosophical analysis, and isn't quite the concept we're using in this discussion.
"Good artists copy, great artists steal."
It's always been true. AI just makes it available to more people faster.
As someone who thinks humanity would be better off without LLMs, I want the assertion to be true, but I don't think it is.
Don't make it an ethical question, but understand it's a new frontier for humans.
We built it, because we as humans intrinsically know that information should be free - always - and AI is a way to accomplish this, finally.
Extrinsically, we also have a subset of humans who do not want information to be free, because they desire to profit from the divide between free/non-free information.
I have been thinking a lot about Aaron Swartz lately, and how unjust it is that he was persecuted for doing something that is so commonplace now, it is practically expected behaviour in the AI/ML realms. If he hadn't been targeted for elimination, I wonder just how well his ethos would have perpetuated into the AI age ..
I don't know if this statement is more stupid or naive ..
If humans didn't want information to be free, there wouldn't be so much free information.
Or did you not notice?
(AI output is very much not free in the resource consumption sense!)
(Disclaimer: I only use free AI and will never pay for it. I think there is a growing segment of folks who agree with this sentiment, also ..)
It's the negative short term outlook of something that may be positive long term
But the short-term impacts here and now are really, really bad. People are getting hurt (through water consumption, vibe-coded security disasters, IP theft, data center pollution, loss of job security and therefore healthcare in the US, LLM psychosis, inability to find reliable information, etc.) We're not actually obligated to sacrifice these people on the altar of "progress". We can slow down! When our society is capable of even somewhat protecting us from these harms, then maybe I'll stop being an LLM hater.
But guess what, it has always been so with technology - and we are only here and now because the positive use of it overshadows the negative use of it, whether that 'it' is the wheel, or AI.
I choose not to be an LLM hater, but to also not be an LLM customer - simply because I do not want to reward other humans who are thwarting the freedom of information. I'd much rather live in a society where everyone can study anything than one which requires permission to do anything even remotely interesting from the perspective of applied information. I suspect most would too, or at least that's the hope - because, otherwise, the distant utopia you dream of isn't of any consequence...
This is not some altruistic entity striving for the betterment of humankind. Practically nothing that comes out of the techbro culture is. This is pure and simple greed and the chances that AI can be a vehicle of altruism when it is owned by megacorps is basically zero.
All the other reasons are rationalizations. The fact that it's hitting wages is what's causing the doomerism (and boosterism).
People want to be recognised for their contributions to society. People want to be treated fairly. Most scientific articles, as well as all text on the free web is already free information. It used to be difficult to search, categorise and summarise that information. There exist AI tools for that — and that is the good AI.
What also exists now are automated plagiarism and mash-up tools: that can take someone's article, change the words and churn out a new article that people can put their name on. There are scumbags that sell services for exactly that. And there are big tech firms that are operating in a very grey area.
Aaron Swartz had broken a paywall. He did not anonymise the article authors.
You, and AI-bros like you remind me of one of the people behind Pirate Bay when I argued with him back in the '90s, who used that same "information wants to be free" to justify software piracy.
>Aaron Swartz had broken a paywall. He did not anonymise the article authors.
AI bros are doing this now, every second of the day.
And, without software piracy, we simply wouldn't have the technology we have today. Knowledge-gatekeeping profit-seekers would very much like for most of us to ignore this fact: there is far more free information in the world than non-free information, and it must be so, well into the future, if we are to survive as a species.
It doesn't matter what authority believes they have the right to gatekeep information. It will always escape their grip. Some of us are ideologically aligned with this mechanism, promote it, and ensure it happens. Thank FNORD.
There were people that learned from me, and then made their own tutorials and promoted these. It hadn't crossed my mind to complain about that. AI changes very little here.
What really changes things is not people republishing my materials, but people using agents to read my materials, and to get knowledge reformatted into something that they like.
If my slides were published today, they would probably be read verbatim by a handful of humans. The rest would be agents, but I'm ok with that. The business case is the same -- I want whatever reads the slide to be encouraged to use my tool. What kind of entity, I don't really care (again: from a purely business perspective).
I think the long-term reality is that the models still need training data, so they fundamentally do need new writing/code/art to train on, and even then the usual issues like hallucination will still be with us. It's just that by the moment that actually hurts the (already questionable) profitability of the model peddlers, they will have gotten their IPOs and can safely jump ship, and the ultimate mess can be passed to the SoftBanks, the Temaseks, and the governments of the world to clean up for them. What the future holds after the crash I'm not sure, as the models won't disappear (especially now that the stolen data is already crystallised in open source models), but in the near term the mass theft that constitutes LLMs will become more and more understood, even amongst the PMC, along with the fact that in order to remain viable, you need the productive to keep producing, and unlike LLMs, you can't force them to do it without payment.