The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data.
Or was it that the record companies got to sue individuals for astronomical amounts of made-up damages for every song potentially shared?
Which one was it?
...when did that ever happen? The post-Napster-but-pre-BitTorrent era (coincidentally, the same period as the BonziBuddy era) was when Morpheus, KaZaA, eDonkey, Limewire, et cetera were relevant, and they got away with it, in part, by denying they had any ability to moderate their users' file-sharing; there was no "submitting of every song" to an exclusion list because there was no exclusion list or filtering in the first place.
That's bloody brilliant. If you don't want us to scrape your content, please send us your content with all of the training data already provided so we will know not to scrape it if we come across it in the wild. FFS
"Oh I'm not groping you today? No worries, I'll be back tomorrow."
Here's one: https://stopsexualviolence.iu.edu/policies-terms/consent.htm...
the trick is to come back tomorrow, but with a rusty and jagged metal mousetrap hidden in one's underwear... and a camera for posterity, and some witnesses to come point-and-laugh at the perp.
You can bolt on new functional modules and train them with very limited data you acquire from Unreal Engine or in the field.
And I think it should even apply retroactively so that they have to retrain their models that are already generating works from training data consumed without permission. Of course, OpenAI would fight that tooth & nail but they put themselves in this position with a clear “take first ask permission later” mentality.
Like selling it for money seems like a clear line crossed, and Etsy is the perfect gatekeeper here.
They don't, in that they'll ban you for it once you're big enough.
Granted, that was more the exception than the rule...
Anything that used to be freely available but no longer is. Once upon a time Laudanum (tincture of opium) was the OTC painkiller of choice. In slightly more recent times, there's asbestos. In certain locales, gambling. There are countries that have reined in lootboxes.
> It feels like legislature exists to make money happy.
Come on now, it doesn't just "feel" that way, you know for a fact that is indeed the purpose of the modern US legislature.
It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.
1) human artists are legal persons and capable of being held liable in civil court for copyright infringement; having a machine with no legal standing do the copyright infringement should be forbidden because it is difficult to detect, impossible to avoid, and a legal nightmare to unravel.
2) human artists are capable of understanding what flowers, Jesus on the cross, waterfalls, etc actually are, whereas DALL-E is much dumber than a lizard and not capable of understanding these things, so using the verb "learning" to describe both is extremely misleading. DALL-E is a statistical process which is barely more sophisticated than linear regression compared to a human brain. It is plain wrong to say stuff like this:
> It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.
when nobody has even come close to figuring that out! If DALL-E worked like a human artist it would know what a bicycle is: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... But it doesn't. It is a plagiarism machine that knows how to match "bicycle" with millions of images having a "bicycle" tag, and uses statistics to smooth things together.
It is an absurd leap we've made but companies are also legal persons.
The companies are still of human design, full of human behaviour and human characteristics, while the LLMs actively try to imitate humans.
The dictionary definition: anthropomorphize: to attribute human characteristics or behaviour to (a god, animal, or object).
If it passes the Turing test surely anthropomorphizing is fair game?
(I have no stake in this)
The problem is not "when learning", the problem is "when distributing". Courts will determine whether or not disseminating or giving access to a model trained on protected works counts as distributing protected derivative works or not.
Technically making a copy to bring home for your own use is also problematic, just much less likely to get you into trouble. (Still a step removed from learning the skills and technique of making a copy, however.)
When it takes decades to develop an art style that a machine can copy in days, and then churn out derivative variations in seconds, it's no longer a level playing field. The machine can dramatically under-cut the artist who developed their style, much more than a copycat human artist could. This does become not just a threat to the livelihoods of artists, but also a disincentive to the development of new art styles.
In this case, patent law may be an apt comparison for the world we're entering. Patent law was developed with the idea in mind that it is a problem if a human competitor could simply take an invention, learn how it works, and then mass produce copies of it. There are several reasons for this, including creating an incentive for technology development, and also expediently transitioning IP to the public domain. But patents were added to the legal system basically because otherwise an inventor would not be on a level playing field with the competition, because it takes so many more resources to develop a new invention than to produce clones.
Existing IP law was built in a world where it was believed that machines were inherently incapable of learning and mass-producing new artistic works using styles learned from artists. It was not necessary to protect artists from junior artists learning how to work in their style, as long as it wasn't a forgery. But in a world of machine learning, perhaps we will decide it's reasonable to protect artists from machine copycats, just like we decided it was reasonable to protect technology inventors from human copycats.
The patent system is not the right implementation; it's expensive to file a patent, and you need skilled lawyers to determine novelty, infringement, and so on. But for art and machine learning, it might be much simpler: a mandatory compensation for artists' work used as training data. Something like this is sometimes used in the music industry to determine royalties for radio broadcasting, or to account for copies spread by file sharing.
People allowed (and encouraged) read access to websites so Google would index and link. Now Google et al. summarise and even generate. All of that is built on our collective output. Surely everyone deserves a cut? The free sharing licenses that were added to repos didn't account for LLMs, so we should revisit them so all creators get their dues, not just those who traditionally got paid.
A training method for some authors who want to adopt an older artist's voice is to literally rewrite their novels. Word for word. They will go through an entire author's catalogue and reproduce it, so that they can learn to mimic them when creating something new.
You go ahead and automate the process, and suddenly the world is ending.
Ditto all other kinds of art. Heck I knew of 3 living artists doing this to each other in real time.
Hunter Thompson literally sat down and typed out every word of Hemingway's novels so he could figure out what good writing feels like.
Why is he allowed to do it in private, but an LLM isn't?
I understand why it's useful and popular for training LLMs, but I didn't think it was applicable to generative image/video work.
Eventually, for how many of these AI companies would a person have to track down opt-out processes just to protect their work from AI? That's crazy.
OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work. How they are getting away with this is beyond me.
Copyright is intended to prevent everyone from copying a person's work. That's a very different thing.
OpenAI would be the company actually committing the infringement and providing the copy in order to satisfy the request.
If the law suddenly worked the other way around, companies would no longer be able to prosecute people for hosting pirated content online, because the responsibility would lie with the users choosing to initiate the download.
Legally, you'd struggle to prove any form of infringement happened. Making a copy is fine. Distributing copies is what infringes. You'd need to prove that is happening.
That's why there aren't a lot of court cases from pissed off copyright holders with deep pockets demanding compensation.
It should. The 'free and open internet' is finished because nobody is going to want to subject their IP to rampant laundering that makes someone else rich.
Tragedy of the commons.
Note that humans use someone else's IP to get rich all the time. E.g. Doctors reading medical textbooks.
Arguing "for the sake of argument" is a coward's way of expressing an unpopular opinion in public. Join a debate club if you're actually being genuine.
That said, machines don't have natural rights, and you don't get to use them to violate mine.
You need a better example; a textbook was created with the exact purpose of sharing knowledge with the reader.
My second point: if you write a poem and I read it, memorize it, then publish it as my own with some slight changes, wouldn't you be upset?
If I take your painting, then use a script to apply a small filter to it, then sell it as my own, is this legal? Is my script "creative"?
These AIs are not really creative; they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used, but in other cases it was shown exactly which source was used and just copy-pasted into the answer.
I feel the problem with analogizing to humans while trying to make a point against unlicensed machine learning is that applying the same moral/legal rules as we do to humans to generative models (learning is not infringement, output is only infringement if it's a substantially similar copy of a protected work, and infringement may still be covered by fair use) would be a very favorable outcome for machine learning.
> they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used
Even if you actually interpolated some set of inputs (which is not how diffusion models or transformers work), without substantial similarity to a protected work you're in the clear.
> is my script "creative"? [...] This AIs are not really creative [...]
There's no requirement for creativity - even traditional algorithms can make modifications such that the result lacks substantial similarity and thus is not copyright infringement, or is covered by fair use due to being transformative.
Courts are slow, so it seems like nothing is happening, but there’s tons of cases in the pipeline.
The media industry has forced many tech firms to bend the knee, OpenAI will follow suit. Nobody rips off Disney IP and lives to tell the tale.
I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".
Similarly, I can't take a sample of a Taylor Swift song and use it myself in my own music - I have to give Taylor credit, and probably some portion of the revenue too.
There's also still the issue that some LLMs and (I believe) image generation AI models have regurgitated works from their training models - in whole or part.
If you don't replicate Warhol's painting entirely, then you are fine. It's original work.
The number of Scifi novels I read that are just an older concept reimagined with more modern characters is huge.
>I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".
In most sane jurisdictions you can sample other work. Consider collage. It is usually a fair use exemption outside of the USA. If LLMs cause keyboard warriors to develop some seppocentric mindvirus leading to the destruction of collage I will be pissed.
>There's also still the issue that some LLMs and (I believe) image generation AI models have regurgitated works from their training models - in whole or part.
Considered a high-priority bug and stamped out. Usually it's in part because a feature is common to all of an artist's work, like their signature.
This is a hilarious choice of artist given that Warhol is FAMOUS for appropriating work of others without payment, modifying it in some way, and then turning around and selling it for tons of money. That was the entire basis of a lot of his artistic practice. There was even a Supreme Court case about it.
It should.
How would you like to be sued by your favorite author because you wrote some fan fiction in a similar style?
No, training a model should be no more violating copyright than training your own brain.
We are not magical Jesus boxes. We are evolved machines, that just happen to be based on a carbon substrate, not a silicon one.
There is nothing special about us.
This is the problem of thinking that everyone “has” to do something.
I assure you that I (and you) can use someone else’s work without asking for permission.
Will there be consequences? Perhaps.
Is the risk of the consequences enough to get me to ask for permission? Perhaps.
Am I a nice enough guy to feel like I should do the right thing and ask for permission? Perhaps.
Is everyone like me? No.
> How they are getting away with this is beyond me.
Is it really beyond you?
I think it’s pretty clear.
They’re powerful enough that the political will to hold them accountable is nonexistent.
Because the work being done, from the point of view of people who believe they are on the verge of creating AGI, is arguably more important than copyright.
Less controversially: if the courts determine that training an ML model is not fair use, then anyone who respects copyright law will end up with an uncompetitive model. As will anyone operating in a country where the laws force them to do so. So don't expect the large players to walk away without putting up a massive fight.
If you feel that what you're doing is that important, you're not going to let copyright law get in the way, and it would be silly to expect you to.
For another, the o1-pro (and presumably o3) models are not "underwhelming" except to those who haven't tried them, or those who have an axe to grind. Serious progress is being made at an impressive pace... but again, it isn't coming for free.
The only change they are motivated by is their bank balances. If this were a less useful tool they’d still be motivated to ignore laws and exploit others.
Obviously it's a highly-commercial endeavor, which is why they are trying so hard to back away from the whole non-profit concept. But that's largely orthogonal to the question of whether they feel they are doing things for the benefit of humanity that are profound enough to justify blowing off copyright law.
Especially given that only HN'ers are 100% certain that training a model is infringement. In the real world, this is not a settled question. Why worry about obeying laws that don't even exist yet?
It isn't.
> There have been signs of cultlike behavior before, such as the way the rank and file instantly lined up behind Altman when he was fired.
This only reinforces that the real drive is money.
This is exactly why people are against it.
Your argument is that there is no definitive law, and therefore the creators of the data you scrape for training, and their wishes, are irrelevant.
If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save and we’d hear more about nontrivial uses.
Correct, that is the position of the law. Here in America, we don't take the position, held in many other countries, that everything not explicitly permitted is forbidden. This is a good thing.
If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save
Whether it is permissible to train models with copyrighted content is up to the courts and Congress, not us. Until then, no one's toes are being stepped on. Everybody whose work was used to train the models still holds the same rights to that work that they held before.
And yet artists don’t feel like their work should be used for training.
I’m not sure how you can argue that the intentions are unknowable, when clearly you and the AI companies don’t care about the people whose work they have to use to train their models and these people’s wishes. Motivation is greed.
The law isn't really all that interested in how "artists feel." Neither am I, as you've surmised. The artists don't care how I feel, so it would be kind of weird for me to hold any other position.
In any case, copyright maximalism impoverishes us all.
The tech is neat, there is value in a sense, LLMs are a fun tech. They are not going to invent AGI with LLMs.
An AI that has enough sense of self-awareness to not hallucinate and to recognize the borders of its knowledge on its own. That is fundamentally impossible to do with LLMs because in the end, they are all next-token predictors while humans are capable of a much more complex model of storing and associating information and context, and most importantly, develop "mental models" from that information and context.
And anyway, there are other tasks than text generation. Take autonomous driving for example: a driver of a car sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context - is the person crossing in front of the car some old granny on a walker or a soccer player? Or a human sees a ball being kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars", but an object detection/classification model lacks the ability to even recognize the ball as being a potentially relevant piece of context.
These are just post-hoc rationalizations. No-one making those split-second decisions under those circumstances has those chains-of-thoughts. The brain doesn't 'think' that fast.
>but an object detection/classification model lacks the ability to even recognize the ball as being a potentially relevant piece of context.
We're talking about LLMs, right? They can make these sorts of inferences.
It could be possible to use LLMs to build a Rube Goldberg type of brain, or something that will mimic a human brain, but it will have the same flaws LLMs have and will never reach parity with humans. I think AGI is possible, but we're too focused on LLMs to get there yet.
For the second one, AI drivers like Tesla's current version are already skipping object detection/classification and instead use deep learning on the entire video frame, and could absolutely use the ball or any other context to change behavior, even without the particular internal monologue described there.
It's not entirely clear that this is meaningful. Humans engage in confabulation, too.
As a tool LLMs are fantastic, and I'm glad to look at them solely as powerful tools. AGI is not here yet, and maybe that's a good thing. Who would want some kind of artificial intelligence that is capable of understanding us, that is capable of using psychological tricks on people, that could have different goals than us, and so on?
> Confabulators are usually unaware they are providing false information. They often display genuine surprise or confusion when evidence of facts contradicts their statements.
This is similar to LLMs actually. But it also seems like various "System 2" things like chain of thought could compensate for this issue in the LLM (and that possibly that is similar to how the brain works).
I'm not sure this is the case at all. Some awareness of this doesn't imply full awareness. In my experience, most people are unaware of how incoherent their worldviews are, so the distinction between normative and confabulatory behavior isn't clear.
We're all wasting time and resources on what basically amounts to alchemy while we could tackle real problems.
Tech solutionists keep making promises for the next 5-10-20 years and never deliver: AI, electric planes, clean fuel, autonomous cars, the metaverse, brain implants. You'd expect the internet would have made people smarter, but we fall for the same snake oil as 100 years ago, en masse this time.
I think the headline is too generous here. More accurate would be "OpenAI neglects to deliver opt-out system..."
All their investors stand to profit handsomely (if they live).
And if so, do you have a citation for that?
Seems like there's an argument that model weights are a derivative work of the training data, at least if the model is capable of producing output that would be ruled to be such a derivative work given minimal prompting.
Although it may not work with photography since the model might just almost exclusively learn how the object of the photo looks in general and how photos work in general, rather than memorizing anything about specific photos.
It would seem more coherent to argue that a model output could be a derivative work, though it would need to include a significant portion of some given source. But even then, since the copyright office's position is that they're not copyrightable, I'm not sure they could qualify.
So if model weights don't infringe, that would also imply that saving an image as a JPG or a video using AV1 doesn't infringe, which would effectively imply that copyright doesn't apply to images or videos on the web. That is not current law/policy, so I think that reasoning cannot possibly work.
In contrast a derivative work is one creative expression that contains elements of another - like when you take an image and add commentary, or draw your own addition onto it, etc. And I'm pointing out that a trained model is not that - it's not itself a copyrightable expressive work. (We could think of it as a kind of algorithm for generating works, but algorithms aren't copyrightable.)
It is equally obvious what the latter gained and the former lost in the process.
We, with our books, have successfully prevented people from educating themselves with amazing implications. Now the challenge is to create equally impotent machines!
You have no further questions :)
Now imagine direct thought moderation. After all, thoughts belong to people? I thought it first? You can't just... It is clear we should control your thoughts. We can't just have you think random things. It would be like TikTok! Or like reading books! Terrifying!
We are quite used to the man behind the curtain deciding everything for us. At what point would the deal get too absurd, I wonder? Would 1984 eventually become a really boring book? Would it exist at all? Would people save up social credits to read it?
Other civilizations must have tried all possible variations with rather predictable results. To a free mind I mean.
Or are we already puppets on a string? How much am I boring you with this? Should I be allowed?
Que?
No, really... what?
It seems to me any civilization in the history of the cosmos will inevitably reach a stage where it has to choose to make knowledge available in order to solve problems.
One should only have to type the title of a book, then get to browse around for a bit, send a link to someone, etc.
Anything else is suicidal nonsense.
Tax hard-working people to pay to defend dead people's pixels from copying?
No one knows who or what an author is if there even is one. If I generate or write by hand all word combinations I don't get to own them.
Enforcement is much too expensive for normal people, if one even notices the copying. They just get to pay for it.
An elaborate scheme in order to not solve problems, not innovate and not progress.
Just a GPL-esque idea I've been musing on lately [0]; I'd appreciate any feedback from actual IP lawyers. The idea is to add a poison pill: if a company "steals" your content for their own profit, you can strike back by making it very hard for them to actually monetize the results. Since it's a kind of contract, it doesn't rely on how much of your work seems to be surfacing in a particular output.
So supposing ArtTheft Inc. snarfs up Jane Doe's paintings from her blog, she--or any similar victim--can declare that they grant the world an almost-public license to anything ArtTheft Inc. has made based on that model. If this happens ArtTheft Inc. could still make some money selling physical prints, but anyone else could undercut them with free or cheaper copies.
Do you have any more substantive critique? It sounds like you're trying to argue that the terms would be found unconscionable. However, it's not asking for any payment, or even any effort-taking action: it's just saying that the site owner provides content on the condition that if you incorporate that content into a generative product, the site owner gets to use the results too. Clearly the people hoovering up training data believe my work has some economic value to them, or they wouldn't be running a giant web crawler hitting every page of the blog--it's not as if they're arriving out of boredom, or because they followed some opaque hyperlink in curiosity.
"Copyright doesn't stop me from X" is different from "copyright lets me do X even though I agreed to a contract saying I wouldn't." (I have many problems with modern click/shrink-wrap, but that's a whole 'nother can of worms and I'm just trying to "fight fire with fire" here.)
If the average ToS has no force, then HN is currently infringing on my copyright by showing this post to you.
Has anyone managed to hit Google or Yahoo with a TOS violation?
There was hiQ Labs v. LinkedIn but that focused on whether it was unauthorized access under the CFAA.
In X Corp. v. Bright Data Ltd., a quick skim suggests ExTwitter's ToS lost because (A) it wasn't really the owner of the content and (B) they couldn't easily show harm.
IANAL again, but for personal blog, (A) is unlikely to apply, and (B) could be shown if ArtTheft Inc. starts causing legal fees by threatening the blogger for exercising the re-licensing in the ToS.
Everyone gets big mad when someone with money acts like Aaron Swartz did. The only bad thing about OpenAI is that they're not actually open sourcing or open accessing their stuff. Mistral or Llama "training on pirated material" is literally a feature, not a bug and the tears from all the artists and others who get mad are delicious. These same artists would profess literal radical marxism but become capitalist luddite copyright trolls the moment that the means of intellectual production became democratized against their will.
If you posted something on the internet, I can and will put it into ipadapter and take your style and use it for my own interests. You cannot stop me except by not posting it where I can access it. That is the burden of posting anything on the public internet. You opt out by not doing it.
There was a great article posted here last year rounding up all the various courts who upheld ownership for prompters of LLM output, including China (Possibly twice)
If recombining data from images in a way that leaves not a single trace of any original violates fair use, then fair use ceases to exist. There is hardly a fairer use. Any existing fair use outcome involves actual recognizable elements of the original work. There aren't really two directions on this. The damage that success for the anti-AI folk would do to IP law is tremendous.
High doubt any of them will be good stewards of anything but selfishness.
As for the others, they were all smart, passionate, dedicated folks who knew Sam was a complete narcissist and left to start their own AI startups.
(sorry mods, I’m upset and I’m annoyed OpenAI is getting away with murder of society in plain view)