The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data.
Or was it that the record companies got to sue individuals for astronomical amounts of made-up damages for every song potentially shared?
Which one was it?
...when did that ever happen? The post-Napster-but-pre-BitTorrent era (coincidentally, the same period as the BonziBuddy era) was when Morpheus, KaZaA, eDonkey, Limewire, et cetera were relevant, and they got away with it, in part, by denying they had any ability to moderate their users' file-sharing; there was no "submitting of every song" to an exclusion list because there was no exclusion list or filtering in the first place.
That's bloody brilliant. If you don't want us to scrape your content, please send us your content with all of the training data already provided so we will know not to scrape it if we come across it in the wild. FFS
"Oh I'm not groping you today? No worries, I'll be back tomorrow."
Here's one: https://stopsexualviolence.iu.edu/policies-terms/consent.htm...
the trick is to come back tomorrow, but with a rusty and jagged metal mousetrap hidden in one's underwear... and a camera for posterity, and some witnesses to come point-and-laugh at the perp.
You can bolt on new functional modules and train them with very limited data you acquire from Unreal Engine or in the field.
And I think it should even apply retroactively so that they have to retrain their models that are already generating works from training data consumed without permission. Of course, OpenAI would fight that tooth & nail but they put themselves in this position with a clear “take first ask permission later” mentality.
Like selling it for money seems like a clear line crossed, and Etsy is the perfect gatekeeper here.
They don't, in that they'll ban you for it once you're big enough.
Granted, that was more the exception than the rule...
Anything that used to be freely available but no longer is. Once upon a time Laudanum (tincture of opium) was the OTC painkiller of choice. In slightly more recent times, there's asbestos. In certain locales, gambling. There are countries that have reined in lootboxes.
> It feels like legislature exists to make money happy.
Come on now, it doesn't just "feel" that way, you know for a fact that is indeed the purpose of the modern US legislature.
It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.
1) human artists are legal persons and capable of being held liable in civil court for copyright infringement; having a machine with no legal standing do the copyright infringement should be forbidden because it is difficult to detect, impossible to avoid, and a legal nightmare to unravel.
2) human artists are capable of understanding what flowers, Jesus on the cross, waterfalls, etc actually are, whereas DALL-E is much dumber than a lizard and not capable of understanding these things, so using the verb "learning" to describe both is extremely misleading. DALL-E is a statistical process which is barely more sophisticated than linear regression compared to a human brain. It is plain wrong to say stuff like this:
> It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.
when nobody has even come close to figuring that out! If DALL-E worked like a human artist it would know what a bicycle is: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... But it doesn't. It is a plagiarism machine that knows how to match "bicycle" with millions of images having a "bicycle" tag, and uses statistics to smooth things together.
It is an absurd leap we've made but companies are also legal persons.
The companies are still of human design, full of human behaviour and human characteristics, while the LLMs actively try to imitate humans.
The dictionary definition: anthropomorphize: to attribute human characteristics or behaviour to (a god, animal, or object).
If it passes the Turing test surely anthropomorphizing is fair game?
(I have no stake in this)
The problem is not "when learning", the problem is "when distributing". Courts will determine whether or not disseminating or giving access to a model trained on protected works counts as distributing protected derivative works or not.
Technically making a copy to bring home for your own use is also problematic, just much less likely to get you into trouble. (Still a step removed from learning the skills and technique of making a copy, however.)
When it takes decades to develop an art style that a machine can copy in days, and then churn out derivative variations in seconds, it's no longer a level playing field. The machine can dramatically under-cut the artist who developed their style, much more than a copycat human artist could. This does become not just a threat to the livelihoods of artists, but also a disincentive to the development of new art styles.
In this case, patent law may be an apt comparison for the world we're entering. Patent law was developed with the idea in mind that it is a problem if a human competitor could simply take an invention, learn how it works, and then mass produce copies of it. There are several reasons for this, including creating an incentive for technology development, and also expediently transitioning IP to the public domain. But patents were added to the legal system basically because otherwise an inventor would not be on a level playing field with the competition, because it takes so many more resources to develop a new invention than to produce clones.
Existing IP law was built in a world where it was believed that machines were inherently incapable of learning and mass-producing new artistic works using styles learned from artists. It was not necessary to protect artists from junior artists learning how to work in their style, as long as it wasn't a forgery. But in a world of machine learning, perhaps we will decide it's reasonable to protect artists from machine copycats, just like we decided it was reasonable to protect technology inventors from human copycats.
The patent system is not the right implementation; it's expensive to file a patent, and you need skilled lawyers to determine novelty, infringement, and so on. But for art and machine learning, it might be much simpler: a mandatory compensation for artists' work used as training data. Something like this is sometimes used in the music industry to determine royalties for radio broadcasting, or to account for copies spread by file sharing.
People allowed (and encouraged) read access to websites so Google would index and link. Now Google et al. summarise and even generate. All of that is built on our collective output. Surely everyone deserves a cut? The free sharing licenses that were added to repos didn't account for LLMs, so we should revisit them so all creators get their dues, not just those who traditionally got paid.
A training method for some authors who want to adopt an older artist's voice is to literally rewrite their novels. Word for word. They will go through an entire author's catalogue and reproduce it, so that they can learn to mimic them when creating something new.
You go ahead and automate the process, and suddenly the world is ending.
Ditto all other kinds of art. Heck I knew of 3 living artists doing this to each other in real time.
Hunter Thompson literally sat down and typed out every word of Hemingway's novels so he could figure out what good writing feels like.
Why is he allowed to do it in private, but an LLM isn't?
I understand why it's useful and popular for training LLMs, but I didn't think it was applicable to generative image/video work.
Eventually, for how many of these AI companies would a person have to track down opt-out processes just to protect their work from AI? That's crazy.
OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work. How they are getting away with this is beyond me.
Copyright is intended to prevent everyone from copying a person's work. That's a very different thing.
OpenAI would be the company actually committing the infringement and providing the copy in order to satisfy the request.
If the law suddenly worked the other way around, companies would no longer be able to prosecute people for hosting pirated content online, because the responsibility would lie with the users choosing to initiate the download.
Legally, you'd struggle to prove any form of infringement happened. Making a copy is fine. Distributing copies is what infringes. You'd need to prove that is happening.
That's why there aren't a lot of court cases from pissed off copyright holders with deep pockets demanding compensation.
It should. The 'free and open internet' is finished because nobody is going to want to subject their IP to rampant laundering that makes someone else rich.
Tragedy of the commons.
Note that humans use someone else's IP to get rich all the time. E.g. Doctors reading medical textbooks.
Arguing "for the sake of argument" is a coward's way of expressing an unpopular opinion in public. Join a debate club if you're actually being genuine.
That said, machines don't have natural rights, and you don't get to use them to violate mine.
You need a better example; a textbook was created with the exact purpose of sharing knowledge with the reader.
My second point: if you write a poem and I read it, memorize it, then publish it as my own with some slight changes, wouldn't you be upset?
If I take your painting, then use a script to apply a small filter to it, then sell it as my own, is this legal? Is my script "creative"?
These AIs are not really creative; they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used, but in other cases it was shown exactly which source was used and just copy-pasted into the answer.
I feel the problem with analogizing to humans while trying to make a point against unlicensed machine learning is that applying the same moral/legal rules as we do to humans to generative models (learning is not infringement, output is only infringement if it's a substantially similar copy of a protected work, and infringement may still be covered by fair use) would be a very favorable outcome for machine learning.
> they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used
Even if you actually interpolated some set of inputs (which is not how diffusion models or transformers work), without substantial similarity to a protected work you're in the clear.
> is my script "creative"? [...] This AIs are not really creative [...]
There's no requirement for creativity - even traditional algorithms can make modifications such that the result lacks substantial similarity and thus is not copyright infringement, or is covered by fair use due to being transformative.
Courts are slow, so it seems like nothing is happening, but there’s tons of cases in the pipeline.
The media industry has forced many tech firms to bend the knee, OpenAI will follow suit. Nobody rips off Disney IP and lives to tell the tale.
I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".
Similarly, I can't take a sample of a Taylor Swift song and use it myself in my own music - I have to give Taylor credit, and probably some portion of the revenue too.
There's also still the issue that some LLMs and (I believe) image generation AI models have regurgitated works from their training models - in whole or part.
If you don't replicate Warhol's painting entirely, then you are fine. It's original work.
The number of Scifi novels I read that are just an older concept reimagined with more modern characters is huge.
>I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".
In most sane jurisdictions you can sample other work. Consider collage. It is usually a fair use exemption outside of the USA. If LLMs cause keyboard warriors to develop some seppocentric mindvirus leading to the destruction of collage I will be pissed.
>There's also still the issue that some LLMs and (I believe) image generation AI models have regurgitated works from their training models - in whole or part.
Considered a high-priority bug and stamped out. Usually it's in part because a feature is common to all of an artist's work, like their signature.
This is a hilarious choice of artist given that Warhol is FAMOUS for appropriating work of others without payment, modifying it in some way, and then turning around and selling it for tons of money. That was the entire basis of a lot of his artistic practice. There was even a Supreme Court case about it.
It should.
How would you like to be sued by your favorite author because you wrote some fan fiction in a similar style?
No, training a model should be no more violating copyright than training your own brain.
We are not magical Jesus boxes. We are evolved machines, that just happen to be based on a carbon substrate, not a silicon one.
There is nothing special about us.
This is the problem of thinking that everyone “has” to do something.
I assure you that I (and you) can use someone else’s work without asking for permission.
Will there be consequences? Perhaps.
Is the risk of the consequences enough to get me to ask for permission? Perhaps.
Am I a nice enough guy to feel like I should do the right thing and ask for permission? Perhaps.
Is everyone like me? No.
> How they are getting away with this is beyond me.
Is it really beyond you?
I think it’s pretty clear.
They’re powerful enough that the political will to hold them accountable is nonexistent.
Because the work being done, from the point of view of people who believe they are on the verge of creating AGI, is arguably more important than copyright.
Less controversially: if the courts determine that training an ML model is not fair use, then anyone who respects copyright law will end up with an uncompetitive model. As will anyone operating in a country where the laws force them to do so. So don't expect the large players to walk away without putting up a massive fight.
If you feel that what you're doing is that important, you're not going to let copyright law get in the way, and it would be silly to expect you to.
For another, the o1-pro (and presumably o3) models are not "underwhelming" except to those who haven't tried them, or those who have an axe to grind. Serious progress is being made at an impressive pace... but again, it isn't coming for free.
The only change they are motivated by is their bank balances. If this were a less useful tool they’d still be motivated to ignore laws and exploit others.
Obviously it's a highly-commercial endeavor, which is why they are trying so hard to back away from the whole non-profit concept. But that's largely orthogonal to the question of whether they feel they are doing things for the benefit of humanity that are profound enough to justify blowing off copyright law.
Especially given that only HN'ers are 100% certain that training a model is infringement. In the real world, this is not a settled question. Why worry about obeying laws that don't even exist yet?
It isn't.
> There have been signs of cultlike behavior before, such as the way the rank and file instantly lined up behind Altman when he was fired.
This only reinforces that the real drive is money.
This is exactly why people are against it.
Your argument is that there is no definitive law, and therefore the creators of the data you scrape for training, and their wishes, are irrelevant.
If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save and we’d hear more about nontrivial uses.
Correct, that is the position of the law. Here in America, we don't take the position, held in many other countries, that everything not explicitly permitted is forbidden. This is a good thing.
If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save
Whether it is permissible to train models with copyrighted content is up to the courts and Congress, not us. Until then, no one's toes are being stepped on. Everybody whose work was used to train the models still holds the same rights to that work that they held before.
And yet artists don’t feel like their work should be used for training.
I’m not sure how you can argue that the intentions are unknowable, when clearly you and the AI companies don’t care about the people whose work they have to use to train their models and these people’s wishes. Motivation is greed.
The law isn't really all that interested in how "artists feel." Neither am I, as you've surmised. The artists don't care how I feel, so it would be kind of weird for me to hold any other position.
In any case, copyright maximalism impoverishes us all.
The tech is neat, there is value in a sense, LLMs are a fun tech. They are not going to invent AGI with LLMs.
An AI that has enough sense of self-awareness to not hallucinate and to recognize the borders of its knowledge on its own. That is fundamentally impossible to do with LLMs because in the end, they are all next-token predictors while humans are capable of a much more complex model of storing and associating information and context, and most importantly, develop "mental models" from that information and context.
And anyway, there are other tasks than text generation. Take autonomous driving for example: a driver of a car sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context - is the person crossing in front of the car some old granny on a walker or a soccer player? Or a human sees a ball being kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars", but an object detection/classification model lacks the ability to even recognize the ball as being a potentially relevant piece of context.
These are just post-hoc rationalizations. No-one making those split-second decisions under those circumstances has those chains-of-thoughts. The brain doesn't 'think' that fast.
>but an object detection/classification model lacks the ability to even recognize the ball as being a potentially relevant piece of context.
We're talking about LLMs, right? They can make these sorts of inferences.
It could be possible to use LLMs to build a Rube Goldberg type of brain, or something that will mimic a human brain, but it will have the same flaws LLMs have and will never reach parity with humans. I think AGI is possible, but we're too focused on LLMs to get there yet.
For the second one, AI drivers like Tesla's current version are already skipping object detection/classification and instead use deep learning on the entire video frame, and could absolutely use the ball or any other context to change behavior, even without the particular internal monologue described there.
It's not entirely clear that this is meaningful. Humans engage in confabulation, too.
As a tool LLMs are fantastic, and I'm glad to look at them solely as powerful tools. AGI is not here yet, and maybe that's a good thing. Who would want some kind of artificial intelligence that is capable of understanding us, that is capable of using psychological tricks on people, that could have different goals than us, and so on?
> Confabulators are usually unaware they are providing false information. They often display genuine surprise or confusion when evidence of facts contradicts their statements.
This is similar to LLMs actually. But it also seems like various "System 2" things like chain of thought could compensate for this issue in the LLM (and that possibly that is similar to how the brain works).
I'm not sure this is the case at all. Some awareness of this doesn't imply full awareness. In my experience, most people are unaware of how incoherent their worldviews are, so the distinction between normative and confabulatory behavior isn't clear.
We're all wasting time and resources on what basically amounts to alchemy while we could tackle real problems.
Tech solutionists keep making promises for the next 5-10-20 years and never deliver: AI, electric planes, clean fuel, autonomous cars, the metaverse, brain implants. You'd expect the internet would have made people smarter, but we fall for the same snake oil as 100 years ago, en masse this time.
I think the headline is too generous here. More accurate would be "OpenAI neglects to deliver opt-out system..."
All their investors stand to profit handsomely (if they live).
And if so, do you have a citation for that?
Seems like there's an argument that model weights are a derivative work of the training data, at least if the model is capable of producing output that would be ruled to be such a derivative work given minimal prompting.
Although it may not work with photography since the model might just almost exclusively learn how the object of the photo looks in general and how photos work in general, rather than memorizing anything about specific photos.
It would seem more coherent to argue that a model output could be a derivative work, though it would need to include a significant portion of some given source. But even then, since the copyright office's position is that they're not copyrightable, I'm not sure they could qualify.
So if model weights don't infringe, that would also imply that saving an image as a JPG or a video using AV1 doesn't infringe, which would effectively imply that copyright doesn't apply to images or videos on the web. That is not current law/policy, so I think that reasoning cannot possibly work.
In contrast a derivative work is one creative expression that contains elements of another - like when you take an image and add commentary, or draw your own addition onto it, etc. And I'm pointing out that a trained model is not that - it's not itself a copyrightable expressive work. (We could think of it as a kind of algorithm for generating works, but algorithms aren't copyrightable.)
It is equally obvious what the latter gained and the former lost in the process.
We, with our books, have successfully prevented people from educating themselves with amazing implications. Now the challenge is to create equally impotent machines!
You have no further questions :)
Now imagine direct thought moderation. After all, thoughts belong to people? I thought it first? You can't just... It is clear we should control your thoughts. We can't just have you think random things. It would be like TikTok! Or like reading books! Terrifying!
We are quite used to the man behind the curtain deciding everything for us. At what point would the deal get too absurd, I wonder? Would 1984 eventually become a really boring book? Would it exist at all? Would people save up social credits to read it?
Other civilizations must have tried all possible variations with rather predictable results. To a free mind I mean.
Or are we already puppets on a string? How much am I boring you with this? Should I be allowed?
Que?
No, really... what?
It seems to me any civilization in the history of the cosmos will inevitably reach a stage where it has to choose to make knowledge available in order to solve problems.
One should only have to type the title of a book, then get to browse around for a bit, send a link to someone, etc.
Anything else is suicidal nonsense.
Tax hard-working people to pay to defend dead people's pixels from copying?
No one knows who or what an author is if there even is one. If I generate or write by hand all word combinations I don't get to own them.
Enforcement is much too expensive for normal people, if one even notices the copying. They just get to pay for it.
An elaborate scheme in order to not solve problems, not innovate and not progress.
Just a GPL-esque idea I've been musing on lately [0]; I'd appreciate any feedback from actual IP lawyers. The idea is to add a poison pill: if a company "steals" your content for their own profit, you can strike back by making it very hard for them to actually monetize the results. Since it's a kind of contract, it doesn't rely on how much of your work seems to be surfacing in a particular output.
So supposing ArtTheft Inc. snarfs up Jane Doe's paintings from her blog, she--or any similar victim--can declare that they grant the world an almost-public license to anything ArtTheft Inc. has made based on that model. If this happens ArtTheft Inc. could still make some money selling physical prints, but anyone else could undercut them with free or cheaper copies.
Do you have any more substantive critique? It sounds like you're trying to argue that the terms would be found unconscionable. However, it's not asking for any payment, or even any effort-taking action: it's just saying that the site owner provides content on the condition that if you incorporate that content into a generative product, the site owner gets to use the results too. Clearly the people hoovering up training data believe my work has some economic value to them, or they wouldn't be running a giant web crawler hitting every page of the blog--it's not as if they're arriving out of boredom, or because they followed some opaque hyperlink in curiosity.
"Copyright doesn't stop me from X" is different from "copyright lets me do X even though I agreed to a contract saying I wouldn't." (I have many problems with modern click/shrink-wrap, but that's a whole 'nother can of worms and I'm just trying to "fight fire with fire" here.)
If the average ToS has no force, then HN is currently infringing on my copyright by showing this post to you.
Has anyone managed to hit Google or Yahoo with a TOS violation?
There was hiQ Labs v. LinkedIn but that focused on whether it was unauthorized access under the CFAA.
In X Corp. v. Bright Data Ltd., a quick skim suggests ExTwitter's ToS lost because (A) it wasn't really the owner of the content and (B) they couldn't easily show harm.
IANAL again, but for personal blog, (A) is unlikely to apply, and (B) could be shown if ArtTheft Inc. starts causing legal fees by threatening the blogger for exercising the re-licensing in the ToS.
Everyone gets big mad when someone with money acts like Aaron Swartz did. The only bad thing about OpenAI is that they're not actually open sourcing or open accessing their stuff. Mistral or Llama "training on pirated material" is literally a feature, not a bug and the tears from all the artists and others who get mad are delicious. These same artists would profess literal radical marxism but become capitalist luddite copyright trolls the moment that the means of intellectual production became democratized against their will.
If you posted something on the internet, I can and will put it into ipadapter and take your style and use it for my own interests. You cannot stop me except by not posting it where I can access it. That is the burden of posting anything on the public internet. You opt out by not doing it.
There was a great article posted here last year rounding up all the various courts who upheld ownership for prompters of LLM output, including China (Possibly twice)
If recombining data from images in a way that leaves not a single trace of any original violates fair use, then fair use ceases to exist. There is hardly a fairer use. Any existing fair use outcome involves actual recognizable elements of the original work. There aren't really two directions on this. The damage that success for the anti-AI folk would do to IP law is tremendous.
High doubt any of them will be good stewards of anything but selfishness.
As for the others, they were all smart, passionate, dedicated folks who knew Sam was a complete narcissist and left to start their own AI startups.
(sorry mods, I’m upset and I’m annoyed OpenAI is getting away with murder of society in plain view)