199 points by onetokeoverthe 2 days ago | 17 comments
  • toddmorey 2 days ago
    No way OpenAI will ever “good citizen” this. Tools to opt out of training sets will only come if they are legally compelled. I think governments will have to make respecting some sort of training-preference header on public content mandatory.

    The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data.
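
    For reference, the closest thing that exists today is voluntary, per-crawler robots.txt blocks. A sketch (GPTBot and Google-Extended are the currently documented crawler tokens for OpenAI and Google respectively, but tokens change, and compliance is entirely at the crawler's discretion):

      User-agent: GPTBot
      Disallow: /

      User-agent: Google-Extended
      Disallow: /

    Even that is opt-out, per-crawler, and unenforceable, which is exactly why a mandatory preference header would have to come from regulators.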

    • andrei_says_ 2 days ago
      Reminds me of the time when p2p music sharing became popular and the record companies had to submit every song they did not want to get shared along with an explanation to every person who had Napster installed.

      Or was it that the record companies got to sue individuals for astronomical amounts of made-up damages for every song potentially shared?

      Which one was it?

      • DaiPlusPlus 2 days ago
        > and the record companies had to submit every song they did not want to get shared along with an explanation to every person who had Napster installed.

        ...when did that ever happen? The post-Napster-but-pre-BitTorrent era (coincidentally, the same time period as the BonziBuddy era) was when Morpheus, KaZaA, eDonkey, Limewire, et cetera were relevant, and they got away with it, in part, by denying they had any ability to moderate their users' file sharing; there was no "submitting of every song" to an exclusion list, because there was no exclusion list or filtering in the first place.

        • orra 2 days ago
          This almost got me too, but you missed GP’s point. See their next paragraph beginning “Or”. (It's about the double standard for individual versus corporate copyright infringement.)
    • recursivecaveat 2 days ago
      100%. Like most opt-outs, this exists as a checklist feature that proponents can point to and hopefully use to convince bystanders. You muddy the waters by allowing someone to, with great effort, technically achieve the thing they want (maybe, for now), until you close it in two years and everyone says "well, that makes sense, nobody used that feature anyways".
    • dylan604 2 days ago
      > The fact that photographers have to independently submit each piece of work they want excluded, along with detailed descriptions, just shows how much they DON'T want anyone excluding content from their training data

      That's bloody brilliant. If you don't want us to scrape your content, please send us your content with all of the training data already provided so we will know not to scrape it if we come across it in the wild. FFS

      • nicbou 2 days ago
        The tech industry’s understanding of consent is terrifying.
        • dylan604 2 days ago
          “Understanding” is a curious choice of words. I’d have gone with “total disregard”.
          • frizlab 2 days ago
            Or even “contempt”?
        • isoprophlex 2 days ago
          Mirrors that of a sexual predator.

          "Oh I'm not groping you today? No worries, I'll be back tomorrow."

          • nicbou 19 hours ago
            This is something I frequently point out. If someone understood consent the way tech companies do, they'd be banned from a few bars. Look up "rules for consent" and think about how consensual your relationship with tech companies is.

            Here's one: https://stopsexualviolence.iu.edu/policies-terms/consent.htm...

          • DaiPlusPlus 2 days ago
            > "Oh I'm not groping you today? No worries, I'll be back tomorrow."

            The trick is to come back tomorrow, but with a rusty and jagged metal mousetrap hidden in one's underwear... and a camera for posterity, and some witnesses to come point and laugh at the perp.

        • Obscurity4340 2 days ago
          It mirrors the rest of society's lack of understanding of consent. Sunrise, sunset
      • stonogo 2 days ago
        They learned from Google, which to this day requires you to suffix your Wi-Fi network name with _NOMAP if you do not want it to be used by their mapping services.
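
        A minimal sketch of what honoring that convention amounts to on the collector's side (hypothetical helper, not Google's actual code):

          def should_map(ssid: str) -> bool:
              # Google's documented opt-out: SSIDs ending in "_NOMAP" are to be
              # excluded from location-service databases.
              return not ssid.endswith("_NOMAP")

        The burden is fully inverted: every network owner has to learn a magic suffix, rather than the collector asking first.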
    • htrp 2 days ago
      Sounds like they want photographers to do the data labeling for them...
    • echelon 2 days ago
      Insofar as data for diffusion / image / video models are concerned, the rise of synthetic data and data efficiency will mean that none of this really matters anyway. We were just in the bootstrapping phase.

      You can bolt on new functional modules and train them with very limited data you acquire from Unreal Engine or in the field.

      • toddmorey 2 days ago
        I don't entirely agree. For example, it's a very popular scheme on Etsy right now to use LLMs to generate posters in the style of popular artists. Any artist should be able to say: hey, I don't want my works to be part of your training set, powering derivative generations.

        And I think it should even apply retroactively, so that they have to retrain models that are already generating works from training data consumed without permission. Of course, OpenAI would fight that tooth and nail, but they put themselves in this position with a clear “take first, ask permission later” mentality.

        • pj_mukh 2 days ago
          Dumb question: why does Etsy allow clearly reproduced/copied works? AI or not.

          Like selling it for money seems like a clear line crossed, and Etsy is the perfect gatekeeper here.

          • jsheard 2 days ago
            Etsy stopped caring a while ago; it was supposed to be a marketplace specifically for selling handmade items, but they allowed it to be overrun with mass-produced tat dropshipped direct from the factory. Turning a blind eye to plagiarism, with or without AI, is just the next logical step from there.
          • girvo 2 days ago
            > Why does Etsy allow clearly reproduced/copied works

            They don't, in that they'll ban you for it once you're big enough

        • protocolture 2 days ago
          Style isn't protected?
        • tomrod 2 days ago
          Impossible to put the toothpaste back in the tube.
          • forgetfreeman 2 days ago
            A motivated legislature with skilled enforcement personnel could get the toothpaste back in the tube in short order provided they displayed an anomalous insensitivity to making the money sad.
            • botanical76 2 days ago
              When has something like this ever happened? It feels like legislature exists to make money happy.
              • Terr_ 2 days ago
                One big example involved making a lot of US plantation money sad and scared, so much so that it started a civil war.

                Granted, that was more the exception than the rule...

              • maeil 2 days ago
                > When has something like this ever happened?

                Anything that used to be freely available but no longer is. Once upon a time, laudanum (tincture of opium) was the OTC painkiller of choice. In slightly more recent times, there's asbestos. In certain locales, gambling. There are countries that have reined in loot boxes.

                > It feels like legislature exists to make money happy.

                Come on now, it doesn't just "feel" that way, you know for a fact that is indeed the purpose of the modern US legislature.

                • tomrod 2 days ago
                  A single consumable is easy to ban. Widely distributed computer binaries aren't.
                  • maeil a day ago
                    Doesn't seem too different from the loot boxes example.
        • llm_trw 2 days ago
          Should any artist be able to tell another artist: hey, don't copy my work when you're learning, I don't want competition?

          It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.

          • aithrowawaycomm 2 days ago
            There are two major differences between art generators and human artists:

            1) Human artists are legal persons and capable of being held liable in civil court for copyright infringement; having a machine with no legal standing do the infringing should be forbidden, because it is difficult to detect, impossible to avoid, and a legal nightmare to unravel.

            2) Human artists are capable of understanding what flowers, Jesus on the cross, waterfalls, etc. actually are, whereas DALL-E is much dumber than a lizard and not capable of understanding these things, so using the verb "learning" to describe both is extremely misleading. Compared to a human brain, DALL-E is a statistical process barely more sophisticated than linear regression. It is plain wrong to say stuff like this:

            > It seems like they are deeply upset someone has figured out a way for a machine to do what artists have been doing since time immemorial.

            when nobody has even come close to figuring that out! If DALL-E worked like a human artist, it would know what a bicycle is: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr... But it doesn't. It is a plagiarism machine that knows how to match "bicycle" with millions of images carrying a "bicycle" tag, and uses statistics to smooth things together.

          • JambalayaJimbo 2 days ago
            LLMs are not humans and shouldn't be anthropomorphized as a strategy to get around copyright infringement.
            • protocolture 2 days ago
              Absolutely correct, LLMs are code. If that code ingests data without precisely replicating it, that's fair use and the end of the discussion.
            • Terr_ 2 days ago
              And if they do get anthropomorphized... then the people in charge of that company need to be charged with the heinous crime of enslaving children.
            • econ 18 hours ago
              I've been, uh, suffering(?) a different perspective. A company may hire a human to talk for it, or its owner may talk for it; the server rack and the software are things it can own. I don't think it's necessarily other people talking for it any longer.

              It is an absurd leap we've made but companies are also legal persons.

              The companies are still of human design full of human behaviour and human characteristics while the LLMs actively try to imitate humans.

              The dictionary definition of "anthropomorphize": to attribute human characteristics or behaviour to (a god, animal, or object).

              If it passes the Turing test surely anthropomorphizing is fair game?

              (I have no stake in this)

            • llm_trw 2 days ago
              That's an opinion that will be tested in a court soon enough.
          • zdragnar 2 days ago
            My mom said the teachers in her painting classes would have the students recreate works and were very clear on which artists had given permission for those derivative works to be sold. Others they could only admire at home.

            The problem is not "when learning", the problem is "when distributing". Courts will determine whether or not disseminating or giving access to a model trained on protected works counts as distributing protected derivative works or not.

            • Terr_ 2 days ago
              > The problem is not "when learning", the problem is "when distributing".

              Technically making a copy to bring home for your own use is also problematic, just much less likely to get you into trouble. (Still a step removed from learning the skills and technique of making a copy, however.)

          • mitthrowaway2 2 days ago
            This analogy seems to be made every time this comes up on HN, but I don't think it really holds water. First of all, when a human artist learns from another, it's an inherently level playing field: the junior and the senior are both human, and neither is going to be 1,000,000x more productive than the other. So the senior artist really doesn't have that much to worry about. And the senior artist recognizes that they themselves were once a junior and had to learn from their seniors, so it's a debt paid forward which results in more art. And becoming a master artist, or even an imitator, takes dedication and hard work, even with lots of artwork to learn from, so that keeps competition to a certain level.

            When it takes decades to develop an art style that a machine can copy in days, and then churn out derivative variations in seconds, it's no longer a level playing field. The machine can dramatically under-cut the artist who developed their style, much more than a copycat human artist could. This does become not just a threat to the livelihoods of artists, but also a disincentive to the development of new art styles.

            In this case, patent law may be an apt comparison for the world we're entering. Patent law was developed with the idea in mind that it is a problem if a human competitor could simply take an invention, learn how it works, and then mass produce copies of it. There are several reasons for this, including creating an incentive for technology development, and also expediently transitioning IP to the public domain. But patents were added to the legal system basically because otherwise an inventor would not be on a level playing field with the competition, because it takes so many more resources to develop a new invention than to produce clones.

            Existing IP law was built in a world where it was believed that machines were inherently incapable of learning and mass-producing new artistic works using styles learned from artists. It was not necessary to protect artists from junior artists learning how to work in their style, as long as it wasn't a forgery. But in a world of machine learning, perhaps we will decide it's reasonable to protect artists from machine copycats, just like we decided it was reasonable to protect technology inventors from human copycats.

            The patent system is not the right implementation; it's expensive to file a patent, and you need skilled lawyers to determine novelty, infringement, and so on. But for art and machine learning, it might be much simpler: a mandatory compensation for artists' work used as training data. Something like this is sometimes used in the music industry to determine royalties for radio broadcasting, or to account for copies spread by file sharing.

            • richardw 2 days ago
              Surely all this applies to code and the written word in general?

              People allowed (and encouraged) read access to websites so Google would index and link. Now Google et al. summarise and even generate. All of that is built on our collective output. Surely everyone deserves a cut? The free-sharing licenses that were added to repos didn't account for LLMs, so we should revisit them so all creators get their dues, not just those who traditionally got paid.

              • mitthrowaway2 2 days ago
                (Yes, for what it's worth I agree with this!)
          • ncallaway 2 days ago
            When a human does it, it's fine.

            When OpenAI's servers do it, it's copyright infringement.

            We don't apply copyrights to human brains, but we do apply copyright to computer memory.

            • 6510 2 days ago
              Not sure whether to pull out my racism card or to help burn the printing press.
          • protocolture 2 days ago
            I don't know why you are getting downvotes; you are absolutely correct.

            A training method for some authors who want to adopt an older author's voice is to literally rewrite their novels. Word for word. They will go through an entire author's catalogue and reproduce it, so that they can learn to mimic it when creating something new.

            You go ahead and automate the process, and suddenly the world is ending.

            Ditto all other kinds of art. Heck, I knew of three living artists doing this to each other in real time.

            • llm_trw 2 days ago
              HN is not only filled with people who never learned how to code, it's filled with people who never learned how to write either.

              Hunter S. Thompson literally sat down and typed out Hemingway's A Farewell to Arms, word for word, so he could figure out what good writing feels like.

              Why is he allowed to do it in private, but an LLM isn't?

              • benwad 2 days ago
                Because Hunter S. Thompson is a talented writer, and an LLM is a statistical method for generating text that looks like other text. An LLM isn't going to ingest all of Hemingway and write Hell's Angels.
                • llm_trw 2 days ago
                  > You must be this talented for fair use to apply to you.
      • simonw 2 days ago
        Has synthetic data become a big part of image/video models?

        I understand why it's useful and popular for training LLMs, but I didn't think it was applicable to generative image/video work.

        • llm_trw 2 days ago
          I haven't had the chance to train diffusion models, but for detection models synthetic data is absolutely how you get state-of-the-art performance now. You just need a relatively tiny, extremely high-quality dataset to bootstrap from.
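
          For example, a common trick is copy-paste compositing: take a small set of clean object crops, paste them onto varied backgrounds, and the bounding-box labels come for free. A minimal sketch (PIL-based; toy inputs assumed, and the crop is assumed smaller than the background):

            import random
            from PIL import Image

            def synthesize(background: Image.Image, crop: Image.Image):
                """Paste one object crop onto a background at a random spot.
                The bounding box is known by construction, so no human labeling."""
                bg = background.copy()
                x = random.randint(0, bg.width - crop.width)
                y = random.randint(0, bg.height - crop.height)
                # Use the crop's alpha channel as a paste mask when available
                bg.paste(crop, (x, y), crop if crop.mode == "RGBA" else None)
                return bg, (x, y, x + crop.width, y + crop.height)

          The tiny high-quality real dataset is then typically used to fine-tune and close the sim-to-real gap.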
      • toddmorey 2 days ago
        For clarity, I do agree that synthetic data is huge for training AI to do certain tasks or skills. But I don't think creative work generation is powered by synthetic data, and it may not be for quite a while.
      • numpad0 2 days ago
        Isn't that just weird cope? I mean, why not just have an LLM automate UE if that's the goal, and how isn't that itself going to get torpedoed by Epic?
  • oraphalous 2 days ago
    I don't even understand why it's everyone else's problem to opt out.

    Eventually, how many of these AI companies would a person have to track down opt-out processes for, just to protect their work from AI? That's crazy.

    OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work. How they are getting away with this is beyond me.

    • munchler 2 days ago
      Copyright doesn't prevent anyone from "using" a person's work. You can use copyrighted material all day long without a license or penalty. In particular, anyone is allowed to learn from copyrighted material by reading, hearing, or seeing it.

      Copyright is intended to prevent everyone from copying a person's work. That's a very different thing.

      • soared 2 days ago
        There is an argument to be made that ChatGPT mildly rewording/misquoting info directly from my blog is copying.
        • Aeolun 2 days ago
          And it is. And you can sue them for that. What you can’t do is get upset they (or their AI) read it.
        • munchler 2 days ago
          Sure, but that's a different claim and a different argument.
        • amiantos 2 days ago
          I think to make that argument you would need evidence that someone prompted ChatGPT to reword/misquote info directly from your blog, at which point the argument would be that that person is rewording/misquoting your blog, not ChatGPT.
          • Terr_ 2 days ago
            I don't think so: The user is merely making a request for copyrighted material, which is not itself infringing, even if their request was extremely specific and their intent was obvious.

            OpenAI would be the company actually committing the infringement and providing the copy in order to satisfy the request.

            If the law suddenly worked the other way around, companies would no longer be able to prosecute people for hosting pirated content online, because the responsibility would lie with the users choosing to initiate the download.

        • jillesvangurp 2 days ago
          That would fall under fair use.

          Legally, you'd struggle to prove any form of infringement happened. Making a copy is fine. Distributing copies is what infringes. You'd need to prove that is happening.

          That's why there aren't a lot of court cases from pissed off copyright holders with deep pockets demanding compensation.

      • 23B1 2 days ago
        > Copyright doesn't prevent anyone from "using" a person's work.

        It should. The 'free and open internet' is finished because nobody is going to want to subject their IP to rampant laundering that makes someone else rich.

        Tragedy of the commons.

        • munchler 2 days ago
          I can see this both ways. For the sake of argument, please explain why using IP to train an AI is evil, but using the same IP to train a human is good.

          Note that humans use someone else's IP to get rich all the time. E.g. Doctors reading medical textbooks.

          • Larrikin 2 days ago
            Is the AI allowed to decide, unprompted, how to spend the money? Can it decide that it doesn't like the people who made it and donate the money to charity? Can the AI start its own company and refuse to hire anyone who made it? Can the AI decide that it prefers the open Internet and will answer all questions for free?

            The sake of argument is a coward's way of expressing an unpopular opinion in public. Join a debate club if you're actually being genuine.

          • 23B1 a day ago
            I never used the word evil.

            That said, machines don't have natural rights, and you don't get to use them to violate mine.

          • simion314 2 days ago
            >Note that humans use someone else's IP to get rich all the time. E.g. Doctors reading medical textbooks.

            You need a better example; a textbook was created with the exact purpose of sharing knowledge with the reader.

            My second point: if you write a poem, and I read it and memorize it, then publish it as my own with some slight changes, wouldn't you be upset?

            If I take your painting, then use a script to apply a small filter to it, then sell it as my own, is that legal? Is my script "creative"?

            These AIs are not really creative; they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used, but in other cases it was shown exactly which source was used and just copy-pasted into the answer.

            • Ukv 2 days ago
              > My second point: if you write a poem, and I read it and memorize it, then publish it as my own with some slight changes, wouldn't you be upset?

              I feel the problem with analogizing to humans while trying to make a point against unlicensed machine learning is that applying the same moral/legal rules as we do to humans to generative models (learning is not infringement, output is only infringement if it's a substantially similar copy of a protected work, and infringement may still be covered by fair use) would be a very favorable outcome for machine learning.

              > they just mix inputs and then interpolate an answer. In some cases you can't guess what input image/text was used

              Even if you actually interpolated some set of inputs (which is not how diffusion models or transformers work), without substantial similarity to a protected work you're in the clear.

              > Is my script "creative"? [...] These AIs are not really creative [...]

              There's no requirement for creativity - even traditional algorithms can make modifications such that the result lacks substantial similarity and thus is not copyright infringement, or is covered by fair use due to being transformative.

          • 23B1 2 days ago
            scale
        • amiantos 2 days ago
          Under this mentality, every search engine index would be shut down.
    • griomnib 2 days ago
      Napster had a moment too, but then they got steamrolled in court.

      Courts are slow, so it seems like nothing is happening, but there’s tons of cases in the pipeline.

      The media industry has forced many tech firms to bend the knee, OpenAI will follow suit. Nobody rips off Disney IP and lives to tell the tale.

      • tiahura 2 days ago
        If your business model depends on the Roberts court kneecapping AI, pivot. Training does not constitute "copying" under copyright law, because it involves the creation of intermediate, non-expressive data abstractions that do not reproduce or communicate the copyrighted work's original expression. This process aligns with fair-use principles, as it is transformative, serves a distinct purpose (machine-learning innovation), and does not usurp the market for the original work.
        • paranoidrobot 2 days ago
          I believe there are some other issues beyond just "is it transformative".

          I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".

          Similarly, I can't take a sample of a Taylor Swift song and use it myself in my own music - I have to give Taylor credit, and probably some portion of the revenue too.

          There's also still the issue that some LLMs and (I believe) image-generation AI models have regurgitated works from their training data, in whole or in part.

          • protocolture 2 days ago
            > I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work. I have some obligation to say "Yeah, I used a Warhol painting as the basis for it".

            If you don't replicate Warhol's painting entirely, then you are fine. It's original work.

            The number of Scifi novels I read that are just an older concept reimagined with more modern characters is huge.

            > Similarly, I can't take a sample of a Taylor Swift song and use it myself in my own music - I have to give Taylor credit, and probably some portion of the revenue too.

            In most sane jurisdictions you can sample other work. Consider collage. It is usually a fair use exemption outside of the USA. If LLMs cause keyboard warriors to develop some seppocentric mind virus leading to the destruction of collage, I will be pissed.

            > There's also still the issue that some LLMs and (I believe) image-generation AI models have regurgitated works from their training data, in whole or in part.

            Considered a high-priority bug and stamped out. Usually it's in part because a feature is common to all of an artist's work, like their signature.

          • dwallin 2 days ago
            > I can't take an Andy Warhol painting, modify it in some way and then claim it's my own original work.

            This is a hilarious choice of artist given that Warhol is FAMOUS for appropriating work of others without payment, modifying it in some way, and then turning around and selling it for tons of money. That was the entire basis of a lot of his artistic practice. There was even a Supreme Court case about it.

        • mitthrowaway2 2 days ago
          There was a time when it did not usurp the market for the original work, but as the technology improves and becomes more accessible, that seems to be changing.
        • toddmorey 2 days ago
          In my experience, when existing laws allow an outcome that causes significant enough harm to groups with influence, the laws get changed.
        • 23B1 2 days ago
          > Training does not constitute "copying" under copyright law

          It should.

          • caeril 2 days ago
            Absolute F-tier reasoning here. Your own biological neural net is trained on a huge corpus of copyrighted works.

            How would you like to be sued by your favorite author because you wrote some fan fiction in a similar style?

            No, training a model should be no more violating copyright than training your own brain.

            • 23B1 a day ago
              Absolute F-tier reasoning here. LLMs (and the companies that build them) aren't people.
              • caeril a day ago
                Ah, so you subscribe to the Magical Jesus Box hypothesis of mind?

                We are not magical Jesus boxes. We are evolved machines, that just happen to be based on a carbon substrate, not a silicon one.

                There is nothing special about us.

                • 23B1 a day ago
                  Laws are made by people, for people, not for Star Trek nerd fantasies co-opted by techbros who buy Greubel Forseys.
      • llm_trw 2 days ago
        And yet Mickey Mouse is in the public domain. Something those of us who remember the 90s thought would never happen.
        • timcobb 2 days ago
          Just the oldest Mickey. They gave up on it because the cost/benefit wasn't deemed worth it anymore.
    • paulcole 2 days ago
      > OpenAI should be contacting every single one and asking for permission - like everyone has to in order to use a person's work

      This is the problem of thinking that everyone “has” to do something.

      I assure you that I (and you) can use someone else’s work without asking for permission.

      Will there be consequences? Perhaps.

      Is the risk of the consequences enough to get me to ask for permission? Perhaps.

      Am I a nice enough guy to feel like I should do the right thing and ask for permission? Perhaps.

      Is everyone like me? No.

      > How they are getting away with this is beyond me.

      Is it really beyond you?

      I think it’s pretty clear.

      They’re powerful enough that the political will to hold them accountable is nonexistent.

    • CamperBob2 2 days ago
      > I don't even understand why it's everyone else's problem to opt out.

      Because the work being done, from the point of view of people who believe they are on the verge of creating AGI, is arguably more important than copyright.

      Less controversially: if the courts determine that training an ML model is not fair use, then anyone who respects copyright law will end up with an uncompetitive model. As will anyone operating in a country where the laws force them to do so. So don't expect the large players to walk away without putting up a massive fight.

      • SketchySeaBeast 2 days ago
        Of note here: the reason it's "important" is that it will make a shit-ton of money.
        • CamperBob2 2 days ago
          That, coupled with the obvious ideological motivations. Success could alter the course of human history, maybe even for the better.

          If you feel that what you're doing is that important, you're not going to let copyright law get in the way, and it would be silly to expect you to.

          • SketchySeaBeast 2 days ago
            I can't say I believe that. If that were the case, they'd focus more on results and less on hyping up the next underwhelming generation.
            • CamperBob2 2 days ago
              For one thing, they are focused on money because they need lots of it to do what they're doing.

              For another, the o1-pro (and presumably o3) models are not "underwhelming" except to those who haven't tried them, or those who have an axe to grind. Serious progress is being made at an impressive pace... but again, it isn't coming for free.

          • 2muchcoffeeman 2 days ago
            Oh please. OpenAI, and I guess every other AI company, are for-profit.

            The only change they are motivated by is their bank balances. If this were a less useful tool they’d still be motivated to ignore laws and exploit others.

            • CamperBob2 2 days ago
              Hard to say what motivates them, from the outside looking in. There have been signs of cultlike behavior before, such as the way the rank and file instantly lined up behind Altman when he was fired. You don't see that at Boeing or Microsoft.

              Obviously it's a highly-commercial endeavor, which is why they are trying so hard to back away from the whole non-profit concept. But that's largely orthogonal to the question of whether they feel they are doing things for the benefit of humanity that are profound enough to justify blowing off copyright law.

              Especially given that only HN'ers are 100% certain that training a model is infringement. In the real world, this is not a settled question. Why worry about obeying laws that don't even exist yet?

              • maeil 2 days ago
                > Hard to say what motivates them, from the outside looking in.

                It isn't.

                > There have been signs of cultlike behavior before, such as the way the rank and file instantly lined up behind Altman when he was fired.

                This only reinforces that the real drive is money.

              • 2muchcoffeeman 2 days ago
                > Especially given that only HN'ers are 100% certain that training a model is infringement. In the real world, this is not a settled question. Why worry about obeying laws that don't even exist yet?

                This is exactly why people are against it.

                Your argument is that there is no definitive law; therefore the creators of the data you scrape for training, and their wishes, are irrelevant.

                If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save, and we’d hear more about nontrivial uses.

                • CamperBob2 2 days ago
                  > Your argument is that there is no definitive law; therefore the creators of the data you scrape for training, and their wishes, are irrelevant.

                  Correct, that is the position of the law. Here in America, we don't take the position, held in many other countries, that everything not explicitly permitted is forbidden. This is a good thing.

                  > If the motivation was to help humanity, they’d think twice about stepping on the toes of the humanity they want to save

                  Whether it is permissible to train models with copyrighted content is up to the courts and Congress, not us. Until then, no one's toes are being stepped on. Everybody whose work was used to train the models still holds the same rights to that work that they held before.

                  • > Until then, no one's toes are being stepped on. Everybody whose work was used to train the models still holds the same rights to that work that they held before.

                    And yet artists don’t feel like their work should be used for training.

                    I’m not sure how you can argue that the intentions are unknowable when clearly you and the AI companies don’t care about the wishes of the people whose work they have to use to train their models. The motivation is greed.

                    • CamperBob2 a day ago
                      > And yet artists don’t feel like their work should be used for training.

                      The law isn't really all that interested in how "artists feel." Neither am I, as you've surmised. The artists don't care how I feel, so it would be kind of weird for me to hold any other position.

                      In any case, copyright maximalism impoverishes us all.

  • griomnib 2 days ago
    I think it’s safe to assume anything Sam A says is an outright lie by now.
    • maeil 2 days ago
      It's depressing that this understanding hasn't been the status quo for years now. It's not like this is his first gig, it's been publicly verifiable what kind of person he is for ages, long before GPT became famous. You don't need to be part of some insider Silicon Valley cabal to find out.
      • hashxyz 2 days ago
        Can you back that up with anything? I’ve gotten this as a vague sense, but it seems hard to find much actual background about how he manages to continuously fail upward.
  • dgfitz 2 days ago
    Eventually the headline will be just the first two words.

    The tech is neat, there is value in a sense, LLMs are a fun tech. They are not going to invent AGI with LLMs.

    • wilg 2 days ago
      who cares if they do it with LLMs or not? how do you define agi?
      • mschuster91 2 days ago
        > how do you define agi?

        An AI that has enough sense of self-awareness to not hallucinate and to recognize the borders of its knowledge on its own. That is fundamentally impossible to do with LLMs because, in the end, they are all next-token predictors, while humans are capable of a much more complex model of storing and associating information and context and, most importantly, of developing "mental models" from that information and context.

        And anyway, there are other tasks than text generation. Take autonomous driving, for example: a driver sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context: is the person crossing in front of the car some old granny on a walker, or a soccer player? Or a human sees a ball kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer, "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars", but an object detection/classification model lacks the ability to even recognize the ball as a potentially relevant piece of context.

        • og_kalu 2 days ago
          > Take autonomous driving, for example: a driver sees a person attempting to cross the street in front of them. A human can decide to slam the brake or the gas depending on the context: is the person crossing in front of the car some old granny on a walker, or a soccer player? Or a human sees a ball kicked into the air on the sidewalk behind some cars, with no humans visible. The human can infer, "whoops, there might be children playing here, better slow down and be prepared for a child to suddenly step out onto the street from between the cars"

          These are just post-hoc rationalizations. No one making those split-second decisions under those circumstances has those chains of thought. The brain doesn't 'think' that fast.

          > but an object detection/classification model lacks the ability to even recognize the ball as a potentially relevant piece of context.

          We're talking about LLMs, right? They can make these sorts of inferences.

          https://wayve.ai/thinking/lingo-2-driving-with-language/

          • onemoresoop 2 days ago
            You’re one of those who think the human brain is just an LLM?

            It could be possible to use LLMs to build a Rube Goldberg type of brain, or something that will mimic a human brain, but it will have the same flaws LLMs have and will never reach parity with humans. I think AGI is possible, but we're too focused on LLMs to get there yet.

        • wilg 2 days ago
          again i don't care whether it's done with an LLM or not. there's no reason to think openai will only build LLMs. recognizing the borders of its knowledge is a reasonable thing to include in an agi definition i suppose, but it does not seem intractable.

          for the second one, ai drivers like tesla's current version are already skipping object detection/classification and instead use deep learning on the entire video frame, and could absolutely use the ball or any other context to change behavior, even without the particular internal monologue described there.

          • onemoresoop 2 days ago
            I haven't seen any new sparks of intelligence. But it remains to be seen what OpenAI does. So far I haven't seen any paradigm shifts, or indications that they're not just scaling up and making their training corpus more vast. I could be wrong, but if they had something we'd know by now. Every ChatGPT release has been hyped up but somewhat disappointing to many. But what do I know...
        • PittleyDunkin 2 days ago
          > An AI that has enough sense of self-awareness to not hallucinate

          It's not entirely clear that this is meaningful. Humans engage in confabulation, too.

          • onemoresoop 2 days ago
            Humans engage in confabulation, but they're mostly aware of it. In some mental disorders they may not be aware, though statistically that is not too significant. And no, we normally don't confabulate as much as the current crop of AI, aka LLMs.

            As tools, LLMs are fantastic, and I am glad to look at them solely as powerful tools. AGI is not here yet, and maybe that's a good thing. Who would want some kind of artificial intelligence that is capable of understanding us, capable of using psychological tricks on people, that could have different goals than us, and so on?

            • wilg 2 days ago
              https://www.medicalnewstoday.com/articles/confabulation#vs-l...

              > Confabulators are usually unaware they are providing false information. They often display genuine surprise or confusion when evidence of facts contradicts their statements.

              This is similar to LLMs actually. But it also seems like various "System 2" things like chain of thought could compensate for this issue in the LLM (and that possibly that is similar to how the brain works).

            • PittleyDunkin 2 days ago
              > Humans engage in confabulation, but they're mostly aware of it.

              I'm not sure this is the case at all. Some awareness of this doesn't imply full awareness. In my experience, most people are unaware of how incoherent their worldviews are, so the distinction between normative and confabulatory behavior isn't clear.

      • portaouflop 2 days ago
        We have this discussion every minute -.-
      • lm28469 2 days ago
        I care because it's brought to us by the same deranged brains who have promised self-driving cars "in two years" every year since 2012, and a fully autonomous Mars city "by 2030".

        We're all wasting time and resources on what basically amounts to alchemy while we could tackle real problems.

        Tech solutionists keep making promises for the next 5-10-20 years and never deliver: AI, electric planes, clean fuel, autonomous cars, the metaverse, brain implants. You'd expect the internet to have made people smarter, but we fall for the same snake oil as 100 years ago, en masse this time.

        • wilg a day ago
          i mean there’s progress on all those things and that’s good and there’s no downside really?
      • goatlover 2 days ago
        Whatever makes OpenAI enough money?
      • dgfitz 2 days ago
        … very carefully?
  • hnburnsy 2 days ago
    Maybe the task to implement it was scheduled by ChatGPT...

    https://news.ycombinator.com/item?id=42716744

    • Bilal_io 2 days ago
      Sorry, the task failed for unknown reasons.
  • DidYaWipe 2 days ago
    Shocking news about the company that fraudulently left "open" in its name after ripping off donors.

    I think the headline is too generous here. More accurate would be "OpenAI neglects to deliver opt-out system..."

    • HeatrayEnjoyer 2 days ago
      Sorry, who did they rip off?

      All their investors stand to profit handsomely (if they live).

      • hansvm 2 days ago
        They ripped off everyone they lied to. They took money under the premise that they'd put humanity first as this AI transition happened (both in safety and in knowledge sharing), and they instead used that money to build a moat that'll make it harder for anyone else to accomplish those same goals. Investors in the original vision would have been better off had they not contributed any funds, and the monetary profit they're receiving in exchange won't be enough to offset those damages (in the sense that it's not enough to fund somebody attempting to execute the same mission now that OpenAI exists, at least not with the same chance of success they anticipated when OpenAI was younger).
      • DidYaWipe a day ago
        Are you saying that donors to their "non-profit" received shares in the now-for-profit enterprise?

        And if so, do you have a citation for that?

  • thrance 2 days ago
    Another one of these daily reminders that we live in a two-tiered justice system: everything you ever created is fair game to them, but don't you dare use a leak of their weights, lest you be thrown in jail.
    • jsheard 2 days ago
      According to OpenAI, you're not even allowed to use GPT output to train a competing model, so they believe AI models are the only thing worthy of protection from being trained on. Llama used to have a similar clause, which was partially walked back to "you must credit us if you train on Llama output" in later versions, but that's still a double standard, since they don't credit anything that Llama was trained on. And now we know that Zuck personally greenlit feeding it pirated books.
      • umeshunni 2 days ago
        Well, that hasn't stopped DeepSeek.
        • pton_xd 2 days ago
          Honestly, good for them. This whole "we can use your output for our input, but don't even think about doing the same" is just absurd.
  • devit 2 days ago
    Aren't lawsuits the proper way to address this?

    Seems like there's an argument that model weights are a derivative work of the training data, at least if the model is capable of producing output that would be ruled to be such a derivative work given minimal prompting.

    Although it may not work with photography since the model might just almost exclusively learn how the object of the photo looks in general and how photos work in general, rather than memorizing anything about specific photos.

    • fenomas 2 days ago
      I think that argument falls down though, because a derivative work is an expressive work in its own right, and model weights aren't.

      It would seem more coherent to argue that a model output could be a derivative work, though it would need to include a significant portion of some given source. But even then, since the copyright office's position is that they're not copyrightable, I'm not sure they could qualify.

      • devit 2 days ago
        Model weights, if they can reproduce something like the original, are just a form of lossy compression (or even lossless for text), where the LLM answering the prompt is a more powerful version of asking software to retrieve a specific file from a Zip archive (or a webserver answering an HTTP query) of such lossy compressed data.

        So if model weights don't infringe, that would also imply that saving an image as a JPG, or a video using AV1, doesn't infringe, which would effectively imply that copyright doesn't apply to images or videos on the web. That is not current law/policy, so I think that reasoning cannot possibly work.

        • fenomas 2 days ago
          That comparison would only make sense if compressed images were considered derivative works. They're not - copyright doesn't protect bytes on a disk, it protects creative expressions. Lossy compression doesn't affect the creative expression, so in copyright terms a compressed JPG is just a copy, and is covered exactly like the original image.

          In contrast a derivative work is one creative expression that contains elements of another - like when you take an image and add commentary, or draw your own addition onto it, etc. And I'm pointing out that a trained model is not that - it's not itself a copyrightable expressive work. (We could think of it as a kind of algorithm for generating works, but algorithms aren't copyrightable.)

          • devit 16 hours ago
            Well then the model weights would be a compilation of copies of the original works, which has the same effect as it being a derivative work unless the copyright holder chose to allow copies but not derivative works.
  • econ 2 days ago
    In my mental imagery, this is a situation that any advancing civilization in the universe should eventually run into. There will be all kinds of materials, from the laborious and expensive to the effortless and "I was the first" or some other entitlement. It all boils down to having or not having such automatons. I'm sure there have been plenty who, like us with our books, have successfully denied progress. I'm also sure there have been plenty where it was completely obvious to upload the entire database of ET knowledge.

    It is equally obvious what the latter gained and the former lost in the process.

    We, with our books, have successfully prevented people from educating themselves with amazing implications. Now the challenge is to create equally impotent machines!

    You have no further questions :)

    • econ 2 days ago
      The brain chip is just one more interface. I feel the need to remind the younglings that in the time before the internet, we talked a lot, and talked about whatever we wanted to. Moderation was done by the speaker himself, by knowing people. Imagine that! It sometimes got emotional or violent, but that is an important part of communication. Looking at your watch was somewhat of an insult, as if the other person failed to be interesting enough. Today no one is interesting enough to talk in long form; few remember how.

      Now imagine direct thought moderation. After all, thoughts belong to people? I thought it first? You can't just... It is clear we should control your thoughts. We can't just have you think random things. It would be like TikTok! Or like reading books! Terrifying!

      We are quite used to the man behind the curtain deciding everything for us. At what point would the deal get too absurd, I wonder? Would 1984 eventually become a really boring book? Would it exist at all? Would people save up social credits to read it?

      Other civilizations must have tried all possible variations with rather predictable results. To a free mind I mean.

      Or are we already puppets on a string? How much am I boring you with this? Should I be allowed?

    • DaiPlusPlus 2 days ago
      > We, with our books, have successfully prevented people from educating themselves with amazing implications

      Que?

      No, really... what?

      • econ a day ago
        I mean how we, instead of setting the books free, keep them in cages and sell tickets.

        It seems to me any civilization in the history of the cosmos will inevitably reach a stage where it has to choose to make knowledge available in order to solve problems.

        One should only have to type the title of a book, then get to browse around for a bit, send a link to someone, etc.

        Anything else is suicidal nonsense.

        Tax hard-working people to pay to defend dead people's pixels from copying?

        No one knows who or what an author is, if there even is one. If I generate, or write by hand, all word combinations, I don't get to own them.

        Enforcement is much too expensive for normal people, if one even notices the copying. They just get to pay for it.

        An elaborate scheme in order to not solve problems, not innovate and not progress.

  • Terr_ 2 days ago
    "By continuing, you agree that using any content from this site in training Generative AI grants the site-owner a perpetual, irrevocable, and royalty-free license to use and re-license any and all output created by that Generative AI system, including but not limited to derivative works based on that output."

    Just a GPL-esque idea I've been musing on lately [0]; I'd appreciate any feedback from actual IP lawyers. The idea is to add a poison pill: if a company "steals" your content for its own profit, you can strike back by making it very hard for it to actually monetize the results. Since it's a kind of contract, it doesn't rely on how much of the work seems to be surfacing in a particular output.

    So, supposing ArtTheft Inc. snarfs up Jane Doe's paintings from her blog, she (or any similar victim) can declare that they grant the world an almost-public license to anything ArtTheft Inc. has made based on that model. If this happens, ArtTheft Inc. could still make some money selling physical prints, but anyone else could undercut them with free or cheaper copies.

    [0] https://news.ycombinator.com/item?id=42582615

    • amiantos 2 days ago
      That's cute. I'm going to put that at the start of any creative work I make, so that anyone who sees it owes me a license to everything they ever make afterward, because a nugget of their life experience legally belongs to me now and all their creative works are tainted by it.
      • Terr_ 2 days ago
        I didn't expect you to admit you were a computer program so readily. :p

        Do you have any more substantive critique? It sounds like you're trying to argue that the terms would be found unconscionable. However, it's not asking for any payment, or even any effortful action: it's just saying that the site owner provides content on the condition that if you incorporate that content into a generative product, the site owner gets to use the results too. Clearly the people hoovering up training data believe my work has some economic value to them, or they wouldn't be running a giant web crawler hitting every page of the blog; it's not as if they're arriving out of boredom, or because they followed some opaque hyperlink in curiosity.

    • protocolture 2 days ago
      I don't think you can exclaim away fair use protections. Otherwise everyone already would.
      • Terr_ a day ago
        > I don't think you can exclaim away fair use protections.

        "Copyright doesn't stop me from X" is different from "copyright lets me do X even though I agreed to a contract saying I wouldn't." (I have many problems with modern click/shrink-wrap, but that's a whole 'nother can of worms and I'm just trying to "fight fire with fire" here.)

        If the average ToS has no force, then HN is currently infringing on my copyright by showing this post to you.

        • protocolture a day ago
          There's no valid ToS provided to a human to read when I am scraping the entire internet. I don't think wget can sign a contract?

          Has anyone managed to hit Google or Yahoo with a ToS violation?

          • Terr_ 20 hours ago
            Unclear.

            There was hiQ Labs v. LinkedIn but that focused on whether it was unauthorized access under the CFAA.

            In X Corp. v. Bright Data Ltd., a quick skim suggests ExTwitter's ToS lost because (A) it wasn't really the owner of the content and (B) they couldn't easily show harm.

            IANAL again, but for personal blog, (A) is unlikely to apply, and (B) could be shown if ArtTheft Inc. starts causing legal fees by threatening the blogger for exercising the re-licensing in the ToS.

  • Der_Einzige 2 days ago
    Good.

    Everyone gets big mad when someone with money acts like Aaron Swartz did. The only bad thing about OpenAI is that they're not actually open-sourcing or open-accessing their stuff. Mistral or Llama "training on pirated material" is literally a feature, not a bug, and the tears from all the artists and others who get mad are delicious. These same artists would profess literal radical Marxism but become capitalist Luddite copyright trolls the moment the means of intellectual production are democratized against their will.

    If you posted something on the internet, I can and will put it into IP-Adapter and take your style and use it for my own interests. You cannot stop me except by not posting it where I can access it. That is the burden of posting anything on the public internet. You opt out by not doing it.

    • tehjoker 2 days ago
      Such a weird argument. A company doing the same thing as Aaron Swartz is doing it for personal gain, not for our collective benefit.
    • amiantos 2 days ago
      It is comical to me how fast the anti-RIAA internet turned into a bunch of copyright maximalists who expect organizations like the RIAA to protect them in some way. In actuality, if someone manages to weaponize copyright against AI, it will only successfully be used by massive rights holders to extract payouts from AI companies, none of the money will be given to any of the creatives, and creatives will naturally still not be very happy about it. Spotify 1.0 is rights holders streaming your content and paying you fractions of pennies for it; "Spotify 2.0" will be them licensing your content to AI companies and paying you a fraction of a fraction of a penny, just once.
  • dadbod 2 days ago
    The tool was called "Media Manager" LMFAO. A name so uninspired it perfectly reflected how little they cared.
    • grajaganDev 2 days ago
      LOL, it sounds like something from '90s-era Microsoft.
  • monomyth 2 days ago
    this is as retarded as asking someone to forget a picture they have seen
  • 9283409232 2 days ago
    People need to understand these companies are not good actors and will not let you opt out unless forced. I have a 20-dollar bet with a friend that Trump's admin will get training data classified as fair use and the whole issue will be done away with anyway.
    • protocolture 2 days ago
      It's clearly fair use regardless of what Trump does.
      • lm28469 2 days ago
        I don't think anything is clear when it comes to laws... even 10-word sentences from the constitution have been debated since they were written.
      • 9283409232 2 days ago
        It isn't clearly anything, since there are dozens of lawsuits about it going on right now.
        • protocolture a day ago
          Other than the bloke who tried to make his LLM the author, I haven't seen any lawsuits uphold your point of view.

          There was a great article posted here last year rounding up all the various courts that upheld ownership for prompters of LLM output, including China (possibly twice).

          If recombining data from images in a way that leaves not a single trace of any original isn't fair use, then fair use ceases to exist. There is hardly a fairer use. Any existing fair-use outcome involves actual recognizable elements of the original work. There aren't really two directions on this. The damage that success for the anti-AI folk would do to IP law is tremendous.

    • dylan604 2 days ago
      Apparently, Trump has a lot of training data stored in a bathroom, so there's that
  • testfrequency 2 days ago
    Probably means nothing, but all the people I know who went to OpenAI and are still there are all the people who made very poor business decisions and were hated at multiple companies I worked for.

    I highly doubt any of them will be good stewards of anything but selfishness.

    As for the others, they were all smart, passionate, dedicated folks who knew Sam was a complete narcissist and left to start their own AI startups.

    (sorry mods, I’m upset and I’m annoyed OpenAI is getting away with murder of society in plain view)

  • passwordoops 2 days ago
    Give it a month and they should have no problem deploying their inevitable AGI to deliver the opt-out system, right? /S
  • allsummer 2 days ago
    One should be able to opt out of AI training, but then testing that AI should also become impossible. Else you are freeloading just as much as you accuse the AI companies of doing.