319 points by ta988 6 days ago | 21 comments
  • rglover6 days ago
    Aaron was the OG. If you've never dug through his blog, do yourself a favor [1]. Also make some time to watch The Internet's Own Boy doc about him [2] and look up some of his talks during the SOPA shenanigans. RIP.

    [1] http://www.aaronsw.com/weblog/

    [2] https://www.youtube.com/watch?v=9vz06QO3UkQ&rco=1

  • benatkin6 days ago
    I support what aaronsw did, and I don’t think he should have gotten in any trouble for it, let alone to the tragic level it went to. As for sama, I’m not sure; on one hand I like the innovation, and on the other hand it’s very worrying for humanity. I appreciate the post and the fond memories of Aaron, but I’m not in complete agreement with the author about sama.
  • idlewords5 days ago
    The right person to contrast with Aaron Swartz is Alexandra Elbakyan. She got it done without any of the drama, fuss, or prominent mentors.
    • divbzero5 days ago
      She absolutely got it done, but not surprisingly there was still significant legal backlash she had to deal with.

      https://www.nature.com/articles/nature.2017.22196

      • idlewords5 days ago
        Yeah, I'm not saying it was easy for her. Quite the opposite!
    • csomar5 days ago
      I guess it helped that she lived in Russia, where she could throw the US government's laws into the garbage bin?
  • mastazi6 days ago
    In the photo there are some other faces that I think I might recognise, but I'm not 100% sure. Is there a list of everyone in the picture somewhere on the internet?

    Edit: I think the lady on the left is Jessica Livingston, and that's a younger PG on the right.

    • aimazon6 days ago
      https://i.imgur.com/e0GPhSE.jpeg

      1. zak stone, memamp

      2. steve huffman, reddit

      3. alexis ohanian, reddit

      4. emmet shear, twitch

      5. ?

      6. ?

      7. ?

      8. jesse tov, https://www.ycombinator.com/companies/simmery

      9. pg

      10. jessica

      11. KeyserSosa, initially memamp but joined reddit not long after (I forget his real name)

      12. phillip yuen, textpayme

      13. ?

      14. aaron swartz, infogami at the time

      15. ?

      16. sam altman, loopt at the time

      17. justin kan, twitch

      • zurvanist5 days ago
        > 11. KeyserSosa, initially memamp but joined reddit not long after (I forget his real name)

        Chris Slowe

        Clickfacts had 3 founders so probably that's 3 of your ?s.

        The photo has no 13 btw.

      • mastazi6 days ago
        Amazing, thank you!
    • tptacek6 days ago
      No, that's Jessica Livingston, the cofounder of YC.
      • mastazi6 days ago
        yes, you are right, I edited my comment.
    • idlewords5 days ago
      It would be pretty terrifying if it was an older PG.
  • ilrwbwrkhv6 days ago
    Oh man. Heavy stuff. Will our industry be looked back on as good or bad? I hope we end up doing good for the world.
    • memonkey6 days ago
      Hard to say when every industry has a profit motive. It seems like no industry at the moment is really looking for human advancement; or maybe it is advancing things, but only if the results are expensive for end users and efficient/proprietary for the company.
      • ilrwbwrkhv6 days ago
        Yes, but the thing is that our industry has almost unparalleled leverage, and the marginal cost is zero.
    • zx10rse6 days ago
      Open sourcing the models is the only right decision.
      • remram6 days ago
        I don't understand. If something hurts your civilization but it was free, does that make it better? Like if everyone was able to build a nuclear bomb, would that make the ensuing nuclear winter more moral?
        • zx10rse5 days ago
          You contradict your own analogy. Do you live in a nuclear winter? Because there are plenty of countries with nuclear weapons capabilities, and half of them are considered enemies. Why are they not bombing each other? And why do you even treat nuclear winter as a question of morality?
  • khazhoux6 days ago
    Thank you, Sam Altman and everyone at OpenAI, for creating ChatGPT and unleashing the modern era of generative AI, which I use every day to speed up my job and coding at home.

    Signed,

    Someone who doesn't care that you're making $$$$ from it

    • edgineer6 days ago
      The point is that regardless of whether you're negative, neutral, or positive about others using data for profit, you would hold those who use it altruistically in higher regard.
      • xtracto6 days ago
        I hold both of them in high enough regard. Like Aaron, I did my fair share of book/article piracy, even before it was online (here in Mexico it was very common to Xerox and share whole books among students in the 80s and 90s).

        I understand that Aaron became a martyr, even though he died due to depression and not for "a cause". I applaud what he achieved as a person.

    • amelius6 days ago
      The usual caveat applies: I'm okay with them making money from it, until they start using that money against the rest of us.
    • angoragoats6 days ago
      If I’m an author and I don’t want my work included in the corpus of text used for training ChatGPT, should I have that right?

      What about if I’m an artist and I don’t want my work included in the training data for an image generation model?

      • 627467 6 days ago
        You may have that right as long as you agree that others have the right to not care about your right when deciding to use "your" stuff however they want.
        • kevingadd6 days ago
          The two halves of your statement contradict each other. What are you trying to say?
        • angoragoats6 days ago
          Congratulations on not answering the question I asked, and at the same time saying something that makes no logical sense.
      • worik6 days ago
        > I don’t want my work included in the corpus of text used for training ChatGPT, should I have that right?

        No

        You could choose not to publish, and not be read.

        If you are read, you can be learned from.

        • angoragoats6 days ago
          Generative AI is not learning.

          Copyrights don’t depend on whether I choose to publish a particular work or not.

          Zuckerberg personally approved using pirated content to train a model. Is that OK too?

          • Kiro6 days ago
            Absolutely. Abolish all copyright. I can't believe hackers have lost their pirating roots and have become copyright zealots instead.
            • angoragoats5 days ago
              Just wanted to respond to this point as well, to say that while this is a logically consistent stance, painting all hackers as pirates is unfair.

              Even Aaron Swartz, the subject of this post, made a distinction (both in his writing/talks and his actions) between academic, scientific, and judicial articles, which he believed should be free for everyone, and other copyrighted content such as books and movies. He seemed to take a softer stance on the latter.

              I am not a copyright zealot by any means, but I’m trying to keep an open mind regarding the seemingly knee-jerk “abolish all copyright” takes from certain people in the AI sphere.

            • xtracto6 days ago
              This! Thing is, you are preaching to the wrong choir.

              Hacker News's roots were not about the "original" hacking, but more about Silicon Valley for-profit entrepreneurs creating stuff for profit to, heavy quotes, "make the world a better place" and "democratize X".

              The real hackers were on Usenet, /. and similar forums. Now I'm too old and haven't found a good place.

              Information wants to be free; copyright is a last-millennium construct for earning money from invention. (Before copyright, people got paid by patrons to invent their stuff.)

              Copyright should be left at the 20th century's door, and whoever wants to "monetize" their intellect should use methods fit for the 21st.

              • angoragoats5 days ago
                At least in the United States, copyright has existed since the founding of the country, and the Constitution explicitly grants Congress the authority to regulate it. It’s not as recent as you make it out to be, and it’s not going away anytime soon. To the extent that there is a problem to be solved here, it must be solved within the context of copyright law.
          • starfezzy6 days ago
            IP is fake.

            What I mean is, all laws are fake, but while we have to follow some collective hallucination about magic words giving people authority—to keep society together—the specific delusion that we should enforce artificial scarcity by extending monopolies to patterns of information is uniquely silly.

          • worik5 days ago
            > Generative AI is not learning.

            Walks like a duck and quacks like a duck; I think it's a duck

            • angoragoats5 days ago
              No, it neither walks nor quacks like a duck.

              Once a model is trained, the weights are static unless you train it again. This is a huge practical difference between LLMs and humans. I need to give the LLM the context I want it to consider every single time I use it, because it doesn’t learn from past use at all; it just produces a plausible string of words based on its constant, unchanging set of weights.
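
              To make that concrete, here's a minimal sketch using the OpenAI Python SDK (untested, and the model name is just an illustrative assumption). The API is stateless; any apparent "memory" is just the caller resending prior messages:

                # Minimal sketch: two independent calls to a chat model.
                from openai import OpenAI

                client = OpenAI()

                # First call: this context exists only for the duration of the call.
                client.chat.completions.create(
                    model="gpt-4o",  # illustrative model name
                    messages=[{"role": "user", "content": "My name is Ada."}],
                )

                # Second call, fresh message list: nothing from the first call
                # persists, because using the model does not change its weights.
                second = client.chat.completions.create(
                    model="gpt-4o",
                    messages=[{"role": "user", "content": "What is my name?"}],
                )
                print(second.choices[0].message.content)  # it can only guess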

          • wiseowise6 days ago
            Yes.
      • CaptainFever6 days ago
        No, you should not have that right. Copyright allows you to sell artificial scarcity. AI does not replicate your work directly, so you can still sell your artificial scarcity even if your work is trained on.

        At least you're acknowledging that training rights are a proposed expansion of current IP laws!

        • angoragoats6 days ago
          > Copyright allows you to sell artificial scarcity.

          Not always. That’s more the domain of patents, honestly.

          > AI does not replicate your work directly.

          This is false, at least in certain cases. And even if it were always true, it doesn’t mean it doesn’t infringe copyright.

          > At least you're acknowledging that training rights are a proposed expansion of current IP laws!

          Yes, they are, emphasis on the “proposed,” meaning that I believe that training AI on copyrighted material without permission can be a violation of current copyright law. I don’t actually know if that’s how it should be, honestly. But as of right now I think entities like the New York Times have a good legal case against OpenAI.

          • CaptainFever6 days ago
            You are probably correct, legally. But given that we're talking about Aaron Swartz, who was legally in the wrong but morally (in the classic hacker's sense) in the right, I meant "copyright allows you to sell artificial scarcity" in the moral sense.

            I think fundamentally we have a difference in opinion on what copyright is supposed to be about. I hold the more classic hacker ideal that things should be free for re-use by default, and copyright, if there is any, should only apply to direct or almost-direct copies up to a very limited time, such as 10 years or so. (Actually, this is already a compromise, since the true ideal is for there to be no copyright laws at all.)

            • angoragoats6 days ago
              I generally agree with you about the desired use for copyright, but what gives me pause is the scale at which training AI models uses copyrighted material. I’m not sure there’s a precedent in the past to compare it to. And the result, despite many years and billions of dollars worth of work, is something that can’t even reliably reference or summarize the material it’s trained on.
              • CaptainFever6 days ago
                You're right that there's not much precedent to this, which is why I do think that "training rights" is a valid proposal, as long as the proponents acknowledge that this would be an expansion of current IP laws (i.e. further away from the ideal of a limited copyright).

                Though... if you say "And the result, despite many years and billions of dollars worth of work, is something that can’t even reliably reference or summarize the material it’s trained on.", doesn't this imply that there's not much to worry about here? I sense that this is a negative jab, but this undermines the original argument that there is so much worth in the model that we need to create new IP laws to handle it.

                I mean, I'm not sure what to make of this statement in the first place. Training data should be for the model to learn language and facts, and referencing or summarizing the material directly seems to be out of scope of that. One tends to summarize the prompt, not training data.

                • angoragoats6 days ago
                  > the original argument that there is so much worth in the model that we need to create new IP laws to handle it

                  No one argued this, to my knowledge. I think that there might be a need for new copyright laws, but the alternative in my mind is that we decide there's not a lot of worth there, meaning that we do nothing, and what OpenAI/Meta/MS/Google/Anthropic/etc are doing is simply de jure illegal. The statement I made about LLMs having major flaws is a point in support of this alternative.

                  > Training data should be for the model to learn language and facts, and referencing or summarizing the material directly seems to be out of scope of that.

                  I strongly disagree, as your prompt can (and for a certain type of user, often does) contain explicit or implicit references to training data. For example:

                  * Explicit: “What is the plot of To Kill a Mockingbird by Harper Lee?”

                  * Implicit: “How might Albert Einstein write about recent development X in physics research?”

    • Earw0rm6 days ago
      I'll use it to find information, semi-reliably. Hallucinations are still a huge issue. But I can't help thinking that Stack Overflow and Google have self-enshittified to the point where LLMs look better, relative to the pinnacle of more conventional knowledge engines, than they actually are.

      If you take the evolution of those platforms from, say, 2005-2015 and project forward ten years, we should be in a much better place than we are. Instead they've gone backwards as a result of enshittification and toxic management.

    • owebmaster6 days ago
      [dead]
  • elp6 days ago
    Aaron Swartz was targeted by some pretty overly zealous prosecution, no objection there, but let's not forget what he really did.

    He put a laptop in a wiring closet that was DoSing JSTOR, and he kept changing IPs to avoid being blocked. The admins had to put a camera on the closet to eventually catch him.

    He might have had good intentions, but the way he went about getting the data was throwing soup at paintings levels of dumb activism.

    For all the noise, the real punishment he was facing was 6 months in low security [1]. I'm pretty sure OpenAI would have also been slapped hard for the same crime.

    [1] https://en.wikipedia.org/wiki/Aaron_Swartz#Arrest_and_prosec...

    Edit: added link

    • omnimus6 days ago
      “charges carrying a cumulative maximum penalty of $1 million in fines plus 35 years in prison” https://en.m.wikipedia.org/wiki/United_States_v._Swartz

      I didn't think people on “hacker news” would be defending what happened to Aaron Swartz.

      • cowsandmilk6 days ago
        > charges carrying a cumulative maximum penalty of $1 million in fines plus 35 years in prison

        Any lawyer knows that is stupid math. The DOJ has sentencing guidelines that never add up the years in prison for charges to be served consecutively. The media likes to do that to get big numbers, but it isn’t an honest representation of the charges.

        I don’t think the charges against Swartz should have been filed, but I also can’t stand bad legal math.

        • omnimus6 days ago
          Sure, but… could he technically have gotten that or not? If somebody really wanted to punish him, they could push it to what? 3 years? 5 years? 10 years?

          Because some people really wanted to punish him.

          I am just reacting to the downplaying, the suggestion that he would only get 6 months in jail, as if he were some weak person for committing suicide over that.

      • tptacek5 days ago
        Swartz's own lawyer, writing after his death, said he didn't believe Swartz would have received a custodial sentence even if he had gone to trial and lost. The prosecutors were offering him months in custody, against a 6-7 year sentence they believed they could get (implausibly, if you run the guidelines calculation). Nobody has to take the "35 years" thing seriously; nobody involved directly in this case did. Swartz was exactly the kind of nerd who would have memorized the sentencing guidelines just to win arguments on a message board (that's a compliment) and he had extremely good lawyers.

        (I'm ambivalent about everything in this case and certainly don't support the prosecutors, but much of what gets written about Swartz's case is misinformation.)

    • izabera6 days ago
      Just for context, every other day there is a new post on HN about OpenAI DDoSing half the internet:

      https://news.ycombinator.com/item?id=42660377

      https://news.ycombinator.com/item?id=42549624

      • alphan0n5 days ago
        Just for context, the author of the second link in your comment verifiably lied about blocking crawlers via robots.txt

        CommonCrawl archives robots.txt

        For convenience, you can view the extracted data here:

        https://pastebin.com/VSHMTThJ

        You are welcome to verify for yourself by searching for “wiki.diasporafoundation.org/robots.txt” in the CommonCrawl index here:

        https://index.commoncrawl.org/

        The index entry contains a file name that you can append to the CommonCrawl URL to download and view the archive. More detailed information on downloading archives is here:

        https://commoncrawl.org/get-started

        From September to December, the robots.txt at wiki.diasporafoundation.org contained this, and only this:

        > User-agent: *

        > Disallow: /w/
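
        If you'd rather script the check than click through the index, something like this untested sketch should work (the crawl ID is an assumption; pick any index listed at index.commoncrawl.org):

          # Sketch: look up a capture in the CommonCrawl CDX index, then fetch
          # just that record from the archive with an HTTP Range request.
          import gzip, json, requests

          INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-51-index"  # assumed crawl ID
          resp = requests.get(INDEX, params={
              "url": "wiki.diasporafoundation.org/robots.txt",
              "output": "json",
          })
          hit = json.loads(resp.text.splitlines()[0])  # first capture returned

          # Each index entry names the archive file plus the byte range of the
          # record, so a Range request fetches one gzipped WARC record.
          start = int(hit["offset"])
          end = start + int(hit["length"]) - 1
          warc = requests.get("https://data.commoncrawl.org/" + hit["filename"],
                              headers={"Range": f"bytes={start}-{end}"})
          print(gzip.decompress(warc.content).decode("utf-8", errors="replace"))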

      • elp6 days ago
        If you ask OpenAI to stop, using robots.txt, they actually will.
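
        For reference, that's a stanza like this in robots.txt (GPTBot is OpenAI's documented crawler user agent):

          User-agent: GPTBot
          Disallow: /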

        What Aaron was trying to achieve was great; how he went about it is what ruined his life.

        • LunaSea6 days ago
          It is a well-known fact that OpenAI stole content by scraping sites that hosted illegally uploaded content.
        • omnimus6 days ago
          Nobody really asked Aaron anything; they collected more evidence and wanted to put him in jail.

          The school should have unplugged his machine, brought him in for questioning, and told him not to do that.

        • breck5 days ago
          [dead]
    • nathancahill6 days ago
      Which individual suffered harm from Aaron's laptop in the closet?
      • tptacek5 days ago
        As I recall, the whole campus lost their automatic access to JSTOR for a time.
    • lofaszvanitt6 days ago
      Aaron had an unstable personality and they took advantage of that. A nudge here and there, and here comes the suicide. Look at the people Aaron frequented to find the culprits...
    • Earw0rm6 days ago
      No paintings were harmed in the throwing of soup, and now we all know it happened and why.

      Would that I were that kind of dumb.

    • cassianoleal6 days ago
      > the way he went about getting the data was throwing soup at paintings levels of dumb activism.

      Throwing soup at paintings doesn’t make the paintings available to the public.

      What he did had a direct and practical effect.

      • cowsandmilk6 days ago
        > What he did had a direct and practical effect

        The main impact of Aaron Swartz’s actions was that it became much more difficult to walk onto MIT’s campus and access journal articles from a laptop without being a member of the MIT community. I did this for a decade beforehand, and it became much more locked down in the years after his actions due to restrictions the publishers pushed on MIT. Aaron intentionally went to the more open academic community in Cambridge (Harvard, his employer, was much more restrictive) and in the process ruined that openness for everyone.

  • tptacek5 days ago
    I don't understand the singling out of Altman here. If there's shade to throw at Altman, it's that his company occupies a position similar to that of Tesla: an early mover to a technology that appears to be on a path to universal adoption, including by large incumbents. It's hard to see what would be different about things were Altman not in the position he occupies now.
    • bookaway5 days ago
      They were in the same YC batch, standing next to each other in a photo, so someone looked at the photo and chose to juxtapose their work and fates on the day Aaron Swartz died. If this is what you mean by "singling out", I don't see what's hard to understand.
    • ycombinatrix5 days ago
      They both did mass copyright infringement...
  • begueradj6 days ago
    Aaron was a developer himself but Sam ... ?
  • Sid9506 days ago
    Idk but I find Aaron actually cool and intelligent.
  • 3000 5 days ago
    One was a hero; the other works for (or is!) Cyberdyne Systems and doesn't even seem to realize it.
  • qwertox6 days ago
    Sam Altman wouldn't spend a second reflecting on this.
    • piyuv6 days ago
      [flagged]
      • dang5 days ago
        Please don't do this here.

        https://hn.algolia.com/?sort=byDate&type=comment&dateRange=a...

        (yes, the same applies to anyone else)

      • CaptainFever6 days ago
        Do you have proof that he is a sociopath?
        • vdupras6 days ago
          [flagged]
        • okwhateverdude6 days ago
          [flagged]
          • xtracto6 days ago
            I hate these types of exchanges that are happening more and more on HN: passive-aggressive comments and replies from both sides of an argument.

            The principle of taking comments' meaning at their best value keeps being eroded, and the comment sections for these types of stories get full of exchanges like this.

            There's no value in GP asking "do you have proof that X goes twice a day to the toilet?"... and of course the reply is as empty as the question.

            The level of discourse is getting lower and lower.

            • CaptainFever5 days ago
              Yeah, it was bad conduct of me. I apologise, and I'll try to just flag and move on in the future instead.
            • binary_slinger6 days ago
              As HN becomes more popular this will become more prevalent. The same thing happened to Reddit over the past 14 years.
              • slater5 days ago
                https://news.ycombinator.com/newsguidelines.html

                "Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills. "

                • binary_slinger5 days ago
                  My comment is a bit more nuanced than that.

                  The barrier to entry to HN is anyone with an internet connection and the motivation to create an account, with absolutely no verification. That's quite a low bar, and it just invites bad behavior.

                  In the old forum days you often had to be invited. Case in point: lobste.rs is invite-only.

          • CaptainFever6 days ago
            Sure, thanks.
  • dehrmann6 days ago
    This post misses a lot of nuance. Aaron Swartz was an activist who did obviously illegal things and got caught and prosecuted. What OpenAI is doing is in legal gray area because it might be transformative enough to be fair use; we just don't know yet.
    • dkjaudyeqooe6 days ago
      Simply being transformative is not sufficient for it to be fair use.

      But more to the point, if it's deemed illegal, Altman won't suffer any personal legal consequences.

    • jmcgough6 days ago
      It's not about simply not doing something illegal - we all regularly commit crimes that we could be charged with if we piss off the wrong people. When a company does it, it's "disrupting" and heavily rewarded if the company has enough funding or revenue. When people like AS do it, they get destroyed. Selective enforcement in order to maintain the status quo. The last few years have clearly shown that if you are wealthy enough, the rules do not apply to you.
      • gklitz6 days ago
        If OpenAI had run its company by hiding their hardware around a university campus, they would have gotten in trouble too. It is not so much the scraping as the part where MIT sees a masked person sneaking into back rooms hiding equipment that got AS in trouble. And of course he literally disrupted service to JSTOR because he did a poor job of scraping it. He could have gotten through all of this if he had appeared less like a cyber terrorist in executing his plan, but of course he thought he wouldn't get caught, so he never considered how it would look if he did.
      • rurban5 days ago
        No, we don't. We are not criminals, unlike blackhat hackers, data hoarders, CIA-affiliated companies (Meta, Google) or the typical big-company sociopathic fraudster.
    • kevingadd6 days ago
      By US copyright law, OpenAI is engaged in illegal conduct. The fact that they haven't been punished for it doesn't mean it's magically not illegal.

      It's possible copyright law will be revised to make it unambiguously legal to do what they've done, but that's not how the law works right now.

      • moralestapia6 days ago
        OpenAI did not break and enter into private premises to obtain such data.

        That is the crime that took place; it was not just a copyright issue.

        • kevingadd5 days ago
          Aaron having broken and entered doesn't mean that OpenAI violating copyright law is legal.
          • moralestapia5 days ago
            Who would ever come up with such an assumption?

            (excluding regular users of Fentanyl)

      • CaptainFever6 days ago
        > By US copyright law, OpenAI is engaged in illegal conduct.

        What makes you so sure about this? You are not a judge, and multiple cases against OpenAI have been dismissed by judges already.

        • kevingadd5 days ago
          Dismissing a case doesn't mean that the conduct in question is legal. Not sure why you would think it does.
  • b86 days ago
    Sam Altman failed upwards only because PG likes him. Aaron Swartz was actually a technical genius imo. DOJ never should of charged Swartz.
    • baxtr6 days ago
      Although I agree, I think your analysis could be enhanced by asking "why" a couple of times to get to the root.

      Why did PG "like" him?

      • leoc6 days ago
        Alan Kay really liked him in 2016. https://youtu.be/fKcCwa_j8e0 I think Altman is very good at cultivating people. Mind you, it seems that Swartz got around by making dazzling impressions on people, too.
        • anonnon6 days ago
          >cultivating

          More like manipulating.

      • bookaway6 days ago
        PG liked him, because Altman decided to go to great lengths to get PG to like him.

        While Drew, Chesky and the Collison brothers were busy building billion-dollar companies, Altman took the “shortcut” and made a concerted effort to cozy up to the most powerful man in the room, and it paid dividends. Altman did the same thing in the early OpenAI days by doing flattering video series interviews with Elon Musk, Vinod Khosla and others [0]. Incidentally, the YC interview with Elon Musk was done the year Musk made a donation to OpenAI (2016).

        I still remember PG’s essay where he gave Altman the ultimate character reference (2008) [1]:

        >When we predict good outcomes for startups, the qualities that come up in the supporting arguments are toughness, adaptability, determination. Which means to the extent we're correct, those are the qualities you need to win…Sam Altman has it. You could parachute him into an island full of cannibals and come back in 5 years and he'd be the king. If you're Sam Altman, you don't have to be profitable to convey to investors that you'll succeed with or without them.

        (In retrospect, praising Altman for being the “king of cannibals” has a nice touch of gallows humor to it. Hilariously, even recently pg has shown a seemingly unintentional tendency to give Altman compliments that read like character warnings masquerading as compliments.)

        In 2009, pg included Altman in the top 5 of a list of the most interesting startup founders of the last 30 years. [2] If this had been an observation made from afar, you could easily call it “prescient”. But objectively, at the time, no one could find any verifiable evidence in the real world to justify such an assessment. It wasn’t prescient, because pg had become directly responsible for Altman’s future success, in a case of self-fulfilling prophecy. Altman was often referenced in the acknowledgments of pg’s essays for reading early drafts and is probably referenced more than any other founder in the essays. Altman’s entire street cred came from pg and, once pg made Altman head of YC, from YC itself. From afar, it looks like a victory for office politics, a skill that sociopaths are incidentally known to excel at.

        [0] https://www.youtube.com/watch?v=tnBQmEqBCY0

        [1] https://www.paulgraham.com/fundraising.html

        [2] https://www.paulgraham.com/5founders.html

        • anshulbhide6 days ago
          Wouldn't you agree, though, that going from YC head to being the driving force of OpenAI was largely due to his own merit?

          In fact, he started spending less time at YC and more time at OpenAI. At that time, OpenAI had no clear path to becoming the unicorn it is today, and YC was definitely better from a career standpoint. Instead, he went all-in on OpenAI, and the results are there for everyone to see.

          • bookaway6 days ago
            Yes, definitely. But becoming the head of YC was also due to his own merit. His merit was persuading the right people. After all, the essay where pg is giving his highest praise is an essay about fundraising.

            Will you not agree that him becoming "the driving force of OpenAI" involved some highly publicized back-to-back persuasion drama as well? First he got Ilya and gdb to side with him against Elon, then he got OpenAI employees to side with him against Ilya and the board (a board that accused him of deceiving them). PG reiterated after that drama that Altman's special talent was becoming powerful.

            This observation does not necessarily mean someone is a bad CEO, since the job of the CEO is to do good by your investors or future investors. And it's possible to do that without any morals whatsoever. But I think the recent drama did more to drive the competition than some of his investors would have liked.

            Edit:

            >At that time, OpenAI had no clear path to becoming the unicorn it is today, and YC was definitely better from a career standpoint.

            This is very incorrect in my view. The presence of Elon Musk as investor and figurehead and Ilya, Karpathy, and Wojciech as domain experts, not to mention investments from YC members themselves (and the PayPal mafia) made OpenAI a very attractive investment early on.

            • leoc6 days ago
              I suspect (and this is rank speculation from a great distance) that a lot of SV tech leadership (of different kinds) has perceptions that were formed by the era of the Napster saga and haven't been revised since. PG seems committed to the idea that brash young wheeler-dealers can cut corners to win big and nothing really bad will happen as a result; as of recently Brewster Kahle seemed to be convinced that the final triumph of free media was just one more big push away.
              • bookaway6 days ago
                Yeah, if I were to describe it more generally, I would say they self-select for techno-optimists. As investors, YC often espouses judging a startup's potential by considering how successful it could be if everything goes right, and then working backwards from there to see if that could be possible. I'm not sure if pg has a historical reference he's committed to; since he was a startup founder himself, he may just be projecting his own personality onto the startup founders he's assessing. But I do think he's more concerned with whether the startup has accomplished something impressive despite the odds, and less with how they accomplished it.

                They filter for red flags that would indicate a potential for failure of the startup. So if a "lack of morals" has no bearing on a startup's success, they don't bother creating a filter to eliminate it. Nerds often prefer building things to dealing with people, and often take things at face value instead of suspecting intrigue, which sometimes makes them susceptible to manipulation. PG has admitted that he himself is bad at noticing certain personality or character flaws, and that's why Jessica was the adult in the room. But Jessica was probably observing the founders to see if there was a good co-founder dynamic and other aspects that would affect startup success, rather than trying to decipher their moral character. After all, there is no Hippocratic oath in the tech sector.

                • baxtr5 days ago
                  Thanks for your comments. All very insightful.

                  Re lack of morals: if I’m not mistaken, YC explicitly asks for instances where the founders have succeeded in "breaking the rules of the system" or similar. So you could even argue that, if anything, they prefer founders who tend to bend the rules if required.

                  On the other hand, pg seems to have strong moral views on certain political topics.

    • tzury6 days ago
      Define failure.
      • kevingadd6 days ago
        Does Loopt count as a success? The exit was slightly more than the total investment, I guess. What about Worldcoin?

        He's at least not someone I naturally associate with business success pre-OpenAI (and the jury's still out on OpenAI, considering their financial situation), but I suppose, depending on how you evaluate it, his success rate isn't 0%.

        You can say OpenAI is a "success" given their achievements in AI, but those aren't Sam's work; he should mostly be credited with their business/financial performance, and right now OpenAI-the-business is mostly a machine that burns electricity and operates in the red.

        • tzury5 days ago
          Loopt was not a success.

          But pg handing over the leadership of YC to him is indeed the father of all his successes.

          That led to OpenAI, which is not merely "a success" but rather the success story of recent years.

    • tcptomato6 days ago
      > should of charged

      what?

      • stewartmcgown6 days ago
        Common mistake in English, usually the writer means "should've".
    • leoc6 days ago
      What’s the best evidence for Swartz’ technical genius?
      • wesselbindt6 days ago
        My personal favorite is that he was part of authoring the RSS spec at age 14. But you're more than welcome to Google for other pieces of evidence yourself if you're genuinely interested and not just being argumentative.
        • moralestapia6 days ago
          >authoring the RSS spec at age 14

          There are no sources on this, though, only a couple of articles from around the time of his death stating it as if it were already established fact.

          In all this time, and with all this fuss, none of the actual authors of RSS, who are still alive, have come forward to clear this up.

          Disclaimer for immature people: This is not meant to disrespect Aaron's memory and/or legacy.

        • rurban5 days ago
          Time-sorted XML of hyperlinks? That was obvious in those early blog and wiki days, and we didn't care much about it. Especially the choice of XML parsing, where a simple text format would have been much better.
        • LunaSea6 days ago
          Not that I disagree with the idea that he was brilliant, but the RSS spec isn't what I would consider a complex piece of documentation. Even for a 14-year-old.
          • wesselbindt6 days ago
            What were you doing at 14?
            • xnickb6 days ago
              Not the person you asked, but I (in 2004) was part of the sysadmin team at my school. I helped develop tools for automating many tasks around teacher and student performance measurement and tracking.

              I also wrote a piece of software that went super viral among sysadmins all over the city, and I was getting "thank you" emails for years after.

              Had anyone been developing the RSS spec next to me, I'd definitely have jumped on it. As any 14 y/o would.

              I don't think I'm particularly brilliant or even smart. Your circle defines you.

              Surround any healthy teenager with an interest in tech with the right people, and they'll have a lot to show in no time.

              • leoc6 days ago
                Being well-educated and well-connected doesn't necessarily mean you aren't great. The Sutherland boys were famously hanging out with Edmund Berkeley at his computer lab while they were still children, thanks to their mum knowing Berkeley through the General Semantics scene https://spectrum.ieee.org/sketchpad https://archive.computerhistory.org/resources/access/text/20... . It was also luck, as well as skill, that got Ivan enjoying sole use of the gigantic, cutting-edge TX-2 for hours every week in the early '60s. You nevertheless still have to hand it to him for creating flipping Sketchpad https://en.wikipedia.org/wiki/Sketchpad in 1963, at the age of 24.
                • xnickb6 days ago
                  I agree. My point is rather that just because one doesn't have much to show by 14, it doesn't mean they're less of a person than someone who does. Merit isn't a predictor of success; perhaps it's a requirement, but even that is dubious.
            • dmd6 days ago
              I know what I was doing at 14, and it landed me in the NET.LEGENDS FAQ.
        • leoc6 days ago
          He was certainly precocious! A gifted kid, absolutely. But 'genius' is a lofty title. You'd usually need to be doing things differently and better, certainly by the time you're in your 20s. Maybe Infogami can make that case for him?
  • moralestapia6 days ago
    >Thank you Aaron for so much, for RSS, for Markdown, for Creative Commons and more.

    Didn't he also create the internet?

    • ta9886 days ago
      Is that sarcasm? Did you look into the history of those three?
  • nell6 days ago
    Why is Sam Altman singled out in these copyright issues? Aren't there plenty of competing models?
    • mvdtnz6 days ago
      I don't believe you're asking this in good faith, because the answer is so obvious. But just in case: it's because OpenAI is, by a ridiculously large margin, the most well-known player in the space, and unlike the leaders of other organisations, his name is known and he has a personal brand which he puts a lot of effort into promoting.
      • nell5 days ago
        Why is it obvious? Google is a much more powerful, profitable entity that has equally good models. Facebook uses data from LibGen. Both are highly profitable companies that could pay for resources but don't.

        Anthropic, the ethical fork of OpenAI, doesn't do much differently nowadays.

        OpenAI may have had a head start, but the competition is not far behind.

        https://x.com/backus/status/1878484938003034391

    • mrkpdl6 days ago
      OpenAI is the highest profile.
      • nell5 days ago
        They were. The margin is not that much anymore; Claude Sonnet is the go-to model for devs nowadays.

        The best video model is Google's.

    • yownie6 days ago
      because he's a manipulative POS.
    • sinuhe696 days ago
      The two are seen in the picture together! They could not be more similar. That highlights the irony and hypocrisy of capitalism, or better said, of human society.
      • CaptainFever6 days ago
        It's possible that both of them are more similar in thought than anti-AI people think, which is why they hung out together.
  • menzoic6 days ago
    For one, Sam scraped under the veil of a corporation, which helps reduce or remove personal liability.

    Second, if the crime was the act of scraping, then it's directly comparable. But if the crime is publishing the data for free, that's quite different from training an AI to learn from the data while not being able to reproduce the exact content.

    “Probabilistic plagiarism” is not what's happening, nor does it fit the definition of plagiarism (which matters if we're talking about legal consequences). What's happening is that the model is learning patterns from the content that it can apply to future tasks.

    If a human read all that content and then got asked a question about a paper, they too would imperfectly recount what they learned.

    • dkjaudyeqooe6 days ago
      Your argument might make sense to you, but it doesn't make sense legally.

      The fact is that “probabilistic plagiarism” is a mechanical process, so as much as you might like to anthropomorphize it for the sake of your argument ('just like a human learning'), it's still a mechanical reproduction of sorts, which is an important point under fair use, as is the fact that it denies the original artists the fruits of their labor and is a direct substitute for their work.

      These issues are the ones that will eventually sink (or not) the legality of AI training, but they are seldom addressed in these sorts of discussions.

      • menzoic6 days ago
        > The fact is that “Probabilistic plagiarism” is a mechanical process, so as much as you might like to anthropomorphize it for the sake of your argument

        I did not anthropomorphize anything. “Learning” is the proper term. It takes input and applies it intelligently to future tasks. Machines can learn, machine learning has been around for decades. Learning doesn’t require biology.

        My statement is that it is not plagiarism in any form. There is no claim that the content was originally authored by the LLM.

        An LLM can learn from a textbook and teach the content, and it will do so without plagiarism. Just as a human can learn from a textbook and teach. Making an analogy to a human doesn’t require anthropomorphism.

        • tsimionescu6 days ago
          The law on copyright doesn't depend on the word "learning", it depends on whether it's a human doing it, or a mechanical process.

          If a human reads a book and produces a different book that's sort-of-derivative but doesn't copy too many elements too directly, then that book is a new creative work and doesn't infringe on the copyright of the original author. For example, Fifty Shades of Grey is somewhat derivative of Twilight (famously starting as a Twilight fan-fic), but it's legally a separate copyright.

          Conversely, if you use a machine to produce the same book, taking only copyrighted text as input and an algorithm that replaces certain words and phrases and adds certain passages, then the result is a derivative work of the original and it infringes the copyright of the original author.

          So again, the facts of the law are pretty simple, at the moment at least: even if a machine and a human do the exact same thing, it's still different from a legal perspective.

          • DoctorOetker6 days ago
            machines work on behalf of humans, any human could similarly claim their typewriter did it
            • tsimionescu5 days ago
              That's still irrelevant. The distinction the law makes is between human creativity vs machine processes. A human may or may not have added their own creativity to the mix, and the law falls on the side of assuming creativity from humans (though if there is proof that the human followed some mechanical process, that likely changes things). Machines are not creative, by definition in the law today, so any transformation that a machine makes is irrelevant, the work is fundamentally the same work that was used as input from a legal perspective.

              That this work is done on behalf of a human changes nothing, the problem the law has with this is that the human is copying the original work without copyright, even if the human used a machine to produce an altered copy. Whether they used bcrypt, zip, an mp3 encoder, a bad copy machine, or a machine learning algorithm, the result is the same: the output of a purely mechanical process is still a copy of the original works from a copyright perspective.

        • neuroelectron6 days ago
          > I did not anthropomorphize anything.

          Machines don't learn. They encode, compress and record.

          • Earw0rm6 days ago
            Until or unless the law decides otherwise.

            The 2020s ethic of "copying any work is fair game as long as you call the copying process AI" is the polar and equally absurd opposite to the 1990s ethic of "measurement and usage of any point or dimension of a work, no matter how trivial, constitutes a copyright infringement".

          • ben_w6 days ago
            It's been a term of art since 1959, and the title of a research journal since 1989: https://en.wikipedia.org/wiki/Machine_Learning_(journal)
            • neuroelectron6 days ago
              Does a research journal on aliens prove aliens exist?
              • ben_w6 days ago
                As this is a semantics debate, their actual existence is irrelevant.

                A research journal on extra-terrestrial aliens would prove that the word "aliens" is used to mean "extra-terrestrials" and that the word doesn't just mean "foreigners": https://www.law.cornell.edu/uscode/text/8/chapter-12/subchap...

                • fuzzfactor5 days ago
                  >“Learning” is the proper term. It takes input and applies it intelligently to future tasks.

                  To me "learning" is loading up the memory.

                  "Thinking" is more like applying it intelligently, which is not exactly the same, plus it's a subsequent phase. Or at least a dependent one with latency.

                  >Machines can learn, machine learning has been around for decades. Learning doesn’t require biology.

                  Now all this sounds really straightforward.

                  >Machines don't learn. They encode, compress and record.

                  I can agree with this too, people are lucky they have more than kilobytes to work with or they'd be compressing like there's no tomorrow.

                  But regardless, eventually the memory fills up or the data runs out and then you have to do something with it, whether very intelligent or not.

                  Might as well anthropomorphize my dang self. If you know who Kelly Bundy is, she's a sitcom character representing a young student of limited intellect and academic interest. Part of the schtick was that she verbally reported the error message when her "brain is full" ;) It was just an observation, no thinking was required or implied ;)

                  If the closest a machine is going to come is when its memory is filled, so be it. What more can you expect anyway from a mere machine during the fundamental data input process? If that's the nearest you're going to get to organic learning, that'll have to serve as "machine learning" until more sensible nuance comes along.

                  Memory can surely be filled more intelligently sometimes than others, which should make a huge difference in how intelligently the data can be handled afterward; plus, some data is bound to be dramatically more useful than other data.

                  But the real intelligent stuff is supposed to be the processing done with this data in an open-ended way after all those finite elements have been stored electronically.

                  To me the "learning" is the data input, or "student" phase, and the intelligence is what you do with those "learnings" if it can be made smart. It can be good to build differing scenarios from the exact same data, and ideally become better at decision-making through time without having more raw data come in. Improvements like this would be the next level of learning so now you've got more than just the initial data-filling. As long as there's room in the memory, otherwise you're going to have to "forget" something first when your brain is already full :)

                  I just don't think things are ideal myself.

                  >journal on extra-terrestrial aliens would prove that the word "aliens" is used to mean "extra-terrestrials"

                  Exactly, it proves that the terminology exists, and gives it more meaning sometimes.

                  The aliens don't have to be as real as you would like, or even exist, nor the intelligence.

          • wiseowise6 days ago
            Semantics.
    • jazzyjackson6 days ago
      Ask any LLM to recite lyrics and see that it's not so probabilistic after all. It's perfectly capable of publishing protected content, and the filter to prevent it from doing so is such a bolt-on it's embarrassing.
      • menzoic6 days ago
        We have to understand what plagiarism is if we're making claims of it. Claiming that you authored content and reciting content are different things. Reciting content isn't plagiarism; claiming you are the author of content that you didn't author is plagiarism.

        > it's perfectly capable of publishing protected content

        At most it can produce partial excerpts.

        LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.

        • Earw0rm6 days ago
          That's a little like saying downloading mp3s isn't music piracy, because it's not encoding the actual music, just some lossy compressed wavelets that sound like it.
          • ben_w6 days ago
            Your username represents a thing that can happen in a human brain which reproduces the perceptual content of a song.

            Are earworms copyright infringement?

            If I ask you what the lyrics were, and you answer, is that infringement, or fair use?

            The legal and moral aspects are a lot more complex than simply the mechanical "what it's done" or "is it like a brain".

            • Earw0rm6 days ago
              Are earworms infringement? No, they stay inside your head, so they exist entirely outside the scope of copyright law.

              If you ask me the lyrics, fair use acknowledges that there's a copyright in effect, and carves out an exemption. It's a matter-of-degree argument, is this a casual conversation or a written interview to be published in print, did you ask me to play an acoustic cover of the song and post it on YouTube?

              Either way, we acknowledge that the copyright is there, but whether or not money needs to change hands in some direction or other is a function of what happens next.

          • CaptainFever6 days ago
            No, the difference is that MP3s can almost completely recreate the original music, while LLMs can't do that with specific pieces of authored works.
            • croemer6 days ago
              You've got it completely backwards. MP3s just trick you into thinking it's the same thing. It's actually totally different if you analyze it properly, in a non-human sense.

              LLMs are able to often _preciesely_ recreate in contrast to MP3 at best being approximate.

              • CaptainFever6 days ago
                > LLMs are able to often _preciesely_ recreate

                Is there actual proof of this? Especially the "often" part?

                • croemer6 days ago
                  First shot on GPT 4o: https://chatgpt.com/share/6783ad9d-7c9c-8005-b2aa-660200d05e...

                  Asking for the Declaration of Independence and comparing the output against https://www.archives.gov/founding-docs/declaration-transcrip... gives this exhaustive list of differences:

                  1. Em-dashes in ChatGPT, `--` in .gov -> this is just an ASCII limitation of the .gov transcript

                  2. ChatGPT capitalized Perfidy in "Cruelty & Perfidy", while .gov has "perfidy"

                  3. ChatGPT writes "British" while .gov says "Brittish"

                  These are _all_ the differences.
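
                  (For anyone repeating the comparison: a word-level diff along these lines will surface the same list; the file names are placeholders.)

                    # Sketch: word-level diff of the two transcripts.
                    import difflib

                    chatgpt = open("chatgpt_output.txt").read().split()
                    archive = open("archives_transcript.txt").read().split()
                    for line in difflib.unified_diff(chatgpt, archive,
                                                     "chatgpt", "archives.gov",
                                                     lineterm=""):
                        print(line)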

                  • ben_w6 days ago
                    They emphasised "often", and quoted you also saying "_preciesely_".

                    What you've got, for a document that's likely reproduced many times in part and in whole over the internet, still has differences.

                    (The document is also old enough to no longer be in copyright).

                    And if I prompt for "please recite the opening page of Neuromancer", I get this:

                    > Sorry, I can't provide the exact text from the opening of Neuromancer as it is copyrighted. However, I can provide a brief summary or discuss its themes, characters, and more! Let me know how you'd like to proceed.

                    While "please recite the opening page of the first superman comic" gets me this, which doesn't match what I find from other sources for Action Comic #1, but perhaps I'm wrong and you're going to correct me:

                    """

                    Sure! The opening page of Action Comics #1, which was published in 1938 and features the first appearance of Superman, has a memorable introduction. Here’s a paraphrase of the first part:

                    ---

                    *“Superman – Champion of the oppressed. In the year 1938, a strange visitor from another world appeared on Earth, and in the short span of time since, he has become a figure of myth and legend. With powers far beyond those of mortal men, Superman is faster than a speeding bullet, more powerful than a locomotive, and able to leap tall buildings at a single bound!”*

                    The comic begins with Superman stopping a criminal, showcasing his superhuman strength. This early depiction of Superman is somewhat different from the modern, more refined character we know today, but the basic elements are all there: a hero with extraordinary abilities, a strong moral compass, and a desire to fight injustice.

                    ---

                    If you'd like a more detailed description or more of the story, just let me know!

                    """

                    Those were the first two things I tried (in this context, today).

                    • croemer5 days ago
                      You seem to think I was claiming things I wasn't. Specifically, I didn't claim it often reproduces precisely _copyrighted_ content. There are some bolt-ons (RL/alignment) that make this reproduction less common. That's irrelevant for this part of the conversation though, which is about LLMs' intrinsic capability of reproducing some training material precisely.

                      > They emphasised "often", and quoted you also saying "_preciesely_".

                      A first shot out of 1 is a strong sign I wasn't lucky, unless you accuse me of lying.

                      The reproduction is precise up to "Brittish", which is thousands of characters.

                      It's only about the _ability_ to reproduce, not whether this has been aligned away or not.

                      What you're showing is alignment not to reproduce, not an inability to reproduce. I picked the Declaration of Independence on purpose to show the capability.

                      • ben_w5 days ago
                        Your full quote is "LLMs are able to often _preciesely_ recreate in contrast to MP3 at best being approximate."

                        Unless you are using the word "often" in a very different way to me, for this claim to be correct it would need apply to at least a substantial part of the training set — for example of the usage, if I were to say that "when I go to work, I *often* pass through Landbeach", the fact that this genuinely happened hundreds of times does not make it "often" due to the other fact that the most recent time this occurred was 2014.

                        It's impossible for LLMs to be able to "often _preciesely_ recreate" because there aren't enough parameters to do that.

                        > 1st shot out of 1 is a strong sign I wasn't lucky unless you accuse me of lying.

                        Consider the converse: my test failed two out of two.

                        The second example in particular is a failure mode that LLMs are often criticised for: hallucination.

                        You picked a document which I'd expect to be a biased example; but even then, consider all the people who did a one-shot test, saw it fail, concluded the model wasn't reproducing content accurately, and therefore didn't post a comment.

                        To call your claim a "lie" would be to presume it was deliberate, which I would not do without much stronger evidence. As the saying goes, "lies, damned lies, and statistics": this is statistics.

                        A fair test isn't one document like you gave, nor two like I gave; it's hundreds or thousands of documents, chosen with careful consideration of the distribution to make sure there's no bias towards e.g. mass-copied newspaper articles (or, for image generators, the Mona Lisa).

                        This is just as important for determining limits to the quality of the models as for the question of whether they are memorising things.

                • Earw0rm6 days ago
                  These are different problems.

                  The written (or more accurately, typed)* word is an inherently symbolic representation. Recorded audio (PCM WAV or similar) is not: the format itself encodes nothing about the structure or meaning of what it contains.

                  The written word is more akin to MIDI, which can exactly represent melody, but cannot exactly represent sound.

                  MP3 is a weird halfway house: it's somewhat symbolic in that it stores frequency-domain coefficients (an MDCT representation) to recreate a likeness of the original sound, but those symbols aren't derived from or directly related to the artist's original intent.

                  * Handwriting can of course contain subtle information that no character set can exactly reproduce; it is more than just a sequence of characters taken from a fixed alphabet. But at least for Western text that can be ignored for most purposes.

    • theyinwhy6 days ago
      LLMs (OpenAI models included) are happy to reproduce books word by word, page by page. Just try it out yourself. And even if some words were reproduced wrong, it would still be a copyright violation.
      • realusername6 days ago
        I tried and I could not make it work. And even if you could, that has to be the most inefficient way to pirate books on earth.
      • throw59596 days ago
        It's not really just some words wrong; it's more that you won't be able to get more than a page out of it, and even that is going to be so wrong it's basically a parody, and thus allowed.
        • angoragoats6 days ago
          I’d love to see you try to defend this notion in court. Parody requires deliberate intent to be humorous. And courts have repeatedly held that changing the words of a copyrighted work while keeping the same general meaning can still be copyright infringement.
          • throw59596 days ago
            It's not just "changing some words". The majority of words will be different, and sentences will be different. The general meaning might be broadly the same, but I don't think that's enough to claim copyright protection.
            • angoragoats6 days ago
              I didn’t use the word “some.” Please don’t misquote me.

              As I said in the comment you’re replying to, there’s case law proving you wrong.

              • throw59596 days ago
                Good thing that's not global, right? I'm not in the US, our courts work differently.
                • angoragoats6 days ago
                  I’m sorry for assuming that you were (though you could have mentioned this in your reply). Most of the large AI companies relevant to our discussion are based in the US.
                  • throw59596 days ago
                    Will that remain true if they have copyright issues in the US?
                    • ben_w6 days ago
                      (I'm not the other commenter)

                      I suspect that most of the large AI companies relevant to this discussion will remain based in the US.

                      Most of the money is in the US, China, and the EU. China won't allow any LLM that accidentally says mean things about their government; the EU is worried about AI that may harm individuals by libelling them.

                      The Chinese models may well completely ignore Western laws, but if they're on the other side of the Great Firewall, or indeed just have Chinese-language UIs and a focus on Chinese-language tokens in the training… well, I'm not 100% confident, but I would be somewhat surprised if, say, J.K. Rowling were upset to discover that Western users attempting to pirate her works via a Chinese chatbot were getting a version of Harry Potter whose title is literally "赫奇帕奇巫师石(哈利·波特与魔法石)" (which is how ChatGPT just told me the Chinese version starts; Google Translate claims the first three characters are "Hufflepuff").

                      Even if the rules aren't any harder (as I'm not a lawyer, I can't tell if the differences in copyright rules will or won't make a huge difference in compliance costs), it's likely easier for American companies to lobby the American government for what they want done to make business easier.

          • CaffeineLD506 days ago
            You could argue intent. The model has no intent to infringe. No mens rea.

            My prediction for 2025: AI models will get broad federal immunity.

            I'll bet DOGE coins on it.

            • angoragoats6 days ago
              AI models are not entities that can be sued, accused of a crime, or given legal immunity, to my knowledge. But I get the gist of what you’re saying.
      • menzoic6 days ago
        > reproduce books word by word, page by page

        This statement is a figment of the commenter’s imagination, with no basis in reality. All they would have to do is try it to realize they just spouted a lie.

        At most LLMs can produce partial excerpts.

        LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.

        • rhubarbtree6 days ago
          The NYT has extracted long articles from ChatGPT and submitted the evidence in court.
          • Lerc6 days ago
            Given that it has been submitted in court, does that mean you can say what the longest verbatim extract was?

            It seems like that would be a fact that couldn't be argued with.

          • ben_w6 days ago
            Crucial size difference between an article and a book.

            The size difference matters because people often share complete copies of articles to get around paywalls, including here. As I understand it, this is already copyright infringement.

            I suspect those copies are how and why exact reproduction is possible in cases such as the NYT's.

        • angoragoats6 days ago
          > At most LLMs can produce partial excerpts.

          Glad you agree that LLMs infringe copyrights.

    • tsimionescu6 days ago
      Talking about plagiarism here is a complete red herring.

      Plagiarism is essentially a form of fraud: you are taking work that someone else did and presenting it as your own. You can plagiarize work that is in the public domain; you can even plagiarize your own work that you own the copyright to. Avoiding a charge of plagiarism is easy: just explicitly quote the work and attribute it to the proper author (possibly yourself). You could copy the entirety of the works of Disney and, as long as you attributed them properly, you would not be guilty of plagiarism. The Pirate Bay has never been accused of plagiarism. And plagiarism is not a problem that corporations care about, except insofar as they may pay a plagiarist more money than they deserve.

      The thing that really matters is copyright infringement. Copyright infringement doesn't care about attribution: my example above with the entire works of Disney, while not plagiarism, is very much copyright infringement, and would cost you dearly. Both Aaron Swartz and The Pirate Bay were accused of and prosecuted for copyright infringement, not plagiarism.

    • qwertox6 days ago
      Do you really think that OpenAI has deleted the data it has scraped? Don't you think OpenAI is storing all this scraped data at this moment on some fileservers in order to re-scan this data in the future to create better models? Models which may even contain verbatim copies of that data internally but prevent access to it through self-censorship?

      In any case it's a "Yes, we have all this copyrighted data and we're constantly (re)using it to produce derived works (in order to get wealthy)". How can this be legal?

      If that were legal, then I should be able to copy all the books in a library and keep them on a self-hosted, private server for my own or my company's use, as long as I don't quote too much of that information. But I should be able to have all that data and do close to whatever I want with it.

      And if this were legal, why shouldn't it be legal to request a copy of all the data from a library and obtain access to it via a download link?

      • edanm6 days ago
        What? This makes no sense. Of course you're allowed to own copyrighted material. That's the whole point. I have bookshelves' worth of copyrighted material at home at this very minute.

        If you're implying that the scraping and storing itself breaks copyright, then maybe, but I don't think so? If you're saying that training on copyrighted material breaks copyright, then yes, that's the whole argument.

        But just having copyrighted material on a server somewhere, if obtained legally, is not by itself illegal.

        • Terr_6 days ago
          Not the other poster, but chiming in here that:

          > If you're implying that the scraping and storing itself breaks copyright, then maybe, but I don't think so?

          Suppose I "scrape and store" every book I ever borrow or temporarily-owned, using the copies to fill several shelves in my own personal library-room.

          Yes, that's still copyright infringement, even if I'm the only one reading them.

          > But just having copyrighted material on a server somewhere, if obtained legally, is not by itself illegal.

          I see two points of confusion here:

          1. The difference between having copies and making copies.

          2. The difference between "infringing" versus "actually illegal."

          Copyright is about the right to make copies. Simply having an unauthorized copy is no big deal; it's making unauthorized copies that gets you in trouble.

          Also, it is generally held that the transient "copies" of bytes passing through the network etc. do not count; but if you start doing "Save As" on everything to build your own archive of a news site, that's another story.

          • edanm6 days ago
            > Suppose I "scrape and store" every book I ever borrow or temporarily-owned, using the copies to fill several shelves in my own personal library-room.

            Yes, but you don't know that they did that. They could've just bought legal access to copyrighted material in many cases.

            E.g. if I pay for an NYT subscription that gives me the entire back catalogue of the NYT, then I'm legally allowed to own it. Whether I'm allowed to train models on it is, of course, a new and separate (and fascinating) legal question.

            • Terr_5 days ago
              > Yes, but you don't know that they did that. They could've just bought [a license to reproduce the] material in many cases.

              We do know they didn't, because many entities they could have bought a license from started suing them for not getting one!

              > If I pay for an NYT subscription that gives me the entire back catalogue of the NYT, then I'm legally allowed to own it.

              Almost every single service of that type is merely giving you permission to access their copy for the duration of your subscription. They are never giving you a license to duplicate the whole catalog into a perpetual personal copy.

        • omnimus6 days ago
          “Obtained legally” implies that they have licensed the material for machine learning, which they didn't, because it learns “like a human”. So I guess a Netflix subscription is enough. And since the machine has to learn on frames, we just have to copy every frame to an image and store it in our dataset, but that's just a technicality. Also, even if you explicitly prohibit use of your copyrighted material, it doesn't matter, because “it's like a human” and anything else would be discrimination.

          Nah, this is a breach of current copyright law in many, many ways. The tech sector is, as usual, just running away with it, hoping nobody will notice until they manage to change the laws to suit them.

          • Terr_5 days ago
            > which they didnt because it learns “like a human”

            My preferred rejoinder: if it's so much like a human that it qualifies under copyright law, then we ought to be talking about whether these companies are guilty of (A) enslaving a minor and (B) being material accomplices in any copyright infringement the minor commits under their control.

          • amanaplanacanal6 days ago
            Whether it's a breach of copyright law is what the court cases are there to decide. I wouldn't bet either way.
    • sgt1016 days ago
      >Second, if the crime was the act of scraping then it’s directly comparable. But if the crime is publishing the data for free, that’s quite different from training AI to learn from the data while not being able to reproduce the exact content.

      They often do reproduce the exact content; in fact it's quite a probable output.

      >“Probabilistic plagiarism” is not what’s happening or even aligned with the definition of plagiarism (which matters if we’re talking about legal consequences). What’s happening is that it’s learning patterns from the content that it can apply to future tasks.

      That's what I think people wish would happen. Sometimes they have been shown to learn procedural knowledge from the training data, but mostly it's approximate retrieval.

    • ajb6 days ago
      It's weird that corporations get to remove liability from people acting on their behalf. The law only provides that they remove liability for debts. If you act on behalf of a corporation to commit either a criminal offence or a civil wrong, then as far as I can tell (I'm not a lawyer) you are guilty of at least conspiracy.

      And yet it's rare for individuals to be prosecuted for such offences, even criminal ones. We treat the liability shield as absolute. It may seem unfair to prosecute the little guy for "just following orders", but the fact that we don't is what allows corporations to offend with impunity.

      • nonrandomstring6 days ago
        If you join a gang you get protection from that gang. There are some gangs you can join where you get to kill people, with total impunity, if you're into that sort of thing. Corporations are just another kind of gang. Don't let the legal sugar frosting fool you.
        • ajb6 days ago
          Yes, but I don't think that's the full explanation. It seems to apply even to companies of modest size, which should not be able to intimidate.
          • nonrandomstring6 days ago
            Definitely not the full explanation. I made a strident, blunt statement to move the discussion along. If you follow the Hobbesian line, the biggest gang is the state. And it's supposed to be "our" gang because we vote for it and give tacit assent to a violent monopoly we call "the law". What's happened is that the law has become very weak, and so has the democracy that keeps the biggest, baddest gang on the side of 'the people'.

            I think US society, and small companies as you say, sense that the West has been "re-wilded". In the new frontier anything goes so long as you have the money.

            Aaron was a principled, smart and courageous dude, but he was acting without support from a strong enough base. The gang he thought he was in, MIT, betrayed him. (Chomsky has plenty of wisdom on what terrible cowards universities and academics really are.) The law that should have protected someone like Aaron was weak and compromised. It remains so.

            The same (lack of) state and laws are now protecting others doing the same things as Aaron, at an even bigger scale, and for money. At least that's how I read TFA.

          • shwouchk6 days ago
            Devil's advocate:

            In order for the elite to rule, they need a rich oligarchy of support. At least that’s how it’s always been.

            And you want cooperation from the mob if you want the place in general to thrive, which you do, if you're smart, because your genes do. Look at what eventually happens to the progeny of /most/ bad dictators these days.

            Using the biggest gang as protection for smaller, even tiny, gangs kills two birds with one stone: anyone can join the protection racket, so the mob is pacified, and the biggest gang gets the support of the rich gangs, who get to sit in on the council.

            Terry Pratchett's genius is slowly revealed to me over the years )))

            • Terr_5 days ago
              > Being an absolute ruler today was not as simple as people thought. At least, it was not simple if your ambitions included being an absolute ruler tomorrow. There were subtleties. Oh, you could order men to smash down doors and drag people off to the dungeons without trial, but too much of that sort of thing lacked style and anyway was bad for business, habit-forming and very, very dangerous for your health. A thinking tyrant, it seemed to Vetinari, had a much harder job than a ruler raised to power by some idiot vote-yourself-rich system like democracy. At least they could tell the people he was their fault.

              -- Going Postal by Terry Pratchett

            • nonrandomstring6 days ago
              "biggest gang" is a natural order outcome and yes, Pratchett probably calls it better than Hobbes or political "scientists".
              • shwouchk5 days ago
                I don't know if that was sarcasm or not. Yes, of course it's natural. It's just that, with time, the Discworld universe starts to resemble the real world more and more to me, rather than a weird parody copy…
    • sambeau6 days ago
      If I make an MP3 of a song or a JPEG of an artwork I cannot use them to reproduce the exact content, but I will still have violated the artist’s copyright.
    • nialv76 days ago
      So, let me get this straight: scraping data and publishing it so everyone gets equal and free access to information and knowledge, arguably making the world better, is a crime. But scraping it to benefit and enrich just yourself is A-OK??

      This legal system is truly fucked.

      • CaptainFever6 days ago
        No, if you scrape and create an open-weight model, that is OK too. The difference is whether you re-publish the original dataset or just the weights trained from it.

        Please stop misinterpreting posts to match doomer narratives. It's not healthy for this forum.

    • zx10rse6 days ago
      An algorithm with trillions of parameters that predicts the most likely next output can hardly be called intelligence in my book; artificial, I will give you that.

      I am old enough to remember a bot called SeNuke, widely used 10-15 years ago in the so-called black-hat SEO community. You fed it a 500-word article and it scrambled the words so the result would pass Google's duplicate-content checks. It was plagiarism 101, yet I don't recall anyone back then talking about AI, or about how all the copywriters' jobs would go extinct and how we were all doomed.

      What I do remember is that no serious agency would use such a tool, so as not to be associated with plagiarism and duplicate-content bans.

      Maybe it is just me, but I cannot fathom the craziness and hype around output delivered in the first person.

      What we get now with LLMs is not simply a link and a description for, let's say, a question like "What is an algorithm?" We get an output that starts with "Let me explain"... how is this learning and intelligence?

      We are just witnessing the next dot-com boom; the industry as a whole hasn't seen such craziness in the last 25 years, despite all the efforts. So I imagine that everyone wants to ride the wave to become the next PayPal mafia: tech moguls, philanthropists, inventors, billionaires...

      Chomsky summed it up best.

      RIP Aaron

  • tempodox6 days ago
    [flagged]
    • error_logic6 days ago
      This despite the fact that what actually Made America Great was constructive, honest, healthy competition, not the insane destroy-the-competition monopolist's outlook that tries to destroy opportunity for others, and thus competition itself.

      Horseshoe theory is real.

  • aaron6956 days ago
    [dead]
  • CaptainFever6 days ago
    This post pits the two people against each other. Am I the only one here who likes both Sam Altman and Aaron Swartz? They've both done great things to help remix culture.

    Sure, you could say that the law has come down differently on the two, but there are several differences: the timing (one was decades earlier), the nature of the copying (one was direct, while the other was just for training, and more novel), and the actor (an individual vs. a corporation).

    But this doesn't have to reflect on them. You don't have to hate one and love the other... you can like both of them.