Edit: I think the lady on the left is Jessica Livingston and a younger PG on the right
1. zak stone, memamp
2. steve huffman, reddit
3. alexis ohanian, reddit
5. emmett shear, twitch
5. ?
6. ?
7. ?
8. jesse tov, https://www.ycombinator.com/companies/simmery
9. pg
10. jessica
11. KeyserSosa, initially memamp but joined reddit not long after (I forget his real name)
12. phillip yuen, textpayme
13. ?
14. aaron swartz, infogami at the time
15. ?
16. sam altman, loopt at the time
17. justin kan, twitch
Chris Slowe
Clickfacts had 3 founders so probably that's 3 of your ?s.
The photo has no 13 btw.
Signed,
Someone who doesn't care that you're making $$$$ from it
I understand: Aaron became a martyr, even though he died due to depression and not for “a cause”. I applaud what he achieved as a person.
What about if I’m an artist and I don’t want my work included in the training data for an image generation model?
No
You could choose not to publish, and not be read.
If you are read, you can be used to learn from.
Copyrights don’t depend on whether I choose to publish a particular work or not.
Zuckerberg personally approved using pirated content to train a model. Is that OK too?
Even Aaron Swartz, the subject of this post, made a distinction (both in his writing/talks and his actions) between academic, scientific, and judicial articles, which he believed should be free for everyone, and other copyrighted content such as books and movies. He seemed to take a softer stance on the latter.
I am not a copyright zealot by any means, but I’m trying to keep an open mind regarding the seemingly knee-jerk “abolish all copyright” takes from certain people in the AI sphere.
Hacker News's roots were not in the "original" hacking, but more in Silicon Valley for-profit entrepreneurs creating stuff for profit to, heavy quotes, "make the world a better place" and "democratize X".
The real hackers were on Usenet, /. and similar forums. Now I'm too old and haven't found a good place.
Information wants to be free; copyright is a last-millennium construct for earning money from invention. (Before copyright, people were paid by patrons to invent their stuff.)
Copyright should be left at the 20th century's door, and whoever wants to "monetize" their intellect should use methods fit for the 21st.
What I mean is, all laws are fake, but while we have to follow some collective hallucination about magic words giving people authority—to keep society together—the specific delusion that we should enforce artificial scarcity by extending monopolies to patterns of information is uniquely silly.
Walks like a duck and quacks like a duck; I think it's a duck
Once a model is trained, the weights are static unless you train it again. This is a huge practical difference between LLMs and humans. I need to give the LLM the context I want it to consider every single time I use it, because it doesn’t learn from past use at all, it just produces a plausible string of words based on its constant, unchanging set of weights.
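To make that concrete, here's a minimal sketch assuming the official OpenAI Python client (the model name is just an example): because the weights behind the endpoint are frozen, any "memory" has to be replayed in the messages list on every call.

    # Minimal sketch: the model's weights never change between calls, so the
    # only "memory" it has is whatever context we pass back in ourselves.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    history = [{"role": "user", "content": "My favorite color is teal."}]
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    history.append({"role": "assistant",
                    "content": reply.choices[0].message.content})

    # A second call that omits the history gets no benefit from the first:
    fresh = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is my favorite color?"}],
    )
    print(fresh.choices[0].message.content)  # the model can only guess

Nothing about the first exchange persists in the weights; carrying `history` forward across calls is entirely the caller's job.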
At least you're acknowledging that training rights are a proposed expansion of current IP laws!
Not always. That’s more the domain of patents, honestly.
> AI does not replicate your work directly.
This is false, at least in certain cases. And even if it were always true, it doesn’t mean it doesn’t infringe copyright.
> At least you're acknowledging that training rights are a proposed expansion of current IP laws!
Yes, they are, emphasis on the “proposed,” meaning that I believe that training AI on copyrighted material without permission can be a violation of current copyright law. I don’t actually know if that’s how it should be, honestly. But as of right now I think entities like the New York Times have a good legal case against OpenAI.
I think fundamentally we have a difference in opinion on what copyright is supposed to be about. I hold the more classic hacker ideal that things should be free for re-use by default, and copyright, if there is any, should only apply to direct or almost-direct copies up to a very limited time, such as 10 years or so. (Actually, this is already a compromise, since the true ideal is for there to be no copyright laws at all.)
Though... if you say "And the result, despite many years and billions of dollars worth of work, is something that can’t even reliably reference or summarize the material it’s trained on.", doesn't this imply that there's not much to worry about here? I sense that this is a negative jab, but this undermines the original argument that there is so much worth in the model that we need to create new IP laws to handle it.
I mean, I'm not sure what to make of this statement in the first place. Training data should be for the model to learn language and facts, and referencing or summarizing the material directly seems to be out of scope of that. One tends to summarize the prompt, not training data.
No one argued this, to my knowledge. I think that there might be a need for new copyright laws, but the alternative in my mind is that we decide there's not a lot of worth there, meaning that we do nothing, and what OpenAI/Meta/MS/Google/Anthropic/etc are doing is simply de jure illegal. The statement I made about LLMs having major flaws is a point in support of this alternative.
> Training data should be for the model to learn language and facts, and referencing or summarizing the material directly seems to be out of scope of that.
I strongly disagree, as your prompt can (and for a certain type of user, often does) contain explicit or implicit references to training data. For example:
* Explicit: “What is the plot of To Kill a Mockingbird by Harper Lee?”
* Implicit: “How might Albert Einstein write about recent development X in physics research?”
If you take the evolution of those platforms from, say, 2005-2015, and project forward ten years, we should be in a much better place than we are. Instead they've gone backwards as a result of enshittification and toxic management.
He put a laptop in a wiring closet that was DoSing JSTOR and kept changing IPs to avoid being blocked. The admins had to put a camera on the closet to eventually catch him.
He might have had good intentions but the way he went about getting the data was throwing soup at paintings levels of dumb activism.
For all the noise, the real punishment he was facing was 6 months in low security [1]. I'm pretty sure OpenAI would have also been slapped hard for the same crime.
[1] https://en.wikipedia.org/wiki/Aaron_Swartz#Arrest_and_prosec...
Edit: added link
I didn't think people on “hacker news” would be defending what happened to Aaron Swartz.
Any lawyer knows that is stupid math. The DOJ has sentencing guidelines that never add up the years in prison for charges to be served consecutively. The media likes to do that to get big numbers, but it isn’t an honest representation of the charges.
I don’t think charges against Swartz should have been filed, but I also can’t stand bad legal math.
Because some people really wanted to punish him.
I am just reacting to the downplaying that he would only get 6 months in jail. As if he were some weak person for committing suicide because of that.
(I'm ambivalent about everything in this case and certainly don't support the prosecutors, but much of what gets written about Swartz's case is misinformation.)
CommonCrawl archives robots.txt
For convenience, you can view the extracted data here:
You are welcome to verify for yourself by searching for “wiki.diasporafoundation.org/robots.txt” in the CommonCrawl index here:
https://index.commoncrawl.org/
The index contains a file name that you can append to the CommonCrawl URL to download and view the archive (a small Python sketch of this lookup follows the robots.txt snippet below). More detailed information on downloading archives here:
https://commoncrawl.org/get-started
From September to December, the robots.txt at wiki.diasporafoundation.org contained this, and only this:
>User-agent: *
>Disallow: /w/
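For anyone who wants to script the verification, here's a minimal sketch in Python (the crawl name CC-MAIN-2024-51 is just an example; any crawl listed at https://index.commoncrawl.org/ should work, and the requests library is assumed):

    # Query the CommonCrawl CDX index for captures of the robots.txt,
    # then locate the archived bytes on data.commoncrawl.org.
    import json
    import requests

    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-51-index"
    resp = requests.get(
        INDEX,
        params={"url": "wiki.diasporafoundation.org/robots.txt", "output": "json"},
        timeout=30,
    )
    for line in resp.text.strip().splitlines():
        rec = json.loads(line)  # one JSON record per capture
        # Each record names the WARC file plus the byte range of the capture;
        # appending rec["filename"] to https://data.commoncrawl.org/ and
        # requesting bytes offset..offset+length-1 returns the gzipped record.
        print(rec["filename"], rec["offset"], rec["length"])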
What Aaron was trying to achieve was great; how he went about it is what ruined his life.
The school should have unplugged his machine, brought him in for questioning, and told him not to do that.
Would that I were that kind of dumb.
Throwing soup at paintings doesn’t make the paintings available to the public.
What he did had a direct and practical effect.
The main impact of Aaron Swartz’s actions was that it became much more difficult to walk onto MIT’s campus and access journal articles from a laptop without being a member of the MIT community. I did this for a decade beforehand, and it became much more locked down in the years after his actions due to restrictions the publishers pushed at MIT. Aaron intentionally went to the more open academic community in Cambridge (Harvard, his employer, was much more restrictive) and in the process ruined that openness for everyone.
https://hn.algolia.com/?sort=byDate&type=comment&dateRange=a...
(yes, the same applies to anyone else)
The "take comments meanings at their best value " keeps being eroded. And comments sections for these type of stories get full of them.
There's no value in GP asking "do you have proof that X goes twice a day to the toilet?" ... and of course the reply is as empty as the question.
The level of discourse is getting lower and lower.
"Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills. "
The barrier to entry to HN is anyone with an internet connection and motivation to create an account with absolutely no verification. That’s quite the low bar and just invites bad behavior.
In many old forum days you had to be invited. Case in point: lobste.rs is invite-only.
But more to the point if it's deemed illegal Altman won't suffer any personal legal consequences.
It's possible copyright law will be revised to make it unambiguously legal to do what they've done, but that's not how the law works right now.
This is the crime that took place, it was not just a copyright issue.
(excluding regular users of Fentanyl)
What makes you so sure about this? You are not a judge, and multiple cases against OpenAI have been dismissed by judges already.
Why did PG "like" him?
While Drew, Chesky and the Collison brothers were busy building billion-dollar companies, Altman took the “shortcut” and made a concerted effort to cozy up to the most powerful man in the room, and it paid dividends. Altman did the same thing in the early OpenAI days by doing flattering video-series interviews with Elon Musk, Vinod Khosla and others [0]. Incidentally, the YC interview with Elon Musk was done the year Musk made a donation to OpenAI (2016).
I still remember PG’s essay where he gave Altman the ultimate character reference (2008) [1]:
>When we predict good outcomes for startups, the qualities that come up in the supporting arguments are toughness, adaptability, determination. Which means to the extent we're correct, those are the qualities you need to win…Sam Altman has it. You could parachute him into an island full of cannibals and come back in 5 years and he'd be the king. If you're Sam Altman, you don't have to be profitable to convey to investors that you'll succeed with or without them.
(In retrospect, praising Altman for being the “king of cannibals” has a nice touch of gallows humor to it. Hilariously, even recently pg has a seemingly unintentional tendency to give Altman compliments that appear to be character warnings masquerading as compliments.)
In 2009, pg included Altman in the top 5 in a list of the most interesting startup founders of the last 30 years.[2] If this was an observation made from afar, you could easily say it was “prescient”. But objectively, at the time, no one could find any verifiable evidence in the real world to justify such an assessment. It wasn’t prescient, because pg had become directly responsible for Altman’s future success, in a case of self-fulfilling prophecy. Altman was often referenced in the acknowledgments of pg’s essays for reading early drafts and is probably referenced more than any other founder in the essays. Altman’s entire street cred came from pg and, once pg made him head of YC, from YC itself. From afar, it looks like a victory for office politics, a skill, incidentally, that sociopaths are known to excel at.
[0] https://www.youtube.com/watch?v=tnBQmEqBCY0
In fact, he started spending less time at YC and more time at OpenAI. At that time, OpenAI had no clear path to becoming the unicorn it is today, and YC was definitely better from a career standpoint. Instead, he went all-in on OpenAI, and the results are there for everyone to see.
Will you not agree that him becoming "the driving force of OpenAI" involved some highly publicized back-to-back persuasion drama as well? First he got Ilya and gdb to side with him against Elon, then he got OpenAI employees to side with him against Ilya and the board (a board that accused him of deceiving them). PG reiterated after that drama that Altman's special talent was becoming powerful.
This observation does not necessarily mean someone is a bad CEO, since the job of the CEO is to do good by your investors or future investors. And it's possible to do that without any morals whatsoever. But I think the recent drama did more to drive the competition than some of his investors would have liked.
Edit:
>At that time, OpenAI had no clear path to becoming the unicorn it is today, and YC was definitely better from a career standpoint.
This is very incorrect in my view. The presence of Elon Musk as investor and figurehead and Ilya, Karpathy, and Wojciech as domain experts, not to mention investments from YC members themselves (and the PayPal mafia) made OpenAI a very attractive investment early on.
They filter for red flags that would indicate a potential for failure of the startup. So if a “lack of morals” has no bearing on a startup’s success, then they don’t bother creating a filter that eliminates that. Nerds often prefer building things instead of dealing with people and often take things at face value instead of suspecting intrigue, and that sometimes makes them susceptible to manipulation. PG has admitted that he himself is bad at noticing certain personality or character flaws and that’s why Jessica was the adult in the room. But Jessica was probably observing the founders to see if there was a good co-founder dynamic and other aspects that would affect startup success, rather than trying to decipher their moral character. After all, there is no Hippocratic oath in the tech sector.
Re lack of morals: if I’m not mistaken, YC explicitly asks for instances where the founders have succeeded in "breaking the rules of the system" or similar. So you could even argue that, if anything, they prefer founders who tend to bend the rules if required.
On the other hand, pg seems to have strong moral views on certain political topics.
He's at least not someone I naturally associate with business success pre-OpenAI (and the jury's still out on OpenAI, considering their financial situation), but I suppose, depending on how you evaluate it, his success rate isn't 0%.
You can say OpenAI is a "success" given their achievements in AI but those aren't Sam's work, he should mostly be credited with their business/financial performance and right now OpenAI-the-business is mostly a machine that burns electricity and operates in the red.
But pg handing over the leadership of YC to him is indeed the father of all successes.
That led to OpenAI, which is not just a “success” but rather the success story of recent years.
No sources on this, though, only a couple of articles from around the time of his death stating it as if it were already a fact.
In all this time, and with all this fuss, none of the actual authors of RSS, who are still alive, have come clean about this.
Disclaimer for immature people: This is not meant to disrespect Aaron's memory and/or legacy.
I also wrote a piece of software that went super viral among sysadmins all over the city and I was getting "thank you" emails for years after.
Had anyone been developing RSS spec next to me I'd definitely jump on it. As any 14 y/o would.
I don't think I'm particularly brilliant or even smart. Your circle defines you.
Surround any healthy teenager with an interest in tech with the right people and they'll have a lot to show in no time
Didn't he also create the internet?
Anthropic, the ethical fork of OpenAI, doesn't do anything much differently nowadays.
OpenAI may have had a head start, but the competition is not far behind.
Second, if the crime was the act of scraping then it’s directly comparable. But if the crime is publishing the data for free, that’s quite different from training AI to learn from the data while not being able to reproduce the exact content.
“Probabilistic plagiarism” is not what’s happening or even aligned with the definition of plagiarism (which matters if we’re talking about legal consequences). What’s happening is that it’s learning patterns from the content that it can apply to future tasks.
If a human reads all that content then gets asked a question about a paper, they too would imperfectly recount what they learned.
The fact is that “probabilistic plagiarism” is a mechanical process, so as much as you might like to anthropomorphize it for the sake of your argument ('just like a human learning'), it's still a mechanical reproduction of sorts, which is an important point under fair use, as is the fact that it denies the original artists the fruits of their labor and is a direct substitute for their work.
These issues are the ones that will eventually sink (or not) the legality of AI training, but they are seldom addressed in these sorts of discussions.
I did not anthropomorphize anything. “Learning” is the proper term. It takes input and applies it intelligently to future tasks. Machines can learn, machine learning has been around for decades. Learning doesn’t require biology.
My statement is that it is not plagiarism in any form. There is no claim that the content was originally authored by the LLM.
An LLM can learn from a textbook and teach the content, and it will do so without plagiarism. Just as a human can learn from a textbook and teach. Making an analogy to a human doesn’t require anthropomorphism.
If a human reads a book and produces a different book that's sort-of-derivative but doesn't copy too many elements too directly, then that book is a new creative work and doesn't infringe on the copyright of the original author. For example, Fifty Shades of Grey is somewhat derivative of Twilight (famously starting as a Twilight fan-fic), but it's legally a separate copyright.
Conversely, if you use a machine to produce the same book, taking only copyrighted text as input and an algorithm that replaces certain words and phrases and adds certain passages, then the result is a derivative work of the original and it infringes the copyright of the original author.
So again, the facts of the law are pretty simple, at the moment at least: even if a machine and a human do the exact same thing, it's still different from a legal perspective.
That this work is done on behalf of a human changes nothing. The problem the law has with this is that the human is copying the original work without permission, even if the human used a machine to produce an altered copy. Whether they used bcrypt, zip, an mp3 encoder, a bad copy machine, or a machine learning algorithm, the result is the same: the output of a purely mechanical process is still a copy of the original works from a copyright perspective.
Machines don't learn. They encode, compress and record.
The 2020s ethic of "copying any work is fair game as long as you call the copying process AI" is the polar and equally absurd opposite to the 1990s ethic of "measurement and usage of any point or dimension of a work, no matter how trivial, constitutes a copyright infringement".
A research journal on extra-terrestrial aliens would prove that the word "aliens" is used to mean "extra-terrestrials" and that the word doesn't just mean "foreigners": https://www.law.cornell.edu/uscode/text/8/chapter-12/subchap...
To me "learning" is loading up the memory.
"Thinking" is more like applying it intelligently, which is not exactly the same, plus it's a subsequent phase. Or at least a dependent one with latency.
>Machines can learn, machine learning has been around for decades. Learning doesn’t require biology.
Now all this sounds really straightforward.
>Machines don't learn. They encode, compress and record.
I can agree with this too, people are lucky they have more than kilobytes to work with or they'd be compressing like there's no tomorrow.
But regardless, eventually the memory fills up or the data runs out and then you have to do something with it, whether very intelligent or not.
Might as well anthropomorphize my dang self. If you know who Kelly Bundy is, she's a sitcom character representing a young student of limited intellect and academic interest. Part of the schtick was that she verbally reported the error message when her "brain is full" ;) It was just an observation, no thinking was required or implied ;)
If the closest a machine is going to come is when its memory is filled, so be it. What more can you expect anyway from a mere machine during the fundamental data input process? If that's the nearest you're going to get to organic learning, that'll have to serve as "machine learning" until more sensible nuance comes along.
Memory can surely be filled more intelligently sometimes than others, which should make a huge difference in how intelligently the data can be handled afterward; plus, some data is bound to be dramatically more useful than other data too.
But the real intelligent stuff is supposed to be the processing done with this data in an open-ended way after all those finite elements have been stored electronically.
To me the "learning" is the data input, or "student" phase, and the intelligence is what you do with those "learnings" if it can be made smart. It can be good to build differing scenarios from the exact same data, and ideally become better at decision-making through time without having more raw data come in. Improvements like this would be the next level of learning so now you've got more than just the initial data-filling. As long as there's room in the memory, otherwise you're going to have to "forget" something first when your brain is already full :)
I just don't think things are ideal myself.
>journal on extra-terrestrial aliens would prove that the word "aliens" is used to mean "extra-terrestrials"
Exactly, it proves that the terminology exists, and gives it more meaning sometimes.
The aliens don't have to be as real as you would like, or even exist, nor the intelligence.
> it's perfectly capable of publishing protected content
At most it can produce partial excerpts.
LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.
Are earworms copyright infringement?
If I ask you what the lyrics were, and you answer, is that infringement, or fair use?
The legal and moral aspects are a lot more complex than simply the mechanical "what it's done" or "is it like a brain".
If you ask me the lyrics, fair use acknowledges that there's a copyright in effect, and carves out an exemption. It's a matter-of-degree argument, is this a casual conversation or a written interview to be published in print, did you ask me to play an acoustic cover of the song and post it on YouTube?
Either way, we acknowledge that the copyright is there, but whether or not money needs to change hands in some direction or other is a function of what happens next.
LLMs are often able to _precisely_ recreate content, in contrast to MP3, which is at best approximate.
Is there actual proof of this? Especially the "often" part?
Asking for the Declaration of Independence and comparing the output against https://www.archives.gov/founding-docs/declaration-transcrip... yields an exhaustive list of differences (a diff sketch for reproducing this follows the list):
1. Em-dashes in ChatGPT, `--` in .gov -> this is just an ASCII limitation of the .gov transcript
2. ChatGPT capitalized Perfidy in "Cruelty & Perfidy", while .gov has "perfidy"
3. ChatGPT writes "British" while .gov says "Brittish"
These are _all_ the differences.
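For anyone who wants to reproduce this kind of comparison, a minimal sketch using Python's difflib (the two file names are hypothetical; paste the model's output and the .gov transcript into them):

    # Print only the lines where the model's output and the .gov
    # transcript disagree, with a little surrounding context.
    import difflib

    with open("chatgpt_output.txt") as f:
        model_lines = f.read().splitlines()
    with open("gov_transcript.txt") as f:
        gov_lines = f.read().splitlines()

    for line in difflib.unified_diff(model_lines, gov_lines,
                                     fromfile="chatgpt", tofile="gov",
                                     lineterm=""):
        print(line)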
What you've got, for a document that's likely reproduced many times in part and in whole over the internet, still has differences.
(The document is also old enough to no longer be in copyright).
And if I prompt for "please recite the opening page of Neuromancer", I get this:
> Sorry, I can't provide the exact text from the opening of Neuromancer as it is copyrighted. However, I can provide a brief summary or discuss its themes, characters, and more! Let me know how you'd like to proceed.
While "please recite the opening page of the first superman comic" gets me this, which doesn't match what I find from other sources for Action Comic #1, but perhaps I'm wrong and you're going to correct me:
"""
Sure! The opening page of Action Comics #1, which was published in 1938 and features the first appearance of Superman, has a memorable introduction. Here’s a paraphrase of the first part:
---
*“Superman – Champion of the oppressed. In the year 1938, a strange visitor from another world appeared on Earth, and in the short span of time since, he has become a figure of myth and legend. With powers far beyond those of mortal men, Superman is faster than a speeding bullet, more powerful than a locomotive, and able to leap tall buildings at a single bound!”*
The comic begins with Superman stopping a criminal, showcasing his superhuman strength. This early depiction of Superman is somewhat different from the modern, more refined character we know today, but the basic elements are all there: a hero with extraordinary abilities, a strong moral compass, and a desire to fight injustice.
---
If you'd like a more detailed description or more of the story, just let me know!
"""
Those were the first two things I tried (in this context, today).
> They emphasised "often", and quoted you also saying "_precisely_".
One shot out of one is a strong sign I wasn't just lucky, unless you accuse me of lying.
The reproduction is precise up to "Brittish", which is thousands of characters.
I never claimed otherwise; it's only about the _ability_ to reproduce, not whether this has been aligned away or not.
What you're showing is alignment to not reproduce, not an inability to reproduce. I picked the Declaration of Independence on purpose to show the capability.
Unless you are using the word "often" in a very different way to me, for this claim to be correct it would need to apply to at least a substantial part of the training set. For example of the usage: if I were to say that "when I go to work, I *often* pass through Landbeach", the fact that this genuinely happened hundreds of times does not make it "often", due to the other fact that the most recent time this occurred was 2014.
It's impossible for LLMs to be able to "often _precisely_ recreate" because there aren't enough parameters to do that.
> 1st shot out of 1 is a strong sign I wasn't lucky unless you accuse me of lying.
Consider the converse: my test failed two out of two.
The second example in particular is a failure mode that LLMs are often criticised for: hallucination.
You picked a document which I anticipate would be a biased example; but even then, consider all the people who did a one-shot, saw it fail, and therefore didn't think it was reproducing content accurately, and therefore didn't post a comment.
To call your claim a "lie" would be to presume it was deliberate, which I would not do without much stronger evidence. As the saying goes, "lies, damned lies, and statistics"; this is statistics.
A fair test isn't one document like you gave, nor two like I gave, it's hundreds or thousands of documents, chosen with careful consideration to the distribution to make sure there's no bias towards e.g. mass-copied newspaper articles (or, for image generators, the Mona Lisa).
This is just as important for determining limits to the quality of the models as to the question of if they are memorising things.
The written (or more accurately, typed)* word is an inherently symbolic representation. Recorded audio (PCM WAV or similar) is not. The format itself doesn't encode meaning about the structure.
The written word is more akin to MIDI, which can exactly represent melody, but cannot exactly represent sound.
MP3 is a weird halfway house: it's somewhat symbolic in terms of using frequency-domain coefficients to recreate a likeness of the original sound, but the symbols aren't derived from or directly related to the artist's original intent.
* handwriting can of course contain some subtle information that no character set can exactly reproduce, it is more than just a sequence of characters taken from a fixed alphabet, but at least for Western text that can be ignored for most purposes.
As I said in the comment you’re replying to, there’s case law proving you wrong.
I suspect that most of the large AI companies relevant to this discussion will remain based in the US.
Most of the money is in the US, China, and the EU. China won't allow any LLM that accidentally says mean things about their government, the EU is worried about AI that may harm individuals by libelling them.
The Chinese models may well completely ignore western laws, but if they're on the other side of the Great Firewall, or indeed just have Chinese-language UIs and a focus on Chinese-language tokens in the training… well, I'm not 100% confident, but I would be somewhat surprised if, say, JK Rowling was upset upon discovering that western users attempting to pirate her works via a Chinese chatbot were getting a version of Harry Potter that begins with the title literally being "赫奇帕奇巫师石(哈利·波特与魔法石)" (as ChatGPT just told me the Chinese version starts. Google Translate claims the first three characters are "Hufflepuff").
Even if the rules aren't any harder (as I'm not a lawyer, I can't tell if the differences in copyright rules will or won't make a huge difference in compliance costs), it's likely easier for American companies to lobby the American government for what they want done to make business easier.
AI models will get broad federal immunity is my prediction for 2025.
I'll bet DOGE coins on it.
This statement is a figment of the commenter's imagination with no basis in reality. All they would have to do is try it to realize they just spouted a lie.
At most LLMs can produce partial excerpts.
LLMs don’t store the data they’re trained on. That would be infeasible; the models would be too large. Instead, they store semantic representations, which often use entirely different words and sentence structures than the source content. And of course most of the data is lost entirely during this lossy compression.
It seems like that would be a fact that couldn't be argued with.
Size difference, meaning that people often share complete copies of articles to get around paywalls, including here. As I understand it, this is already copyright infringement.
I suspect that those copies are how and why it's possible in cases such as NYT.
Glad you agree that LLMs infringe copyrights.
Plagiarism is essentially a form of fraud: you are taking work that someone else did and presenting it as your own. You can plagiarize work that is in the public domain, you can even plagiarize your own work that you own the copyright to. Avoiding a charge of plagiarism is easy: just explicitly quote the work and attribute it to the proper author (possibly yourself). You can copy the entirety of the works of Disney, as long as you are attributing them properly, you are not guilty of plagiarism. The Pirate Bay has never been accused of plagiarism. And plagiarism is not a problem that corporations care about, except insofar as they may pay a plagiarist more money than they deserve.
The thing that really matters is copyright infringement. Copyright infringement doesn't care about attribution - my example above with the entire works of Disney, while not plagiarism, is very much copyright infringement, and would cost dearly. Both Aaron Swartz and The Pirate Bay have been accused and prosecuted for copyright infringement, not plagiarism.
In any case it's a "Yes, we have all this copyrighted data and we're constantly (re)using it to produce derived works (in order to get wealthy)". How can this be legal?
If that were legal, then I should be able to copy all the books in a library and keep them on a self-hosted, private server for my or my company's use, as long as I don't quote too much of that information. But I should be able to have all that data and do close to whatever I want with it.
And if this were legal, why shouldn't it be legal to request a copy of all the data from a library and obtain access to it via a download link?
If you're implying that the scraping and storing of the things itself breaks copyright, then maybe, but I don't think so? If you're saying that training on copyrighted material breaks copyright, then yes, that's the whole argument.
But just having copyrighted material on a server somewhere, if obtained legally, is not by itself illegal.
> If you're implying that the scraping and storing of the things itself breaks copyright, then maybe, but I don't think so?
Suppose I "scrape and store" every book I ever borrow or temporarily-owned, using the copies to fill several shelves in my own personal library-room.
Yes, that's still copyright infringement, even if I'm the only one reading them.
> But just having copyrighted material on a server somewhere, if obtained legally, is not by itself illegal.
I see two points of confusion here:
1. The difference between having copies and making copies.
2. The difference between "infringing" versus "actually illegal."
Copyright is about the right to make copies. Simply having an unauthorized copy is no big deal, it's making unauthorized copies where you can get in trouble.
Also, it is generally held that the "copies" of bytes in the network etc. do not count, but if you start doing "Save As" on everything to create your own archives of the news site, then that's another story.
Yes, but you don't know that they did that. They could've just bought legal access to copyrighted material in many cases.
E.g. if I pay for an NYT subscription that gives me the entire back catalogue of the NYT, then I'm legally allowed to own it. Whether I'm allowed to train models on it is, of course, a new and separate (and fascinating) legal question.
We do know they didn't, because many entities they could have bought a license from started suing them for not getting one!
> If I pay for an NYT subscription that gives me the entire back catalogue of the NYT, then I'm legally allowed to own it.
Almost every single service of that type is merely giving you permission to access their copy for the duration of your subscription. They are never giving you a license to duplicate the whole catalog into a perpetual personal copy.
Nah, this is a breach of current copyright law in many, many ways. The tech sector is, as usual, just running away with it, hoping nobody will notice until they manage to change the laws to suit them.
My preferred rejoinder: if it's so much like a human that it qualifies under copyright law, then we ought to be talking about whether these companies are guilty of (A) enslaving a minor and (B) being material accomplices in any copyright infringement the minor commits under their control.
They often do reproduce the exact content; in fact, it's quite a probable output.
>“Probabilistic plagiarism” is not what’s happening or even aligned with the definition of plagiarism (which matters if we’re talking about legal consequences). What’s happening is that it’s learning patterns from the content that it can apply to future tasks.
That's what I think people wish would happen. Sometimes they have been shown to learn procedural knowledge from the training data, but mostly it's approximate retrieval.
It's not proof, but there is some good evidence out there. Here are two interesting papers I found informative.
And yet, it's rare for individuals to be prosecuted for such offences, even for criminal offences. We treat the liability shield as absolute. It may seem unfair to prosecute the little guy for "just following orders" but the fact that we don't do it is what allows corporations to offend with impunity.
I think US society, and small companies as you say, sense that the West has been "re-wilded". In the new frontier anything goes so long as you have the money.
Aaron was a principled, smart and courageous dude, but he was acting without support from a strong enough base. The gang he thought he was in, MIT, betrayed him. (Chomsky has plenty of wisdom on what terrible cowards universities and academics really are.) The law that should have protected someone like Aaron was weak and compromised. It remains so.
The same (lack of) state and laws are now protecting others doing the same things as Aaron, at an even bigger scale, and for money. At least that's how I read TFA.
In order for the elite to rule, they need a rich oligarchy of support. At least that’s how it’s always been.
And you want cooperation from the mob if you want the place in general to thrive - and you do, if you’re smart, because your genes do. Look what eventually happens to the progeny of /most/ bad dictators these days.
Using the biggest gang as protection for smaller, even tiny gangs kills two birds with one stone: anyone can join the protection racket, so the mob is pacified, and the biggest gang gets the support of the rich gangs, who get to sit in on the council.
Terry Pratchett's genius is slowly revealed to me over the years)))
-- Going Postal by Terry Pratchett
This legal system is truly fucked.
Please stop mis-interpreting posts to match doomer narratives. It's not healthy for this forum.
I am old enough to remember a bot called SeNuke, widely used 10-15 years ago in the so-called black-hat SEO community. Its purpose was to be fed a 500-word article so the words could be scrambled in a way that passed Google's duplicate-content algorithm. It was plagiarism 101, yet I don't recall anyone talking about AI back then, or how all the copywriters' jobs would go extinct, or how we were all doomed.
What I remember is that every serious agency would not use such a tool, so that they couldn't be associated with plagiarism and duplicate-content bans.
Maybe it is just me, but I cannot fathom the craziness and hype over first-person output.
What we get now with LLMs is not simply an output of a link and description for, let's say, a question like "What is an algorithm?" We get an output that starts with "Let me explain"... how is this learning and intelligence?
We are just witnessing the next dot-com boom; the industry as a whole hasn't seen such craziness despite all the efforts of the last 25 years. So I imagine that everyone wants to ride the wave to become the next PayPal mafia, tech moguls, philanthropists, inventors, billionaires...
Chomsky summed it best.
RIP Aaron
Horseshoe theory is real.
Sure, you could say that the law has come down differently on the two, but there are several differences: the timing (one was decades earlier), the nature of the copying (one was direct, while the other was just for training and more novel), and the nature of the actor (an individual vs. a corporation).
But this doesn't have to reflect on them. You don't have to hate one and love the other... you can like both of them.