The Open Source AI Definition RC1 Is Available for Comments (opensource.org)

47 points by foxbee a year ago | 11 comments

swyx a year ago
D.O.A without adoption from the major model labs (including the "opener" ones like AI2 and lets say Together/Eleuther). i dont like the open source old guard feeling like they have any say in defining things when they dont have skin in the game. (and yes, this is coming from a fan of their current work defending the "open source" term in traditional dev tools). a good way to ensure decline to irrelevance is to do a lot of busywork without ensuring a credible quorum of the major players at the table.
please dont let me discourage tho, i think this could be important work but if and only if this gets endorsement from >1 large model lab producing any interesting work
- sigh_again a year ago
  > they have any say in defining things when they dont have skin in the game.
  Then, maybe don't go around stealing and bastardizing the "open source" concept when absolutely none of the serious AI research is open source or reproducible. Just because you read a fancy word online once and think you can use it doesn't mean you're right.
  - appendix-rock a year ago
    How does this constitute a meaningful reply to OP? What’s your point? That you’re angry? That doesn’t rebut anything they’ve said.
- jszymborski a year ago
  > D.O.A without adoption from the major model labs
  I definitely disagree. Adoption of open licenses has historically been "bottom-up", starting with academia and hobbyists and then eventually used by big names. I have zero idea why that can't be the case here.
  I know I'll be releasing my models under an open license once finalized.
  - Incipient a year ago
    Hobbyists and small players could release code as easily as anyone big... I don't believe that's the case with AI, especially LLMs. Is it not only the large companies that are able to release meaningful content?
    - jszymborski a year ago
      Important work still gets done on smaller models using consumer GPUs. I've trained protein LLMs for my PhD on as little as a single RTX 3090. This is even more so the case with Computer Vision.
- blackeyeblitzar a year ago
  Why should the “old guard” not have a say when they came up with the idea of open source? It is misleading to adopt terminology with well known definitions and abuse it. People like Meta are free to use some other terminology that isn’t “open source” to describe their models, which I cannot reproduce because they’ve released nothing except weights and inference code.
  - appendix-rock a year ago
    Because language is rarely if ever prescriptive. It evolves organically and without much rhyme or reason beyond “because that’s how things went”. The fact that you think that software neckbeards, err, greybeards, are somehow exempted from that is a hilarious example of ‘tech exceptionalism’ brought to its natural conclusion.
blackeyeblitzar a year ago
A reinforcement of definitions is needed. Open weights is NOT open source. But there are people like Meta that are rampantly open washing their work. The point of open source is that you can recreate the product yourself, for example by compiling the source code. Clearly the equivalent for an LLM is being able to retrain the model to produce the weights. Yes I realize this is impractical without access to the hardware, but the transparency is still important, so we know how these models are designed, and how they may be influencing us through biases/censorship.
The only actually open source model I am aware of is AI2’s OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...), which includes training data, training code, evaluation code, fine tuning code, etc.
The license also matters. A burdened license that restricts what you can do with the software is not really open source.
I do have concerns about where OSI is going with all this. For example, why are they now saying that reproducibility is not a part of the definition? These two paragraphs below contradict each other - what does it mean to be able to “meaningfully fork” something and be able to make it more useful if you don’t have the ingredients to reproduce it in the first place?
> The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.
> Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone.
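As a rough illustration of the gap between "can retrain" and "bit-for-bit reproducible", here is a toy sketch in Python. It is an assumption-laden stand-in, not how any real model is trained: `train` is a hypothetical function that fits a two-parameter line with SGD, and the seed stands in for every source of run-to-run variation (initialization, data order, and so on).

```python
import random
from typing import Tuple


def train(seed: int, steps: int = 300, lr: float = 0.05) -> Tuple[float, float]:
    """Hypothetical toy 'training run': fit y = w*x + b with plain SGD."""
    rng = random.Random(seed)
    # The "open training data": noise-free points on the line y = 3x + 1.
    data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
    # Random initialization and random sample order both depend on the seed.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


# Same data, same code, same seed: the "weights" come out identical.
same_seed_equal = train(0) == train(0)

# Same data, same code, different seed: a separately trained model
# whose weights are not bit-identical to the first one.
different_seed_equal = train(0) == train(1)

print(same_seed_equal, different_seed_equal)
```

With the data and code in hand, anyone can rerun `train` and get their own model; getting identical weights additionally requires pinning every source of randomness, which is a stricter property than having the ingredients.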
- MichaelNolan a year ago
  > what does it mean to be able to “meaningfully fork” something and be able to make it more useful if you don’t have the ingredients to reproduce it in the first place?
  I could be misunderstanding them, but my takeaway is that exact bit for bit reproducibility is not required. Most software, including open source, is not bit for bit reproducible. Exact reproducibility is a fairly new concept. Even with all the training data, and all the code, you are unlikely to get the exact same model as before.
  Though if that is what they mean, then they should be more explicit about it.
- dchichkov a year ago
  I agree, Open Weights are Open "Binary", not Open Source.
  It's like taking an executable (.so module, firmware blob) and releasing it under a permissive license, so anyone can disassemble, modify and hack it. And then disclosing what programming languages were used and pointing at a few libraries. And then saying that no, the actual source code is not going to be released.
wmf a year ago
Various organizations are willing to release open weights but not open source weights according to this definition, so this is going to be a no-op. Open source already existed before the OSI codified it, but now they're trying to will open source AI into existence against tons of incentives not to.
pabs3 a year ago
This doesn't look like a proper open source AI definition to me, I prefer what the Debian folks came up with.
https://salsa.debian.org/deeplearning-team/ml-policy
datascientist a year ago
also see https://gradientflow.com/open-source-principles-in-foundatio...
godelski a year ago
I don't think this makes sense, nor is it consistent with itself, let alone with their other definition [0].
```
  > The aim of Open Source is not and has never been to enable reproducible software.
  ...
  > Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone. 
  ...
  > Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently than its original status. Things that a fork may achieve are: fixing security issues, improving behavior, removing bias.
```
For these things, it does mean what most people are asking for: training details.
So far companies are just releasing checkpoints and architecture. It is better than nothing and this is a great step (especially with how entrenched businesses are[1]). But if we really want to do things like fixing security issues or remove bias, you have to be able to understand the data that it was originally trained on AND the training procedures. Both of these introduce certain biases (via statistical definition, which is more general). These issues can't all be solved by tuning and the ability to tune is significantly influenced by these decisions.
The reason we care about reproducible builds is because it matters to things like security, where we know what we're looking at is the same thing that's in the actual program. It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source. Trust matters, but the saying is "trust but verify". Sure, you can also fix vulns and bugs in closed source software, hell, you can even edit or build on top of it. But we don't call these things open source (or source available) for a reason.
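To make the verification point concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: `build` is a toy stand-in for a compiler (not any real toolchain), and the timestamps are made up. It only shows that a rebuilt artifact can be checked against a published SHA-256 digest when the build is deterministic, and that embedding something like a build timestamp breaks the match even though the source is identical.

```python
import hashlib
from typing import Optional


def build(source: str, timestamp: Optional[str] = None) -> bytes:
    """Toy 'compiler' (hypothetical): derive an artifact from source text.

    If a timestamp is given, it is embedded in the artifact, the way some
    real build systems embed the build time.
    """
    header = "built-at:{}\n".format(timestamp) if timestamp is not None else ""
    return (header + source.upper()).encode("utf-8")


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of an artifact, as a publisher might list it."""
    return hashlib.sha256(artifact).hexdigest()


source = "print('hello')"

# Deterministic build: a third party rebuilding from source gets the
# same digest the publisher advertised, so the binary can be verified.
published = digest(build(source))
rebuilt = digest(build(source))
deterministic_match = published == rebuilt

# Same source, but each build embeds its own (made-up) timestamp:
# the digests no longer match, so the check tells us nothing.
stamped_a = digest(build(source, "2024-10-01T12:00:00Z"))
stamped_b = digest(build(source, "2024-10-01T12:00:05Z"))
timestamp_match = stamped_a == stamped_b

print(deterministic_match, timestamp_match)
```

The source is equally "open" in both cases; what changes is whether someone outside can confirm that the shipped artifact came from it.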
If we're going to be consistent in our definitions, we need to understand what these things are at at least a minimal level of abstraction. And frankly, as a ML researcher, I just don't see it.
That said, I'm generally fine with "source available" and, like most people, use it synonymously with "open source". But if you're going to go around telling everyone they're wrong about the OSS definition, at least be consistent and stick to your values.
[0] https://opensource.org/osd
[1] Businesses whose entire model depends on OSS (by OS's definition) and freely available research
- ensignavenger a year ago
  "Reproducible build" is a term used to refer to getting an exact binary match out of a build. This is outside the scope of the OSD. I am not certain, but it sounds like this is what they are talking about here. Just because you run the build yourself doesn't mean you will get an exact match of what the original producer built. Something as simple as a random number generator or using a timestamp in the build will result in a mismatch.
  - godelski a year ago
    > "Reproducible build" is a term used to refer to getting an exact binary match out of a build.
    I'm not sure what makes you think I failed to understand this. Allow me to quote myself:
    >> It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source.
    But also, my entire point is not really about the reproducible build aspect. It is that if we're going to draw an analogy then the training and data IS the source. At worst, we'd say it is the build instructions.
    But maybe I don't understand Open Source. Is it still Open Source if I provide the source code, an Apache License, but the code is in my own custom language (for fun, let's say it reads like brainfuck) and I have not released the compiler? Maybe some people would call this Open Source, but I imagine it would ruffle a lot of feathers. Is there really a meaningful difference between that and a binary? If it does fit "the letter of the law" it most certainly does not fit "the spirit of the law". It is the spirit of the law that matters, because it is the whole fucking point.
tananaev a year ago
The definition is good because currently many call their open model weights open "source". But I suspect most companies will still call their models open source even when they're not.
sigh_again a year ago
[flagged]
glkanb a year ago
Ok, decent first steps. Now approve a BSD license with an additional clause that prohibits use for "AI" training.
Just like a free grazing field would allow living animals, but not a combine harvester. The old rules of "for any purpose" no longer apply.
exac a year ago
> The aim of Open Source is not and has never been to enable reproducible software.
Okay, well, having the domain name "opensource.org" doesn't give you the ability to speak for the community, or for the community's understanding of the term.
opensource.org is irrelevant.
- FrustratedMonky a year ago
  I agree.
  "never been to enable reproducible software"
  I'd say "never" is a big word.
  Having open code that everyone can read and run was partly to allow for reproducibility. In the closed world, how is anybody reproducing anything? Being open does enable that.
  - saurik a year ago
    The article seems to cover this nuance in the next paragraphs?
- saurik a year ago
  I mean, I've never understood "open source" to require reproducibility? That concept barely even existed as a thing people strove for until 15 years ago, a lot of software still only barely supports such, and there are tons of tradeoffs that come with it (as you effectively then also inherit your entire toolchain as vendor maintained, and a lot of projects end up making that result in awkward binaries, as almost no one reproduces entirely from a small bit of bootstrapped lisp).