Please don't let me discourage you, though. I think this could be important work, but if and only if it gets endorsement from more than one large model lab producing any interesting work.
Then, maybe don't go around stealing and bastardizing the "open source" concept when absolutely none of the serious AI research is open source or reproducible. Just because you read a fancy word online once and think you can use it doesn't mean you're right.
I definitely disagree. Adoption of open licenses has historically been "bottom-up", starting with academia and hobbyists and then eventually used by big names. I have zero idea why that can't be the case here.
I know I'll be releasing my models under an open license once finalized.
The only actually open source model I am aware of is AI2’s OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...), which includes training data, training code, evaluation code, fine tuning code, etc.
The license also matters. A burdened license that restricts what you can do with the software is not really open source.
I do have concerns about where OSI is going with all this. For example, why are they now saying that reproducibility is not a part of the definition? These two paragraphs below contradict each other - what does it mean to be able to “meaningfully fork” something and be able to make it more useful if you don’t have the ingredients to reproduce it in the first place?
> The aim of Open Source is not and has never been to enable reproducible software. The same is true for Open Source AI: reproducibility of AI science is not the objective. Open Source’s role is merely not to be an impediment to reproducibility. In other words, one can always add more requirements on top of Open Source, just like the Reproducible Builds effort does.
> Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone.
I could be misunderstanding them, but my takeaway is that exact bit-for-bit reproducibility is not required. Most software, including open source, is not bit-for-bit reproducible. Exact reproducibility is a fairly new concept. Even with all the training data and all the code, you are unlikely to get the exact same model as before.
Though if that is what they mean, then they should be more explicit about it.
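To make the "unlikely to get the exact same model" point concrete, here is a minimal sketch (my own illustration, not anything from the OSI text). It shows one common reason identical code and data need not give identical bits: floating-point addition is not associative, so a different summation order, as happens with parallel reductions, can change the low-order bits. The helper name `same_bits` is made up for the example.

```python
# Floating-point addition is not associative, so the order in which
# partial sums are combined can change the low-order bits of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)


def same_bits(a: float, b: float) -> bool:
    """Return True only if both floats have the exact same representation."""
    return a.hex() == b.hex()


print(left_to_right, right_to_left)              # 0.6000000000000001 0.6
print(same_bits(left_to_right, right_to_left))   # False
```

The two results are numerically almost identical, which is roughly the gap between "reproducible in behavior" and "bit-for-bit reproducible".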
It's like taking an executable (.so module, firmware blob) and releasing it under a permissive license, so anyone could disassemble, modify and hack it. And then disclosing what programming languages were used and pointing at a few libraries. And then saying that no, the actual source code is not going to be released.
> The aim of Open Source is not and has never been to enable reproducible software.
...
> Open Source means giving anyone the ability to meaningfully “fork” (study and modify) a system, without requiring additional permissions, to make it more useful for themselves and also for everyone.
...
> Forking in the machine learning context has the same meaning as with software: having the ability and the rights to build a system that behaves differently than its original status. Things that a fork may achieve are: fixing security issues, improving behavior, removing bias.
For these things, it does mean what most people are asking for: training details.
So far companies are just releasing checkpoints and architecture. It is better than nothing, and it is a great step (especially with how entrenched businesses are[1]). But if we really want to do things like fix security issues or remove bias, we have to be able to understand the data the model was originally trained on AND the training procedures. Both of these introduce certain biases (in the statistical sense, which is more general). These issues can't all be solved by tuning, and the ability to tune is itself significantly influenced by these decisions.
The reason we care about reproducible builds is that they matter for things like security, where we want to know that what we're looking at is the same thing that's in the actual program. It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source. Trust matters, but the saying is "trust but verify". Sure, you can also fix vulns and bugs in closed source software; hell, you can even edit or build on top of it. But we don't call these things open source (or source available) for a reason.
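As a rough sketch of the "verify" half, assuming you have the published artifact and your own rebuild as bytes, a bit-for-bit check is just a digest comparison. The function names and the byte strings below are made-up placeholders, not real build outputs.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of a build artifact."""
    return hashlib.sha256(artifact).hexdigest()


def is_reproducible(published: bytes, rebuilt: bytes) -> bool:
    """True only if the rebuild matches the published artifact exactly."""
    return digest(published) == digest(rebuilt)


# Placeholder artifacts standing in for real binaries.
published = b"\x7fELF example build output"
rebuilt_same = b"\x7fELF example build output"
rebuilt_diff = b"\x7fELF example build output, built at another time"

print(is_reproducible(published, rebuilt_same))  # True
print(is_reproducible(published, rebuilt_diff))  # False
```

Without the source and build inputs there is nothing to rebuild, so there is nothing to compare the published artifact against.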
If we're going to be consistent in our definitions, we need to understand what these things are, at least at a minimal level of abstraction. And frankly, as an ML researcher, I just don't see it.
That said, I'm generally fine with "source available" and, like most people, use it synonymously with "open source". But if you're going to go around telling everyone they're wrong about the OSS definition, at least be consistent and stick to your values.
[0] https://opensource.org/osd
[1] Businesses whose entire model depends on OSS (by OS's definition) and freely available research
> "Reproducible build" is a term used to refer to getting an exact binary match out of a build.
I'm not sure what makes you think I failed to understand this. Allow me to quote myself:
>> It is fair to say that the "aim" isn't about reproducible software, but it is a direct consequence of the software being open source.
But also, my entire point is not really about the reproducible build aspect. It is that if we're going to draw an analogy, then the training and data IS the source. At worst, we'd say it is the build instructions.
But maybe I don't understand Open Source. Is it still Open Source if I provide the source code and an Apache License, but the code is in my own custom language (for fun, let's say it reads like brainfuck) and I have not released the compiler? Maybe some people would call this Open Source, but I imagine it would ruffle a lot of feathers. Is there really a meaningful difference between that and a binary? If it does fit "the letter of the law", it most certainly does not fit "the spirit of the law". It is the spirit of the law that matters, because it is the whole fucking point.
Just like a free grazing field would allow living animals, but not a combine harvester. The old rules of "for any purpose" no longer apply.
Okay, well, having the domain name "opensource.org" doesn't give you the ability to speak for the community, or for the community's understanding of the term.
opensource.org is irrelevant.
"never been to enable reproducible software"
I'd say "never" is a big word.
Having open code that everyone can read and run was partly meant to allow for reproducibility. In the closed world, how is anybody reproducing anything? Being open does enable that.