But the technical aspects are pretty cool:
- Fault-tolerant training where nodes can be added and removed mid-run without interrupting the other nodes.
- Sending quantized gradients during the synchronization phase.
- (In the OpenDiLoCo article) Async synchronization.
They're also mentioning potential trustless systems where everyone can contribute compute, which would make this a truly decentralized open platform. Overall it'll be pretty interesting to see where this goes!
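If anyone wants a concrete picture of what that looks like, here's a rough toy sketch of the DiLoCo-style outer loop as I understand it from the articles (the worker names, step counts, and outer learning rate are all made up for illustration, not their actual code): each node takes a bunch of local steps, only a pseudo-gradient gets exchanged at sync time, and membership can change between rounds.

    # Toy sketch of a DiLoCo-style outer loop (illustrative, not Prime Intellect's code):
    # each worker takes H local steps, then only the pseudo-gradient (global - local)
    # is exchanged, so workers can join or drop out between sync rounds.
    import numpy as np

    DIM, H, OUTER_STEPS = 10, 50, 20
    rng = np.random.default_rng(0)
    global_w = rng.normal(size=DIM)          # shared "global" weights

    def local_training(w, steps, lr=0.01):
        """Stand-in for H local SGD steps on a worker's private data shard."""
        w = w.copy()
        for _ in range(steps):
            grad = 2 * w + rng.normal(scale=0.1, size=DIM)   # toy quadratic loss
            w -= lr * grad
        return w

    workers = ["a", "b", "c"]
    for t in range(OUTER_STEPS):
        # fault tolerance: membership can change between outer steps
        if t == 5:  workers.remove("b")      # a node drops out mid-run
        if t == 10: workers.append("d")      # a new node joins and pulls global_w

        # each current worker starts from the latest global weights
        pseudo_grads = []
        for _ in workers:
            local_w = local_training(global_w, H)
            pseudo_grads.append(global_w - local_w)   # pseudo-gradient to communicate

        # outer update: average the pseudo-gradients (could be quantized, Nesterov, etc.)
        global_w -= 0.7 * np.mean(pseudo_grads, axis=0)

    print("final weight norm:", np.linalg.norm(global_w))

The communication only happens once per outer step, which is what makes the low-bandwidth / node-churn story plausible at all.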
I did this 9 years ago, works pretty well. I don't understand why all ML isn't async and quantized like that now. This project quantizes to 1 bit per weight and it works so well I didn't even make it configurable.
Does this basically correspond to moving each weight either up or down by a fixed amount? I'm a bit surprised you don't at least need a "stay the same" bit, but I suppose it could balance out over multiple iterations.
Interesting that it works at all. Although, thinking on it, I could see it maybe even having a nice regularizing effect where every layer would end up with similar weight magnitudes (like projecting onto the local n-ball, as mentioned in a paper posted recently on HN).
The weight vectors themselves are regular floats. But the data exchanged between the machines is 1 bit per weight. Basically, you keep track of the changes to the weight vector which haven't yet been propagated to the other machines. You quantize this to 1 bit per weight (i.e. a sign bit) and send it together with a single scale factor X, accumulating the quantization error for the next sync iteration.
You choose X to be the RMS or some similar metric of the accumulated error.
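In case that's hard to picture, here's a minimal toy sketch of that error-feedback scheme (my reading of the description above, not the parent's actual code):

    # 1-bit sync with error feedback: weights stay full precision locally; only
    # sign(accumulated delta) plus one scale factor goes over the wire, and the
    # quantization error is carried into the next sync round.
    import numpy as np

    class OneBitSync:
        def __init__(self, dim):
            self.residual = np.zeros(dim)   # un-propagated change + past quantization error

        def encode(self, weight_delta):
            """Quantize the accumulated delta to a sign bit per weight + one RMS scale."""
            acc = self.residual + weight_delta
            scale = np.sqrt(np.mean(acc ** 2))    # X = RMS of the accumulated error
            bits = np.sign(acc)                   # 1 bit per weight
            self.residual = acc - scale * bits    # keep the quantization error for next time
            return bits.astype(np.int8), scale

        @staticmethod
        def decode(bits, scale):
            return scale * bits                   # receiver reconstructs the update

    # toy usage: the residual carries whatever the first sync couldn't express
    sync = OneBitSync(dim=4)
    bits, x = sync.encode(np.array([0.9, -0.1, 0.3, -0.7]))
    print(OneBitSync.decode(bits, x), sync.residual)

The residual is what makes the missing "stay the same" bit a non-issue: anything the sign + scale can't express this round gets sent in a later one.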
So, your garden-variety $0.5M desktop PC, then.
Cool, cool.
[1] https://viperatech.com/shop/nvidia-dgx-h100-p4387-system-640...
So they really are a 10x company.
The average house uses 571 kWh/month; this is 10.2 kW max * 24 * 30 = 7344 kWh.
This will cost you, in California, about $3,000 a month depending on your power plan :)
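Back-of-the-envelope, assuming roughly $0.40/kWh for a California residential plan (rates vary a lot by plan and tier):

    kwh_per_month = 10.2 * 24 * 30     # 10.2 kW max draw, running flat out
    cost = kwh_per_month * 0.40        # assuming ~$0.40/kWh, CA residential ballpark
    print(kwh_per_month, cost)         # 7344.0 kWh, ~$2,938/month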
> Decentralized training of INTELLECT-1 currently requires 8x H100 SXM5 GPUs.
me: and for that reason, I'm out
Also, they state that they will later add the ability for you to contribute your own compute, but how will they solve the problem of having to back-propagate to all of the remote nodes contributing to the project without making training egregiously slow?
Decentralized training would be when you can use consumer GPUs, but that's not likely to work with backpropagation directly; maybe with one of the backpropagation-approximating algorithms.
One hell of an uncited leap from "we're multiplying a lot of numbers" to "AGI", as if it is a given
So, mostly cost reduction mixed with some cloud, vendor diversity.
Can someone explain if it does reduce the model quality overall?
Allegedly not?
Ah, yes, Prime Intellect, the AGI that went foom and genocided the universe because it was commanded to preserve human civilization without regard for human values. A strong contender for the least evil hostile superintelligence in fiction. What a wonderful thing to name your AI startup after. What's next, creating the Torment Nexus?
(my position on the book as a whole is more complex, but... really? Really?)
That could have just been their private simulation. As far as I remember, it wouldn't even have outright lied to them, just let them believe they talked it into destroying itself.