But the technical aspects are pretty cool:
- Fault-tolerant training where nodes can be added and removed mid-run without interrupting the other nodes.
- Sending quantized gradients during the synchronization phase.
- (In the OpenDiLoCo article) Async synchronization.
They're also mentioning potential trustless systems where everyone can contribute compute, which would make this a truly decentralized open platform. Overall it'll be pretty interesting to see where this goes!
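If anyone wants a concrete picture of what that looks like, here's a rough toy sketch of the DiLoCo-style outer loop as I understand it from the articles (the worker names, step counts, and outer learning rate are all made up for illustration, not their actual code): each node takes a bunch of local steps, only a pseudo-gradient gets exchanged at sync time, and membership can change between rounds.

    # Toy sketch of a DiLoCo-style outer loop (illustrative, not Prime Intellect's code):
    # each worker takes H local steps, then only the pseudo-gradient (global - local)
    # is exchanged, so workers can join or drop out between sync rounds.
    import numpy as np

    DIM, H, OUTER_STEPS = 10, 50, 20
    rng = np.random.default_rng(0)
    global_w = rng.normal(size=DIM)          # shared "global" weights

    def local_training(w, steps, lr=0.01):
        """Stand-in for H local SGD steps on a worker's private data shard."""
        w = w.copy()
        for _ in range(steps):
            grad = 2 * w + rng.normal(scale=0.1, size=DIM)   # toy quadratic loss
            w -= lr * grad
        return w

    workers = ["a", "b", "c"]
    for t in range(OUTER_STEPS):
        # fault tolerance: membership can change between outer steps
        if t == 5:  workers.remove("b")      # a node drops out mid-run
        if t == 10: workers.append("d")      # a new node joins and pulls global_w

        # each current worker starts from the latest global weights
        pseudo_grads = []
        for _ in workers:
            local_w = local_training(global_w, H)
            pseudo_grads.append(global_w - local_w)   # pseudo-gradient to communicate

        # outer update: average the pseudo-gradients (could be quantized, Nesterov, etc.)
        global_w -= 0.7 * np.mean(pseudo_grads, axis=0)

    print("final weight norm:", np.linalg.norm(global_w))

The communication only happens once per outer step, which is what makes the low-bandwidth / node-churn story plausible at all.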
I did this 9 years ago, works pretty well. I don't understand why all ML isn't async and quantized like that now. This project quantizes to 1 bit per weight and it works so well I didn't even make it configurable.
Does this basically correspond to moving each weight either up or down by a fixed amount? I'm a bit surprised you don't at least need a "stay the same" bit, but I suppose it could balance out over multiple iterations.
Interesting that it works at all. Although, thinking on it, I could see it maybe even having a nice regularizing effect where every layer would end up with similar weight magnitudes (like projecting onto the local n-ball, as mentioned in a paper posted recently on HN).
The weight vectors themselves are regular floats. But the data exchanged between the machines is 1 bit per weight. Basically, you keep track of the changes to the weight vector which haven't yet been propagated to the other machines. You quantize this to 1 bit per weight (i.e. a sign bit) and send it together with a single scale factor X, accumulating the quantization error for the next sync iteration.
You choose X to be the RMS or some similar metric of the accumulated error.
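In case that's hard to picture, here's a minimal toy sketch of that error-feedback scheme (my reading of the description above, not the parent's actual code):

    # 1-bit sync with error feedback: weights stay full precision locally; only
    # sign(accumulated delta) plus one scale factor goes over the wire, and the
    # quantization error is carried into the next sync round.
    import numpy as np

    class OneBitSync:
        def __init__(self, dim):
            self.residual = np.zeros(dim)   # un-propagated change + past quantization error

        def encode(self, weight_delta):
            """Quantize the accumulated delta to a sign bit per weight + one RMS scale."""
            acc = self.residual + weight_delta
            scale = np.sqrt(np.mean(acc ** 2))    # X = RMS of the accumulated error
            bits = np.sign(acc)                   # 1 bit per weight
            self.residual = acc - scale * bits    # keep the quantization error for next time
            return bits.astype(np.int8), scale

        @staticmethod
        def decode(bits, scale):
            return scale * bits                   # receiver reconstructs the update

    # toy usage: the residual carries whatever the first sync couldn't express
    sync = OneBitSync(dim=4)
    bits, x = sync.encode(np.array([0.9, -0.1, 0.3, -0.7]))
    print(OneBitSync.decode(bits, x), sync.residual)

The residual is what makes the missing "stay the same" bit a non-issue: anything the sign + scale can't express this round gets sent in a later one.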
So, your garden-variety $0.5M desktop PC, then.
Cool, cool.
[1] https://viperatech.com/shop/nvidia-dgx-h100-p4387-system-640...
So they really are a 10x company.
The average house uses 571 kWh/month; this is 10.2 kW max * 24 * 30 = 7344 kWh.
This will cost you, in California, about $3,000 a month depending on your power plan :)
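Back-of-the-envelope, assuming roughly $0.40/kWh for a California residential plan (rates vary a lot by plan and tier):

    kwh_per_month = 10.2 * 24 * 30     # 10.2 kW max draw, running flat out
    cost = kwh_per_month * 0.40        # assuming ~$0.40/kWh, CA residential ballpark
    print(kwh_per_month, cost)         # 7344.0 kWh, ~$2,938/month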
> Decentralized training of INTELLECT-1 currently requires 8x H100 SXM5 GPUs.
me: and for that reason, I'm out
Also, they state that they will later add the ability for you to contribute your own compute, but how will they solve the problem of having to back-propagate to all of the remote nodes contributing to the project without making training egregiously slow?
Decentralized training would be when you can use consumer GPUs, but that's not likely to work with backpropagation directly; maybe with one of the backpropagation-approximating algorithms.
One hell of an uncited leap from "we're multiplying a lot of numbers" to "AGI", as if it is a given
So, mostly cost reduction mixed with some cloud, vendor diversity.
Can someone explain if it does reduce the model quality overall?
Allegedly not?
Ah, yes, Prime Intellect, the AGI that went foom and genocided the universe because it was commanded to preserve human civilization without regard for human values. A strong contender for the least evil hostile superintelligence in fiction. What a wonderful thing to name your AI startup after. What's next, creating the Torment Nexus?
(my position on the book as a whole is more complex, but... really? Really?)
That could have just been their private simulation. As far as I remember, it wouldn't even have outright lied to them, just let them believe they talked it into destroying itself.