We started with PrimeRL and implemented Kimi in it, verifying it against the Moonshot API. The initial distributed training method, FSDP, is not ideal for memory-bottlenecked MoEs, so we added support for Expert Parallel. This enabled faster training, but many optimizations remained. We discuss several in this post, and collectively, these efforts took us from training at 125 tokens/s to 6,660 tokens/s on a single 8xH200 node! Per token, training with our codebase is cheaper than anything on the market, including training APIs like Tinker.
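For context on why expert parallelism helps here: FSDP all-gathers a layer's full parameters on every rank during the forward and backward passes, so with a MoE each GPU briefly materializes every expert even though only a few are active per token. Expert parallelism instead pins a fixed subset of experts to each rank and moves only token activations over the network. Below is a minimal, hypothetical sketch of that pattern in PyTorch (top-1 routing, a toy `ExpertParallelMoE` module) to illustrate the idea; it is not our actual implementation:

```python
# Illustrative expert-parallel MoE layer (hypothetical, NOT PrimeRL's code).
# Launch with e.g.: torchrun --nproc_per_node=8 ep_sketch.py  (NCCL backend)
import torch
import torch.distributed as dist
import torch.nn as nn


class ExpertParallelMoE(nn.Module):
    """Each rank permanently owns num_experts // world_size experts.
    Only token activations cross the network (via all-to-all), instead of
    gathering full expert weights on every rank the way FSDP does."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.world = dist.get_world_size()
        assert num_experts % self.world == 0
        self.experts_per_rank = num_experts // self.world
        self.local_experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(self.experts_per_rank)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Top-1 routing for brevity; real MoEs use top-k.
        expert_idx = self.gate(x).argmax(dim=-1)
        dest_rank = expert_idx // self.experts_per_rank

        # Sort tokens so each destination rank's slice is contiguous.
        order = torch.argsort(dest_rank)
        x_send, idx_send = x[order], expert_idx[order]
        send_counts = torch.bincount(dest_rank, minlength=self.world)

        # 1) Exchange per-rank token counts.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)
        in_splits, out_splits = send_counts.tolist(), recv_counts.tolist()

        # 2) Exchange activations and their expert assignments.
        x_recv = x_send.new_empty(sum(out_splits), x.shape[-1])
        dist.all_to_all_single(x_recv, x_send, out_splits, in_splits)
        idx_recv = idx_send.new_empty(sum(out_splits))
        dist.all_to_all_single(idx_recv, idx_send, out_splits, in_splits)

        # 3) Run each received token through its locally-owned expert.
        local_idx = idx_recv % self.experts_per_rank
        y_recv = torch.empty_like(x_recv)
        for i, expert in enumerate(self.local_experts):
            mask = local_idx == i
            if mask.any():
                y_recv[mask] = expert(x_recv[mask])

        # 4) Send results back and restore the original token order.
        y_send = torch.empty_like(x_send)
        dist.all_to_all_single(y_send, y_recv, in_splits, out_splits)
        out = torch.empty_like(x)
        out[order] = y_send
        return out


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    torch.manual_seed(0)  # keep the gate identical across ranks
    moe = ExpertParallelMoE(dim=128, num_experts=8 * dist.get_world_size()).cuda()
    y = moe(torch.randn(1024, 128, device="cuda"))
    print(f"rank {dist.get_rank()}: {y.shape}")
    dist.destroy_process_group()
```

The two all-to-alls (tokens out, results back) are the price of never replicating expert weights, which is usually a good trade when expert parameters dominate memory, as they do in large MoEs.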
We plan to open-source the code in the coming week or two, pending safety evals!
We're doing some safety work before the release. Specifically, we're checking for bio uplift: does open-sourcing 50x-faster training code for (relatively) smaller GPU setups seriously democratize dangerous bio capabilities?
We expect the answer is no, but it doesn't hurt to check. Once that's done, we'll drop the repo.
There's a lot of value to be unlocked by people using language models for their own purposes, and our work here hopefully moves the needle toward making that more accessible to more people.
We are very excited by what we can all build :D
- Tim from WSL