We started with PrimeRL and implemented Kimi in it, verifying it against the Moonshot API. The initial distributed training method, FSDP, is not ideal for memory-bottlenecked MoEs, so we added support for Expert Parallel. This enabled faster training, but many optimizations remained. We discuss several in this post, and collectively, these efforts took us from training at 125 tokens/s to 6,660 tokens/s on a single 8xH200 node! Per token, training with our codebase is cheaper than anything on the market, including training APIs like Tinker.
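For context on why expert parallelism helps here: FSDP all-gathers a layer's full parameters on every rank during the forward and backward passes, so with a MoE each GPU briefly materializes every expert even though only a few are active per token. Expert parallelism instead pins a fixed subset of experts to each rank and moves only token activations over the network. Below is a minimal, hypothetical sketch of that pattern in PyTorch (top-1 routing, a toy `ExpertParallelMoE` module) to illustrate the idea; it is not our actual implementation:

```python
# Illustrative expert-parallel MoE layer (hypothetical, NOT PrimeRL's code).
# Launch with e.g.: torchrun --nproc_per_node=8 ep_sketch.py  (NCCL backend)
import torch
import torch.distributed as dist
import torch.nn as nn


class ExpertParallelMoE(nn.Module):
    """Each rank permanently owns num_experts // world_size experts.
    Only token activations cross the network (via all-to-all), instead of
    gathering full expert weights on every rank the way FSDP does."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.world = dist.get_world_size()
        assert num_experts % self.world == 0
        self.experts_per_rank = num_experts // self.world
        self.local_experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(self.experts_per_rank)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Top-1 routing for brevity; real MoEs use top-k.
        expert_idx = self.gate(x).argmax(dim=-1)
        dest_rank = expert_idx // self.experts_per_rank

        # Sort tokens so each destination rank's slice is contiguous.
        order = torch.argsort(dest_rank)
        x_send, idx_send = x[order], expert_idx[order]
        send_counts = torch.bincount(dest_rank, minlength=self.world)

        # 1) Exchange per-rank token counts.
        recv_counts = torch.empty_like(send_counts)
        dist.all_to_all_single(recv_counts, send_counts)
        in_splits, out_splits = send_counts.tolist(), recv_counts.tolist()

        # 2) Exchange activations and their expert assignments.
        x_recv = x_send.new_empty(sum(out_splits), x.shape[-1])
        dist.all_to_all_single(x_recv, x_send, out_splits, in_splits)
        idx_recv = idx_send.new_empty(sum(out_splits))
        dist.all_to_all_single(idx_recv, idx_send, out_splits, in_splits)

        # 3) Run each received token through its locally-owned expert.
        local_idx = idx_recv % self.experts_per_rank
        y_recv = torch.empty_like(x_recv)
        for i, expert in enumerate(self.local_experts):
            mask = local_idx == i
            if mask.any():
                y_recv[mask] = expert(x_recv[mask])

        # 4) Send results back and restore the original token order.
        y_send = torch.empty_like(x_send)
        dist.all_to_all_single(y_send, y_recv, in_splits, out_splits)
        out = torch.empty_like(x)
        out[order] = y_send
        return out


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    torch.manual_seed(0)  # keep the gate identical across ranks
    moe = ExpertParallelMoE(dim=128, num_experts=8 * dist.get_world_size()).cuda()
    y = moe(torch.randn(1024, 128, device="cuda"))
    print(f"rank {dist.get_rank()}: {y.shape}")
    dist.destroy_process_group()
```

The two all-to-alls (tokens out, results back) are the price of never replicating expert weights, which is usually a good trade when expert parameters dominate memory, as they do in large MoEs.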
We plan to open-source the code in the coming week or two, pending safety evals!
We're doing some safety work before the release. Specifically, we're checking for bio uplift: does open-sourcing 50x-faster training code for (relatively) smaller GPU setups seriously democratize dangerous bio capabilities?
We expect the answer is no, but it doesn't hurt to check. Once that's done, we'll drop the repo.
There's a lot of value to be unlocked by people using language models for their own purposes, and our work here hopefully moves the needle toward making that more accessible to more people.
We are very excited by what we can all build :D
- Tim from WSL