It seems like an interesting idea. You could apply a small regularisation penalty to the number of thinking tokens the model uses (rough sketch of the objective below). You might have to break the pretraining data into meaningfully partitioned chunks. I'd be curious whether, at large enough scale, models learn to use this thinking budget to improve their next-token prediction, and what that would look like.
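Off the top of my head, the objective might look something like this. A minimal PyTorch sketch, not a real implementation: `loss_with_thinking_penalty`, `think_mask`, and `lam` are all made-up names, and the "penalty" here is just a count of thinking positions, assuming the model interleaves special thinking spans with the data and we can mask them out of the prediction loss.

```python
import torch
import torch.nn.functional as F

def loss_with_thinking_penalty(logits, targets, think_mask, lam=1e-3):
    """Next-token cross-entropy plus a small penalty on thinking tokens.

    logits:     (batch, seq, vocab) model outputs
    targets:    (batch, seq) next-token targets; thinking positions set to -100
    think_mask: (batch, seq) bool, True where the model emitted thinking tokens
    lam:        regularisation strength (illustrative value)
    """
    # Cross-entropy over real data tokens only; thinking positions are
    # excluded via the ignore_index convention.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    # Penalise the average number of thinking tokens per sequence.
    think_cost = think_mask.float().sum(dim=-1).mean()
    return ce + lam * think_cost

if __name__ == "__main__":
    batch, seq, vocab = 2, 16, 100
    logits = torch.randn(batch, seq, vocab)
    targets = torch.randint(0, vocab, (batch, seq))
    think_mask = torch.zeros(batch, seq, dtype=torch.bool)
    think_mask[:, :4] = True    # pretend the first 4 positions are thinking tokens
    targets[think_mask] = -100  # exclude them from the prediction loss
    print(loss_with_thinking_penalty(logits, targets, think_mask))
```

With a hard count like this the penalty isn't differentiable in the number of tokens, so in practice you'd presumably need an RL-style objective or a soft/expected token count instead, but the trade-off it expresses is the same: pay a small cost per thinking token, recoup it in prediction loss.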
(Flagging this because I almost missed it.) In the comments on the post, someone linked to this paper: https://arxiv.org/html/2408.15240v1