That said, I’m unclear how much this helps in practice; we don’t usually sift through, say, 32 responses from our 2B-parameter models. I guess if you instrumented parallel reasoning processes in batch this might be helpful. Perhaps that’s what o1-pro is doing in the background, actually.
Anyway, this one seems to me like it might make its way onto the “good idea” list when RL is available in the training pipeline.
Cerebras has used optillm for optimising inference with techniques like CePO and LongCePO.
Take N gates, normalize them, and represent them as a point on the surface of a hypersphere. Quantize the hypersphere as coarsely as you can while still getting the precision you want. Want fewer bits per weight, but your quantization is getting too coarse? Increase N.
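As a rough illustration (my own sketch, not from the thread; the block size, codebook size, and the use of a random codebook are all placeholder choices), here's what that gain-shape style of spherical VQ looks like in NumPy: each block of N values is split into a scalar gain and a unit direction, and the direction is snapped to its nearest codebook entry. The cost is roughly (codebook bits + gain bits) / N per weight, which is why growing N buys precision back.

```python
import numpy as np

def spherical_vq(weights, block=8, codebook_bits=10, seed=0):
    """Toy gain-shape VQ: quantize each block's direction to the nearest
    unit vector in a random codebook; keep the gain (norm) in full precision."""
    rng = np.random.default_rng(seed)
    # Random codebook of unit vectors on the (block-1)-sphere.
    codebook = rng.standard_normal((1 << codebook_bits, block))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    w = weights.reshape(-1, block)
    gains = np.linalg.norm(w, axis=1, keepdims=True)   # one scalar per block
    dirs = w / np.maximum(gains, 1e-12)                # points on the hypersphere
    idx = np.argmax(dirs @ codebook.T, axis=1)         # nearest codeword = max dot product on the sphere
    return gains, idx, codebook

def dequantize(gains, idx, codebook):
    return (gains * codebook[idx]).reshape(-1)

w = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
gains, idx, cb = spherical_vq(w, block=8, codebook_bits=10)
w_hat = dequantize(gains, idx, cb)
# roughly (10 + 16) / 8 ≈ 3.3 bits/weight if the gains are stored as fp16
print("rms error:", np.sqrt(np.mean((w - w_hat) ** 2)))
```

A flat random codebook is a crude stand-in; the point is just the gain/shape split and the rate arithmetic.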
Fast algebraic codes exist to convert positions on hypersphere-ish surfaces to indices and vice versa.
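One concrete family of such codes (my example, not named above) is pyramid vector quantization, the scheme the Opus/CELT codec uses for band shapes: the codewords are the integer vectors with sum of |y_i| = K, a sphere-ish shell once normalized, and a combinatorial recursion ranks and unranks them without ever storing a codebook. A toy recursive version (real implementations are iterative and table-driven):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n, k):
    """Number of integer vectors of length n with sum of |y_i| == k."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    # first coordinate is 0, or +/-m for m = 1..k
    return count(n - 1, k) + 2 * sum(count(n - 1, k - m) for m in range(1, k + 1))

def rank(y, k):
    """Map a vector y (with sum of |y_i| == k) to an index in [0, count(len(y), k))."""
    if not y:
        return 0
    v, rest = y[0], y[1:]
    idx = 0
    if v != 0:
        idx += count(len(rest), k)                 # all vectors whose first coord is 0
        for m in range(1, abs(v)):
            idx += 2 * count(len(rest), k - m)     # first coord = +/-m for m < |v|
        if v < 0:
            idx += count(len(rest), k - abs(v))    # +|v| sorts before -|v|
    return idx + rank(rest, k - abs(v))

def unrank(i, n, k):
    """Inverse of rank: recover the vector from its index."""
    if n == 0:
        return ()
    # try first coordinate = 0, +1, -1, +2, -2, ...
    for v in [0] + [s * m for m in range(1, k + 1) for s in (1, -1)]:
        block = count(n - 1, k - abs(v))
        if i < block:
            return (v,) + unrank(i, n - 1, k - abs(v))
        i -= block
    raise ValueError("index out of range")

# round-trip check over every codeword for n=3, k=4
n, k = 3, 4
assert all(rank(unrank(i, n, k), k) == i for i in range(count(n, k)))
print(count(n, k), "codewords, all indices round-trip")
```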
Perhaps spherical VQ isn't ideal-- though I suspect it is, since groups of weights often act as rotations naturally-- but if not, some other geometry should work well.