It's more of a software problem, and the next breakthrough will come from clever algorithms.
You have just seen TurboQuant create promising efficiency gains, and there are many other papers being released that propose further software-side optimisations to make it possible to run 100B+ LLMs on device.
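To see why quantization is the main lever here, the weight footprint scales linearly with bit width. A back-of-envelope calculation (weights only; ignores KV cache, activations, and quantization metadata):

```python
# Rough weight memory for a 100B-parameter model at different bit widths
params = 100e9
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: {params * bits / 8 / 1e9:,.0f} GB")
# 16-bit: 200 GB, 8-bit: 100 GB, 4-bit: 50 GB, 2-bit: 25 GB
```

At 2-4 bits the weights start fitting in the unified memory of a high-end consumer machine, which is the regime these papers are chasing.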
I don't know if I can agree. The hardware side is extremely sub-optimal on raster-focused GPU architectures like Apple Silicon. If I had to bet, I'd say hardware will improve far more than software over the next 10 years as more vendors adopt GPGPU characteristics.
> You have just seen TurboQuant create promising efficiency gains
TurboQuant looks like a vibe-laundered implementation of EDEN quantization: https://openreview.net/forum?id=tO3ASKZlok
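For context, the shared idea in this family of methods is random-rotation scalar quantization: rotate the vector so its coordinates look near-Gaussian, quantize each coordinate with a cheap scalar quantizer, then invert the rotation on dequantization. A minimal numpy sketch of that general idea (my own names and simplifications, not either paper's actual algorithm):

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(x, bits):
    """Uniform scalar quantizer over the empirical range of x."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    return np.round((x - lo) / scale).astype(np.uint8), lo, scale

rng = np.random.default_rng(0)
n = 1024                                   # power of two for the Hadamard transform
x = rng.standard_normal(n)                 # stand-in for a weight/activation vector
signs = rng.choice([-1.0, 1.0], size=n)    # random diagonal sign flip
H = hadamard(n)

rotated = H @ (signs * x)                  # coordinates become near-Gaussian
q, lo, scale = quantize(rotated, bits=4)   # 4-bit codes on the rotated vector
recovered = signs * (H.T @ (q * scale + lo))  # dequantize, invert the rotation

print("relative L2 error:", np.linalg.norm(x - recovered) / np.linalg.norm(x))
```

The differences between such papers tend to live in the quantizer details (dithering, unbiasedness, rate allocation), not in this skeleton.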