1 point by onurkanbkrc 6 hours ago | 1 comment
  • storystarling 5 hours ago
    The point about attention not being a bottleneck holds for training compute, but for inference the memory bandwidth required to stream the KV cache is the main constraint: each decode step has to re-read the entire cache for every sequence in the batch. That seems to be what actually drives the unit economics and API pricing for serving these models.
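    A rough back-of-the-envelope sketch in Python (all model numbers are assumptions for illustration, loosely Llama-2-70B-style with grouped-query attention: 80 layers, 8 KV heads, head dim 128, fp16; the 50 tok/s decode rate is likewise assumed) shows why the cache read dominates decode bandwidth:

    ```python
    # Back-of-the-envelope estimate of KV-cache bandwidth during decoding.
    # All parameters are assumptions for illustration, not from the thread.

    num_layers = 80      # transformer layers (assumed)
    num_kv_heads = 8     # KV heads under grouped-query attention (assumed)
    head_dim = 128       # dimension per head (assumed)
    dtype_bytes = 2      # fp16/bf16

    # One K vector and one V vector per layer, per cached token.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    context_len = 4096   # tokens already in the cache (assumed)
    cache_bytes = kv_bytes_per_token * context_len

    # Each decode step streams the whole cache from HBM once.
    tokens_per_sec = 50  # assumed per-sequence decode speed
    bandwidth_per_seq = cache_bytes * tokens_per_sec

    print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
    print(f"Cache at {context_len} tokens: {cache_bytes / 2**30:.2f} GiB")
    print(f"Decode bandwidth: {bandwidth_per_seq / 2**30:.1f} GiB/s per sequence")
    ```

    Multiply that ~60 GiB/s by a serving batch of a few dozen sequences and it approaches the HBM bandwidth of a single accelerator, while the corresponding FLOPs stay far below compute peak; that is the sense in which bandwidth rather than compute sets the serving cost.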