1 point by onurkanbkrc 6 hours ago | 1 comment
  • storystarling 5 hours ago
    The point about attention not being a bottleneck holds for training compute, but for inference the memory bandwidth required to stream the KV cache is the main constraint: each decode step has to re-read the entire cache for every sequence in the batch. That seems to be what actually drives the unit economics and API pricing for serving these models.
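    A rough back-of-the-envelope sketch in Python (all model numbers are assumptions for illustration, loosely Llama-2-70B-style with grouped-query attention: 80 layers, 8 KV heads, head dim 128, fp16; the 50 tok/s decode rate is likewise assumed) shows why the cache read dominates decode bandwidth:

    ```python
    # Back-of-the-envelope estimate of KV-cache bandwidth during decoding.
    # All parameters are assumptions for illustration, not from the thread.

    num_layers = 80      # transformer layers (assumed)
    num_kv_heads = 8     # KV heads under grouped-query attention (assumed)
    head_dim = 128       # dimension per head (assumed)
    dtype_bytes = 2      # fp16/bf16

    # One K vector and one V vector per layer, per cached token.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

    context_len = 4096   # tokens already in the cache (assumed)
    cache_bytes = kv_bytes_per_token * context_len

    # Each decode step streams the whole cache from HBM once.
    tokens_per_sec = 50  # assumed per-sequence decode speed
    bandwidth_per_seq = cache_bytes * tokens_per_sec

    print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
    print(f"Cache at {context_len} tokens: {cache_bytes / 2**30:.2f} GiB")
    print(f"Decode bandwidth: {bandwidth_per_seq / 2**30:.1f} GiB/s per sequence")
    ```

    Multiply that ~60 GiB/s by a serving batch of a few dozen sequences and it approaches the HBM bandwidth of a single accelerator, while the corresponding FLOPs stay far below compute peak; that is the sense in which bandwidth rather than compute sets the serving cost.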