Hacker News
Theoretical Bottlenecks for Scaling LLM Inference to Get Higher Token per Second (twitter.com)
1 point by arjmandi 2 hours ago | 1 comment
arjmandi 2 hours ago
LLM inference performance is governed by three competing bottlenecks: compute time, memory bandwidth, and communication latency. In this post, we cover the conditions that allow full hardware utilization and the key constraints that limit it.
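
To make the three bottlenecks concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the linked post. It assumes a simplified model in which one decode step costs about 2 FLOPs per parameter per sequence in the batch, reads every weight from memory once per step, pays a fixed communication latency, and overlaps all three perfectly, so the step time is the largest of the three. The names `Hardware`, `decode_step_time`, and `tokens_per_second`, and every number in the usage note, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Hypothetical accelerator description; all values are illustrative."""
    flops_per_s: float      # peak compute throughput, FLOP/s
    mem_bytes_per_s: float  # memory bandwidth, bytes/s
    link_latency_s: float   # communication latency paid per decode step, s


def decode_step_time(params: float, bytes_per_param: float,
                     batch: int, hw: Hardware) -> tuple[float, str]:
    """Return (seconds per decode step, name of the limiting bottleneck).

    Assumptions (simplifications, not measurements):
      - compute: ~2 FLOPs per parameter per sequence in the batch
      - memory: every weight is read once per step, independent of batch
        (KV-cache traffic is ignored)
      - communication: a fixed latency per step
      - the three overlap perfectly, so the slowest one sets the step time
    """
    times = {
        "compute": 2.0 * params * batch / hw.flops_per_s,
        "memory": params * bytes_per_param / hw.mem_bytes_per_s,
        "communication": hw.link_latency_s,
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


def tokens_per_second(params: float, bytes_per_param: float,
                      batch: int, hw: Hardware) -> float:
    """Aggregate throughput: one token per sequence per decode step."""
    step_s, _ = decode_step_time(params, bytes_per_param, batch, hw)
    return batch / step_s
```

With made-up numbers such as `Hardware(flops_per_s=1e15, mem_bytes_per_s=1e12, link_latency_s=1e-5)` and a 10-billion-parameter model at 2 bytes per parameter, this model reports a memory-bound step at batch size 1 and a compute-bound step at batch size 2000. That is the usual intuition for why larger batches help utilization, but real systems deviate from this idealized picture.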