Fair point on HBM, but in a production system the orchestration layer is often the actual bottleneck. I've found that keeping the GPU fed requires a level of concurrency and stability that's hard to get right in Python. Rust is useful here not for the inference itself, but for ensuring the request pipeline doesn't choke while the GPU is waiting for data.
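
To make that concrete, here's a minimal sketch of the shape I mean, assuming a tokio runtime; `run_inference` and the capacity numbers are hypothetical stand-ins for the real GPU call and tuning. The key piece is the bounded channel between the request handlers and a single GPU-feeding task: when the GPU falls behind, submitters block on `send` instead of growing an unbounded queue.

```rust
use tokio::sync::{mpsc, oneshot};
use tokio::time::{sleep, Duration};

struct Request {
    prompt: String,
    respond_to: oneshot::Sender<String>,
}

// Stand-in for the real GPU-bound inference call (hypothetical).
async fn run_inference(batch: &[Request]) -> Vec<String> {
    sleep(Duration::from_millis(50)).await; // simulate GPU latency
    batch.iter().map(|r| format!("echo: {}", r.prompt)).collect()
}

#[tokio::main]
async fn main() {
    // Bounded channel: the only queue between the web layer and the
    // GPU worker, so memory use stays fixed under load.
    let (tx, mut rx) = mpsc::channel::<Request>(256);

    // Single GPU-feeding task: drain whatever is queued into a batch.
    let worker = tokio::spawn(async move {
        let mut batch = Vec::with_capacity(32);
        while rx.recv_many(&mut batch, 32).await > 0 {
            let outputs = run_inference(&batch).await;
            for (req, out) in batch.drain(..).zip(outputs) {
                let _ = req.respond_to.send(out);
            }
        }
    });

    // Simulated concurrent submitters (in practice, HTTP handlers).
    let mut handles = Vec::new();
    for i in 0..8 {
        let tx = tx.clone();
        handles.push(tokio::spawn(async move {
            let (resp_tx, resp_rx) = oneshot::channel();
            // `send` awaits when the channel is full: backpressure.
            tx.send(Request { prompt: format!("req {i}"), respond_to: resp_tx })
                .await
                .expect("worker alive");
            println!("{}", resp_rx.await.expect("response"));
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
    drop(tx); // close the channel so the worker exits cleanly
    worker.await.unwrap();
}
```

The point isn't that Python can't do this, but that in Rust the backpressure and shutdown behavior is explicit and stays predictable under load, which is exactly the "pipeline doesn't choke" property I care about.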
Have you tried Mojo for this? It seems like it should work just as well, though I don't have experience with it.