1 point by fm1320 4 hours ago | 2 comments
  • storystarling 3 hours ago
    Fair point on HBM, but in a production system the orchestration layer is often the actual bottleneck. I've found that keeping the GPU fed requires a level of concurrency and stability that is hard to tune in Python. Rust is useful here not for the inference itself, but for ensuring the request pipeline doesn't choke while the GPU is waiting for data.
    • fm1320 3 hours ago
      Have you tried Mojo for this? It seems like it should work as well, though I don't have hands-on experience with it.