Fair point on HBM, but in a production system the orchestration layer is often the actual bottleneck. I've found that keeping the GPU fed requires a level of concurrency and stability that's hard to get right in Python. Rust is useful here not for the inference itself, but for ensuring the request pipeline doesn't choke while the GPU is waiting for data.
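
To make that concrete, here's a minimal sketch of the shape I mean, assuming a tokio runtime; `run_inference` and the capacity numbers are hypothetical stand-ins for the real GPU call and tuning. The key piece is the bounded channel between the request handlers and a single GPU-feeding task: when the GPU falls behind, submitters block on `send` instead of growing an unbounded queue.

```rust
use tokio::sync::{mpsc, oneshot};
use tokio::time::{sleep, Duration};

struct Request {
    prompt: String,
    respond_to: oneshot::Sender<String>,
}

// Stand-in for the real GPU-bound inference call (hypothetical).
async fn run_inference(batch: &[Request]) -> Vec<String> {
    sleep(Duration::from_millis(50)).await; // simulate GPU latency
    batch.iter().map(|r| format!("echo: {}", r.prompt)).collect()
}

#[tokio::main]
async fn main() {
    // Bounded channel: the only queue between the web layer and the
    // GPU worker, so memory use stays fixed under load.
    let (tx, mut rx) = mpsc::channel::<Request>(256);

    // Single GPU-feeding task: drain whatever is queued into a batch.
    let worker = tokio::spawn(async move {
        let mut batch = Vec::with_capacity(32);
        while rx.recv_many(&mut batch, 32).await > 0 {
            let outputs = run_inference(&batch).await;
            for (req, out) in batch.drain(..).zip(outputs) {
                let _ = req.respond_to.send(out);
            }
        }
    });

    // Simulated concurrent submitters (in practice, HTTP handlers).
    let mut handles = Vec::new();
    for i in 0..8 {
        let tx = tx.clone();
        handles.push(tokio::spawn(async move {
            let (resp_tx, resp_rx) = oneshot::channel();
            // `send` awaits when the channel is full: backpressure.
            tx.send(Request { prompt: format!("req {i}"), respond_to: resp_tx })
                .await
                .expect("worker alive");
            println!("{}", resp_rx.await.expect("response"));
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
    drop(tx); // close the channel so the worker exits cleanly
    worker.await.unwrap();
}
```

The point isn't that Python can't do this, but that in Rust the backpressure and shutdown behavior is explicit and stays predictable under load, which is exactly the "pipeline doesn't choke" property I care about.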
Have you tried Mojo for this? It seems like it should work just as well, though I don't have experience with it.