Wisdom from CPU land translates well to GPUs. Static scheduling, prefetching, 3-stage double-buffering, pre-allocation, and memory ordering in a custom CUDA kernel help it outperform NVIDIA NCCL. Experimental integration in vllm.rs shows ~20% prefill and ~10% decode latency improvements (TTFT and TPOT, respectively).
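
To make a few of these ideas concrete, here is a minimal CPU-side sketch in C++. It is not the CUDA kernel described above. It assumes that "3-stage" means a ring of three pre-allocated staging slots, so that fetching chunk `s`, reducing chunk `s-1`, and storing chunk `s-2` never touch the same slot within one step. All names (`pipelined_sum`, `kStages`, `kChunk`) are hypothetical. Memory ordering is not modeled, because the three stages run sequentially here; on a GPU they would overlap and need explicit fences or barriers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: a CPU model of a statically scheduled,
// three-slot staging pipeline. Not the kernel from the post.
constexpr std::size_t kStages = 3;  // assumed ring depth
constexpr std::size_t kChunk = 4;   // elements per chunk (arbitrary)

// Element-wise sum of `local` and `remote`, streamed chunk by chunk
// through pre-allocated staging slots. Both inputs must have equal length.
std::vector<float> pipelined_sum(const std::vector<float>& local,
                                 const std::vector<float>& remote) {
    assert(local.size() == remote.size());
    const std::size_t n = local.size();
    const std::size_t chunks = (n + kChunk - 1) / kChunk;
    std::vector<float> out(n);

    // Pre-allocation: every staging buffer exists before the loop starts.
    std::array<std::array<float, kChunk>, kStages> in_a{}, in_b{}, sum{};

    // Static schedule: step s stores chunk s-2, reduces chunk s-1 and
    // fetches chunk s. Each lands in a different slot (index mod 3).
    for (std::size_t s = 0; s < chunks + 2; ++s) {
        if (s >= 2) {  // store stage
            const std::size_t c = s - 2, slot = c % kStages;
            for (std::size_t i = 0; i < kChunk; ++i) {
                const std::size_t idx = c * kChunk + i;
                if (idx < n) out[idx] = sum[slot][i];
            }
        }
        if (s >= 1 && s - 1 < chunks) {  // reduce stage
            const std::size_t slot = (s - 1) % kStages;
            for (std::size_t i = 0; i < kChunk; ++i)
                sum[slot][i] = in_a[slot][i] + in_b[slot][i];
        }
        if (s < chunks) {  // fetch (prefetch) stage
            const std::size_t slot = s % kStages;
            for (std::size_t i = 0; i < kChunk; ++i) {
                const std::size_t idx = s * kChunk + i;
                in_a[slot][i] = idx < n ? local[idx] : 0.0f;
                in_b[slot][i] = idx < n ? remote[idx] : 0.0f;
            }
        }
    }
    return out;
}
```

Calling `pipelined_sum(a, b)` on two equal-length vectors returns their element-wise sum. The point of the sketch is the loop structure: the stage-to-slot mapping is fixed ahead of time, nothing is allocated inside the loop, and the data for the next chunk is already being fetched while the current one is reduced.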