The interesting part was the quant format choice. NVFP4 is Blackwell's native 4-bit format and theoretically the fastest path, but MoE support for Gemma 4 specifically was blocked on an unmerged vLLM PR (#39045): linear layers loaded, expert weights didn't. Falling back to nightly didn't help either, because that day's nightly was broken by an unconditional pandas import landed in the AITER code path without the image's deps being updated.

I ended up on AWQ + Marlin kernels, a combination that has been stable in vLLM for over a year. For single-user, memory-bandwidth-bound decode, the gap to NVFP4 is smaller than you'd expect: both hit the same 4x weight compression, and AWQ dequantizes to FP16 in-register rather than using the FP4 tensor cores, so the bytes streamed from HBM per decoded token are essentially the same. I'm getting ~196 tok/s; my estimate is NVFP4 would have landed around 220-240 tok/s had it worked.
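To make the "bandwidth-bound, so the format matters less than the compression" argument concrete, here's a back-of-envelope sketch. All the numbers in it are illustrative placeholders I picked, not Gemma's real active-parameter count or any specific card's bandwidth:

```python
def decode_tokens_per_s(active_params_b: float, bits_per_weight: float,
                        bandwidth_gb_s: float, efficiency: float = 0.7) -> float:
    """Estimate decode throughput when each token must stream every active
    weight from HBM once. `efficiency` is a fudge factor for achievable
    bandwidth; all inputs are assumptions, not measurements."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return efficiency * bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical MoE with 30B active params, 4-bit weights, ~2 TB/s of HBM:
print(round(decode_tokens_per_s(30, 4, 2000), 1))
```

The point the sketch makes: the throughput ceiling depends only on bytes moved per token, which is identical for NVFP4 and AWQ at 4 bits; the remaining gap comes from dequant overhead and kernel quality, not the storage format.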
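For reference, a minimal sketch of what the serving side looks like for an AWQ checkpoint in vLLM. The model ID and limits here are placeholders, not my actual config; `quantization="awq_marlin"` is the real vLLM knob that pins the Marlin kernels rather than letting autodetection pick:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-awq",          # placeholder HF repo id, not the real checkpoint
    quantization="awq_marlin",      # force AWQ + Marlin kernels explicitly
    max_model_len=8192,             # illustrative context limit
    gpu_memory_utilization=0.90,    # illustrative; tune to your card
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

This is a config fragment, not a benchmark harness; the RunPod Serverless side wraps essentially the same arguments.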
Happy to dig into the vLLM config, the RunPod Serverless side, or the NVFP4 vs AWQ tradeoff in more depth.