Controls ruled out: KV quantization (fp16 KV also regresses), output length (unchanged from 300 to 1000 tokens), and draft-model choice (Qwen3:0.6B has vocab size 151936 and fails silently; Qwen3.5-0.8B matches the 248320 vocab, loads correctly, and still loses).
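The silent-failure mode on the vocab mismatch is avoidable with an explicit guard. A minimal sketch, using the vocab sizes from above; the function name and its placement are mine, not llama.cpp's API:

```python
def check_draft_compat(target_vocab: int, draft_vocab: int) -> None:
    """Refuse a draft model whose tokenizer differs from the target's.

    Speculative decoding compares token IDs between draft and target,
    so mismatched vocabularies invalidate verification rather than
    merely degrading it.
    """
    if target_vocab != draft_vocab:
        raise ValueError(
            f"draft vocab {draft_vocab} != target vocab {target_vocab}; "
            "speculative decoding requires identical tokenizers"
        )

check_draft_compat(248320, 248320)      # Qwen3.5-0.8B draft: accepted
# check_draft_compat(248320, 151936)    # Qwen3:0.6B draft: raises ValueError
```

Turning the mismatch into a hard error would have surfaced the Qwen3:0.6B case immediately instead of letting it fail silently.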
The pattern matches MoESD (arXiv 2505.19645) and Utility-Driven SD for MoE (arXiv 2506.20675). A3B has 3B active parameters out of ~35B total; with an expert-routing sparsity of 0.031 (active experts as a fraction of total), the expert-saturation batch-size threshold is roughly 94 tokens. Draft lengths K = 3–32 are well below that, so every drafted token pulls in a fresh expert slice and verification pays for the union of all activated experts. Even 100% acceptance cannot rescue this.
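The ~94-token figure is reproducible with a simple coverage estimate: if each token independently routes to a fraction s of experts, the expected fraction of distinct experts touched by a batch of B tokens is 1 - (1 - s)^B. A sketch, where the 8-of-256 routing and the 95% coverage criterion are my assumptions for illustration, not values taken from the papers:

```python
import math

def saturation_batch(sparsity: float, coverage: float = 0.95) -> int:
    """Batch size at which the expected fraction of distinct experts
    activated reaches `coverage`, modeling per-token routing as i.i.d.:
    1 - (1 - s)^B >= coverage  =>  B = ln(1 - coverage) / ln(1 - s).
    """
    return round(math.log(1 - coverage) / math.log(1 - sparsity))

# Assumed routing: 8 active experts out of 256 => sparsity 0.03125
print(saturation_batch(8 / 256))  # -> 94
```

Below this threshold each additional drafted token activates a nearly disjoint expert slice, so verification FLOP and weight traffic grow almost linearly in K; above it the expert set is already saturated and drafted tokens amortize.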
srogmann's own Qwen3.5-122B-A10B benchmark in PR #20075 shows a +15–45% speedup, consistent with A10B sitting above the saturation threshold. So the PR works as intended on A10B and larger; the regression is specific to the small-active-parameter MoE class.
Raw per-request JSON, 3 matplotlib plots, the aggregated CSV, BENCHMARK_ENV.md (driver, CUDA, commit, model SHA256), and the exact run_*_matrix.sh scripts are in the repo. Happy to accept replications from other Ampere cards.