Controls ruled out: KV quantization (fp16 KV also regresses), output length (unchanged from 300 to 1000 tokens), and draft-model choice (Qwen3:0.6B has vocab size 151936 and fails silently; Qwen3.5-0.8B matches the 248320 vocab, loads correctly, and still loses).
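The silent-failure mode on the vocab mismatch is avoidable with an explicit guard. A minimal sketch, using the vocab sizes from above; the function name and its placement are mine, not llama.cpp's API:

```python
def check_draft_compat(target_vocab: int, draft_vocab: int) -> None:
    """Refuse a draft model whose tokenizer differs from the target's.

    Speculative decoding compares token IDs between draft and target,
    so mismatched vocabularies invalidate verification rather than
    merely degrading it.
    """
    if target_vocab != draft_vocab:
        raise ValueError(
            f"draft vocab {draft_vocab} != target vocab {target_vocab}; "
            "speculative decoding requires identical tokenizers"
        )

check_draft_compat(248320, 248320)      # Qwen3.5-0.8B draft: accepted
# check_draft_compat(248320, 151936)    # Qwen3:0.6B draft: raises ValueError
```

Turning the mismatch into a hard error would have surfaced the Qwen3:0.6B case immediately instead of letting it fail silently.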
The pattern matches MoESD (arXiv 2505.19645) and Utility-Driven SD for MoE (arXiv 2506.20675). A3B has 3B active parameters out of ~35B total; with an expert-routing sparsity of 0.031 (active experts as a fraction of total), the expert-saturation batch-size threshold is roughly 94 tokens. Draft lengths K = 3–32 are well below that, so every drafted token pulls in a fresh expert slice and verification pays for the union of all activated experts. Even 100% acceptance cannot rescue this.
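The ~94-token figure is reproducible with a simple coverage estimate: if each token independently routes to a fraction s of experts, the expected fraction of distinct experts touched by a batch of B tokens is 1 - (1 - s)^B. A sketch, where the 8-of-256 routing and the 95% coverage criterion are my assumptions for illustration, not values taken from the papers:

```python
import math

def saturation_batch(sparsity: float, coverage: float = 0.95) -> int:
    """Batch size at which the expected fraction of distinct experts
    activated reaches `coverage`, modeling per-token routing as i.i.d.:
    1 - (1 - s)^B >= coverage  =>  B = ln(1 - coverage) / ln(1 - s).
    """
    return round(math.log(1 - coverage) / math.log(1 - sparsity))

# Assumed routing: 8 active experts out of 256 => sparsity 0.03125
print(saturation_batch(8 / 256))  # -> 94
```

Below this threshold each additional drafted token activates a nearly disjoint expert slice, so verification FLOP and weight traffic grow almost linearly in K; above it the expert set is already saturated and drafted tokens amortize.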
srogmann's own Qwen3.5-122B-A10B benchmark in PR #20075 shows a +15–45% speedup, consistent with A10B sitting above the saturation threshold. So the PR works as intended on A10B and larger; the regression is specific to the small-active-parameter MoE class.
Raw per-request JSON, 3 matplotlib plots, the aggregated CSV, BENCHMARK_ENV.md (driver, CUDA, commit, model SHA256), and the exact run_*_matrix.sh scripts are in the repo. Happy to accept replications from other Ampere cards.