Tried optimizing the kernel directly first: tested ~14 approaches, and none beat the baseline on Apple Silicon.
What ended up working was skipping value dequantization for positions with negligible attention weight.
Flash attention computes the attention weights before accumulating V, so by that point you already know which positions won't contribute to the output.
At 32K context:
- ~90% of positions can be skipped
- +22.8% decode speedup (turbo3 KV)
- ~+5% even on q8_0 KV
- no PPL change
- NIAH (needle-in-a-haystack) scores improved (less quantization noise in the accumulation)
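A minimal NumPy sketch of the idea, not the actual Metal kernel: the quant scheme (per-row int8 with a scale), the `dequant_row` helper, and the `eps` threshold are all illustrative assumptions.

```python
import numpy as np

def dequant_row(row_q, scale):
    # Hypothetical per-row dequant: int8 codes * float scale -> float32.
    return row_q.astype(np.float32) * scale

def attend_skip_v(q, k, v_q, v_scale, eps=1e-4):
    """softmax(qK^T / sqrt(d)) @ V, but V rows are stored quantized and
    we skip dequantizing any row whose attention weight is below eps."""
    scores = (k @ q) / np.sqrt(q.shape[0])   # (T,) logits for one query
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # attention weights, known pre-accumulation
    out = np.zeros(v_q.shape[1], dtype=np.float32)
    skipped = 0
    for t, wt in enumerate(w):
        if wt < eps:                         # negligible weight: never touch this V row
            skipped += 1
            continue
        out += wt * dequant_row(v_q[t], v_scale[t])
    return out, skipped
```

The skipped weights bound the output error by roughly `sum(skipped weights) * max|V|`, which is why a small `eps` leaves perplexity unchanged while avoiding most of the dequant work at long context.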
Also validated on M2 Pro; CUDA testing is in progress.
Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/pap...