1 point by pidtom 6 hours ago | 1 comment
  • pidtom 6 hours ago
    I’ve been working on KV cache compression and ran into a dequant bottleneck at long context.

    I tried optimizing the dequant kernel directly and tested ~14 approaches; none beat the baseline on Apple Silicon.

    What ended up working was skipping value dequant for positions with negligible attention weight.

    Flash attention computes weights before V accumulation, so you already know which positions won’t contribute.
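
    The skip can be sketched in plain NumPy. This is a hypothetical simplification, not the author's kernel: a real implementation would be fused and block-wise inside the attention kernel, and `attend_skip_dequant`, `eps`, and the per-row-scale quantization are all my assumptions for illustration.

```python
import numpy as np

def attend_skip_dequant(weights, v_quant, v_scale, eps=1e-4):
    """Accumulate the attention output, dequantizing V only for
    positions whose attention weight exceeds eps.

    weights: (T,) softmax attention weights for one query
    v_quant: (T, D) int8 quantized value rows
    v_scale: (T,)  per-row dequant scales (hypothetical scheme)
    """
    out = np.zeros(v_quant.shape[1], dtype=np.float32)
    skipped = 0
    for t, w in enumerate(weights):
        if w < eps:
            # Weight too small to contribute: skip the dequant entirely.
            skipped += 1
            continue
        v = v_quant[t].astype(np.float32) * v_scale[t]  # dequantize row
        out += w * v
    return out, skipped

# With a peaked softmax, most positions fall below eps and never
# get dequantized, while the output stays within quant-noise of the
# full accumulation.
rng = np.random.default_rng(0)
T, D = 64, 8
logits = np.full(T, -10.0)
logits[:4] = 5.0                       # only 4 positions matter
w = np.exp(logits - logits.max())
w /= w.sum()
vq = rng.integers(-128, 127, size=(T, D)).astype(np.int8)
vs = np.full(T, 0.05, dtype=np.float32)

out, skipped = attend_skip_dequant(w, vq, vs)
full = (w[:, None] * vq.astype(np.float32) * vs[:, None]).sum(axis=0)
```

    Because the weight is known before the V read, the gate costs one compare per position; the error bound is just eps times the sum of the skipped |V| values, which is why perplexity is unaffected when eps is small.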

    At 32K context:

    - ~90% of positions can be skipped

    - +22.8% decode speedup (turbo3 KV)

    - ~+5% even on q8_0 KV

    - no perplexity (PPL) change

    - NIAH (needle-in-a-haystack) retrieval improved (less quant noise in the accumulation)

    Also validated on M2 Pro, and currently being tested on CUDA.

    Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/pap...