Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT (pythongiant.github.io)

18 points by pythongiant 3 hours ago | 6 comments

stpedgwdgfhgdd an hour ago
I just don't get why people choose Python and not, e.g., Go for high-performance problems.
- larme 10 minutes ago
  Go is not high-performance enough. As others said, you implement the high-performance part in C++ and use Python to glue it together.
- Yoric 37 minutes ago
  Go is pretty good at performance, but pretty bad at expressing domain-specific logic. Python is the opposite, but once you have isolated the parts that need to be optimized, it's quite easy to rewrite them in a native language (in particular, the Rust-Python bindings are really good, although this project uses C++).
- sigmoid10 an hour ago
  Python is a very convenient skeleton for gluing together high-performance modules written in C or CUDA. Writing the boilerplate to adapt those modules to your project in C or CUDA themselves would be far less convenient.
hexnuts 2 hours ago
Bad site design: if I can't scroll to see the next slide, that's just broken.
x0ruman an hour ago
The functionality is impressive, but the website needs some work.
sakex an hour ago
Is this based on paged attention with hashing of the pages?
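On the paged-attention question: below is a minimal sketch, assuming a prefix-caching design in the style of vLLM's automatic prefix caching, where each fixed-size block's hash folds in its parent block's hash. It is contrasted with keying each chunk by its own tokens only. The point is to show why an identical interior chunk that follows a different preamble gets zero hits under chained hashing but can hit under content-only keys. This is not KVBoost's code; the block size, the SHA-256 keying and the helper names (`split_blocks`, `prefix_chained_keys`, `content_only_keys`, `count_hits`) are all hypothetical.

```python
import hashlib
from typing import List, Optional, Sequence, Set, Tuple

BLOCK_SIZE = 4  # tokens per block; tiny for illustration (hypothetical value)


def split_blocks(tokens: Sequence[int], size: int = BLOCK_SIZE) -> List[Tuple[int, ...]]:
    """Full fixed-size blocks only; a trailing partial block is not cached."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def _digest(*parts: object) -> str:
    return hashlib.sha256(repr(parts).encode()).hexdigest()


def prefix_chained_keys(tokens: Sequence[int]) -> List[str]:
    """Prefix-caching keys: each block's key folds in the previous block's key,
    so a block can only hit if everything before it matched too."""
    keys: List[str] = []
    parent: Optional[str] = None
    for block in split_blocks(tokens):
        parent = _digest(parent, block)
        keys.append(parent)
    return keys


def content_only_keys(tokens: Sequence[int]) -> List[str]:
    """Chunk keys that depend only on the chunk's own tokens, so the same chunk
    can hit even when it sits after a different prefix (interior reuse)."""
    return [_digest(block) for block in split_blocks(tokens)]


def count_hits(cached: Set[str], query: List[str]) -> int:
    """Number of query keys already present in the cache."""
    return sum(key in cached for key in query)


if __name__ == "__main__":
    shared = [10, 11, 12, 13, 20, 21, 22, 23]  # e.g. a reused document or code file
    first = [1, 2, 3, 4] + shared              # cached earlier
    second = [5, 6, 7, 8] + shared             # same chunks, different preamble
    print("prefix-chained hits:", count_hits(set(prefix_chained_keys(first)), prefix_chained_keys(second)))
    print("content-only hits:", count_hits(set(content_only_keys(first)), content_only_keys(second)))
```

Run as a script, this prints 0 prefix-chained hits and 2 content-only hits. Content-only keys make interior hits possible, but the cached K/V for a reused chunk were computed under a different preceding context and at different positions, so reusing them is not exact on its own; presumably that gap is what recompute strategies such as the selective boundary and CacheBlend approaches mentioned in the post are meant to close.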
pythongiant 3 hours ago
KVBoost is a chunk-level KV cache reuse library for HuggingFace models (pip install kvboost). It supports two recompute strategies (selective boundary and CacheBlend), int8/int4 KV quantization for 2–4x RAM reduction, disk-backed cold storage, and 11 architectures including Llama, Qwen, Gemma, Mistral, and Phi. On Qwen2.5-3B we measured 47.9x TTFT speedup on an 8-turn conversation, 21x on code context reuse, 100–743x faster than MLX, and 3–41x faster than vLLM-MLX — including interior chunk reuse where vLLM gets zero hits. Outputs are token-for-token identical to baseline under greedy decoding. Works best on 3B+ models with 500+ token shared context. GitHub: https://github.com/pythongiant/KVBoost
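The int8/int4 RAM figure can be sanity-checked with a back-of-the-envelope sketch: symmetric per-row int8 quantization with one float scale per row, plus the byte count for a cache holding K and V for every layer, KV head and token. The shape used (32 layers, 8 KV heads, head_dim 128, 1,000 tokens), the 2-byte per-row scale and the function names are hypothetical and not taken from the post; KVBoost's actual quantization scheme may differ, and int4 is only sized here, not implemented.

```python
from typing import List, Tuple


def quantize_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one row (e.g. one token's K or V vector for one head).

    A single float scale maps the row's largest magnitude to 127.
    """
    amax = max((abs(x) for x in row), default=0.0)
    scale = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero for an all-zero row
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Approximate reconstruction; per-element error is at most scale / 2."""
    return [v * scale for v in q]


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
                   bits: int, scale_bytes_per_row: int = 0) -> int:
    """Bytes to store K and V (hence the factor 2) for every layer, KV head and token.

    Each row holds head_dim values at `bits` bits (int4 assumes two values packed
    per byte) plus an optional per-row scale overhead.
    """
    rows = 2 * layers * kv_heads * tokens
    return rows * (head_dim * bits // 8 + scale_bytes_per_row)


if __name__ == "__main__":
    # Hypothetical shape, not taken from any specific model in the post.
    shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=1000)
    fp16 = kv_cache_bytes(**shape, bits=16)
    int8 = kv_cache_bytes(**shape, bits=8, scale_bytes_per_row=2)
    int4 = kv_cache_bytes(**shape, bits=4, scale_bytes_per_row=2)
    print(f"fp16 {fp16 / 2**20:.1f} MiB, int8 {fp16 / int8:.2f}x smaller, int4 {fp16 / int4:.2f}x smaller")
```

With these assumptions the fp16 cache is 125.0 MiB, int8 is about 1.97x smaller and packed int4 about 3.88x smaller; the per-row scales are why the savings land just under 2x and 4x rather than exactly on them.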
- snovv_crash 2 hours ago
  Even the things that should be normal dashes are em-dashes
  - mrob an hour ago
    En-dashes are not em-dashes, and they're standard typography for numeric ranges.
    https://en.wikipedia.org/wiki/Dash#Ranges_of_values
- arjie 2 hours ago
  I don't get it. The CacheBlend paper's implementation lives in LMCache. Did you compare against vLLM with LMCache? This is confusing.
- pferdone 2 hours ago
  slop