2 points by nadeko123 2 hours ago | 1 comment
  • nadeko123 2 hours ago
    We implemented 10 KV-cache compression strategies from scratch in C and benchmarked them against each other on quality, memory, and throughput. Everything from symmetric INT8 to H2O eviction to a pyramid scheme (the good kind) that assigns different precision based on token age.
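
    For anyone who hasn't seen the symmetric INT8 baseline, here is a minimal sketch of the idea. It is not lifted from the article's code: the function names `quantize_int8` and `dequantize_int8`, and the choice of one scale per vector, are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Quantize n floats to INT8 with a single symmetric scale (no zero point).
 * Returns the scale needed to dequantize. Hypothetical helper. */
static float quantize_int8(const float *x, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > max_abs) max_abs = a;
    }
    /* Map [-max_abs, max_abs] onto [-127, 127]; guard the all-zero case. */
    float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = x[i] / scale;
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round half away from zero */
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        q[i] = (int8_t)r;
    }
    return scale;
}

/* Reconstruct approximate floats from INT8 codes and their scale. */
static void dequantize_int8(const int8_t *q, float scale, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)q[i] * scale;
    }
}
```

    Each value comes back within about half a quantization step (`scale / 2`) of the original, which is why a single outlier in a vector hurts: it inflates the scale for every other element.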

    The pyramid approach ended up being the most interesting finding – recent tokens stay FP32, middle-aged go INT8, old tokens drop to INT4. Gets you 2.8x memory reduction at 0.996 cosine similarity to the FP32 baseline. Turns out tokens age out of relevance and precision should follow.
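
    The tiering rule itself is tiny. Below is a rough sketch of how age could map to precision and what that does to the byte count. The names (`kv_tier`, `tier_for_age`, `pyramid_bytes`) and the window parameters are hypothetical, the thresholds used in the article are not stated here, and per-block scale overhead is ignored.

```c
#include <stddef.h>

/* Precision tiers, from newest tokens to oldest. Hypothetical names. */
typedef enum { TIER_FP32, TIER_INT8, TIER_INT4 } kv_tier;

/* age = number of tokens generated after this one (0 = newest).
 * recent_window and middle_window are assumed tuning knobs. */
static kv_tier tier_for_age(size_t age, size_t recent_window, size_t middle_window) {
    if (age < recent_window) return TIER_FP32;
    if (age < recent_window + middle_window) return TIER_INT8;
    return TIER_INT4;
}

/* Bytes to store one token's vector of `dim` values in a given tier
 * (quantization scales not counted). */
static size_t bytes_for_token(kv_tier tier, size_t dim) {
    switch (tier) {
    case TIER_FP32: return dim * sizeof(float);
    case TIER_INT8: return dim;
    default:        return (dim + 1) / 2; /* two 4-bit codes per byte */
    }
}

/* Total bytes for a cache of n_tokens vectors under the pyramid rule. */
static size_t pyramid_bytes(size_t n_tokens, size_t dim,
                            size_t recent_window, size_t middle_window) {
    size_t total = 0;
    for (size_t pos = 0; pos < n_tokens; pos++) {
        size_t age = n_tokens - 1 - pos;
        total += bytes_for_token(tier_for_age(age, recent_window, middle_window), dim);
    }
    return total;
}
```

    Dividing `n_tokens * dim * sizeof(float)` by the result of `pyramid_bytes` gives the compression ratio for a particular choice of windows, so the ratio depends on sequence length as well as on the thresholds.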

    All code is pure C, no dependencies, ~2,400 lines. Every figure in the article is reproducible.