The pyramid approach ended up being the most interesting finding: recent tokens stay in FP32, middle-aged tokens go to INT8, and old tokens drop to INT4. That gets you a 2.8x memory reduction at 0.996 cosine similarity to the FP32 baseline. It turns out tokens age out of relevance, and precision should follow.
All the code is pure C with no dependencies, about 2,400 lines. Every figure in the article is reproducible.