joesharratt29 6 hours ago
FP4 attention is fast, but its quality degrades in long-context settings. ThriftAttention addresses this by computing the most important 5% of blocks in FP16 and the remainder in FP4. We find that this approach recovers over 90% of the performance gap between FP4 and FP16 on long-context evaluation benchmarks. The repo has a graph of the tradeoffs: https://github.com/joesharratt1229/ThriftAttention
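To make the idea concrete, here is a rough NumPy sketch of mixed-precision block attention. It is not the code from the repo. The function names, the mean-pooled block importance score, and the uniform 16-level rounding used as a stand-in for FP4 are all assumptions for illustration; the actual block selection rule and kernels may differ.

```python
import numpy as np

def fake_quantize(x, levels=16):
    """Crude stand-in for FP4 (assumption): snap each row to a 16-level uniform grid."""
    scale = np.abs(x).max(axis=-1, keepdims=True) + 1e-8
    half = levels // 2 - 1  # 7 steps on each side of zero
    return np.round(x / scale * half) / half * scale

def mixed_precision_attention(q, k, v, block=16, keep_frac=0.05):
    """Attention where the top `keep_frac` of (query block, key block) tiles
    use full-precision scores and all other tiles use quantized scores."""
    n, d = q.shape
    nb = n // block  # assumes n is a multiple of `block`

    # Hypothetical importance heuristic: dot product of mean-pooled blocks.
    q_pool = q.reshape(nb, block, d).mean(axis=1)
    k_pool = k.reshape(nb, block, d).mean(axis=1)
    importance = (q_pool @ k_pool.T).ravel()

    # Mark exactly n_keep tiles as high precision.
    n_keep = max(1, int(round(keep_frac * nb * nb)))
    top = np.argpartition(importance, -n_keep)[-n_keep:]
    hi_blocks = np.zeros(nb * nb, dtype=bool)
    hi_blocks[top] = True
    hi_blocks = hi_blocks.reshape(nb, nb)

    # Expand the tile mask to one entry per (query, key) pair.
    hi_mask = np.repeat(np.repeat(hi_blocks, block, axis=0), block, axis=1)

    # High- and low-precision scores, merged tile by tile.
    scores_hi = (q @ k.T) / np.sqrt(d)
    scores_lo = (fake_quantize(q) @ fake_quantize(k).T) / np.sqrt(d)
    scores = np.where(hi_mask, scores_hi, scores_lo)

    # Numerically stable softmax over keys.
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v, hi_blocks
```

This sketch computes both score matrices in full and only merges them afterwards, so it shows the numerics but none of the speedup. A real kernel would compute each tile once, in the precision chosen for it.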