1 point by GaggiX 6 hours ago | 1 comment

GaggiX 6 hours ago
Not to be confused with Flash Attention.
What's novel here is the extremely small KV cache memory usage at long context windows, e.g. 0.77GB at a 512K context, a 90% reduction compared to the already very small KV cache of DeepSeek V4 Flash.
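
For a rough sense of scale, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes "512K" means 512 * 1024 tokens, "GB" means 10^9 bytes, and that the 90% figure is measured at the same context length; none of that is confirmed, and the helper names are just made up for illustration.

```python
# Figures quoted in the comment; the unit interpretations are assumptions.
KV_CACHE_GB = 0.77            # reported KV cache size
CONTEXT_TOKENS = 512 * 1024   # "512K" read as 512 * 1024 tokens (assumption)
REDUCTION = 0.90              # "90% reduction" taken literally (assumption)


def bytes_per_token(cache_gb, tokens, bytes_per_gb=10**9):
    """Average KV cache bytes per token of context."""
    return cache_gb * bytes_per_gb / tokens


def implied_baseline_gb(cache_gb, reduction):
    """Cache size the comparison model would need if the reduction holds."""
    return cache_gb / (1.0 - reduction)


per_token = bytes_per_token(KV_CACHE_GB, CONTEXT_TOKENS)
baseline = implied_baseline_gb(KV_CACHE_GB, REDUCTION)

print(f"{per_token:.0f} bytes of KV cache per token")
print(f"{baseline:.1f} GB implied baseline at the same context length")
```

So under those assumptions it works out to roughly a kilobyte and a half of KV cache per token, which is the part that stands out.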