36 points by JohannaAlmeida 7 hours ago | 7 comments
  • bigbadfeline 3 hours ago
    I've been interested in faster attention and smaller models for some time, but I haven't had the time to do serious research, so I can't answer your questions.

    However, everything you do sounds very interesting, useful and well thought out, please keep doing it, I'd encourage others to work in the same direction too.

    I hope more of us can find the time for more than best wishes in the near future.

  • hackerman70000 4 hours ago
    For the evaluation question: for small code models, try-to-compile rate on generated functions is the simplest metric that actually correlates with usefulness. Perplexity tells you the model learned the distribution; compilation rate tells you it learned the structure. Beyond that, exact match on function-body completion given a signature is more informative than open-ended generation benchmarks.
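    A minimal sketch of that metric for Python output, using the built-in compile() as the "does it even compile" check; the sample snippets are illustrative, not real model output:

```python
def compile_rate(generated_functions):
    """Fraction of generated Python snippets that compile cleanly.

    This only checks syntactic validity (the 'try-to-compile rate'),
    not whether the generated code is semantically correct.
    """
    if not generated_functions:
        return 0.0
    ok = 0
    for src in generated_functions:
        try:
            compile(src, "<generated>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / len(generated_functions)

# Illustrative samples: one valid function, one with a missing colon.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def broken(a, b)\n    return a + b\n",
]
print(compile_rate(samples))  # → 0.5
```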
  • JohannaAlmeida 7 hours ago
    Full attention O(n²): 17.96s / 5.6 tok/s

    HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
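    The asymptotic gap behind those timings can be sanity-checked with back-of-the-envelope arithmetic; the window size W and state size D below are illustrative assumptions, not the model's actual settings:

```python
def full_attention_ops(n):
    # full self-attention: every token attends to every token, O(n^2)
    return n * n

def hybrid_ops(n, W, D):
    # windowed attention (n*W) plus a recurrent/state path (n*D)
    return n * W + n * D

n, W, D = 4096, 128, 64   # illustrative values
speedup = full_attention_ops(n) / hybrid_ops(n, W, D)
print(round(speedup, 1))  # → 21.3
```

    The measured 17.96s vs 0.35s gap is larger than this toy ratio, since the real numbers also reflect KV cache behavior and memory traffic, not just operation counts.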

  • empath75 6 hours ago
    Is this just for, like, autocomplete? Because you're not going to get anything very useful out of a code-only training set.
    • JohannaAlmeida 6 hours ago
      Yeah, autocomplete is an amazing use case. I needed a small model that used transformers and could fit on my weak consumer GPU.

      So I needed to make fundamental architecture changes and do some KV cache tricks.

      And then prove the new architecture was faster with benchmarks while keeping perplexity acceptable.
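      One common KV cache trick, sketched here as an assumption about what "fit on a weak GPU" might involve (the thread doesn't say which tricks were used): cap the cache at the attention window so memory stays O(W) instead of growing O(n) with sequence length.

```python
from collections import deque

class SlidingKVCache:
    """Hypothetical sketch: a KV cache that keeps only the last
    `window` key/value pairs, matching a local attention window.
    deque(maxlen=...) silently evicts the oldest entry on append."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def get(self):
        return list(self.keys), list(self.values)

cache = SlidingKVCache(window=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
ks, vs = cache.get()
print(ks)  # → ['k2', 'k3', 'k4']
```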

    • bigbadfeline 3 hours ago
      Well, coding is a kind of extended autocomplete - I prefer that way of working because I don't like the mess created by LLMs when you let them work on their own. Smaller models, specialized on a single language, make a lot of sense.
    • altruios 5 hours ago
      I think it's more a proof of concept: locally trained. It would take lots of resources/time to train something non-trivial.
  • woodson 6 hours ago
    Look into RWKV.
    • JohannaAlmeida 5 hours ago
      Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I’m combining local windowed attention with a gated recurrent path plus KV cache compression, so it’s more hybrid than fully replacing attention.
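      A toy NumPy sketch of that hybrid idea, under my own simplifying assumptions (a decayed running sum stands in for the gated recurrent path, and KV compression is not shown): each query attends only to the last few tokens, and tokens that fall out of the window are folded into a recurrent state instead of being dropped.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(q, k, v, window=4, decay=0.9):
    """Illustrative hybrid layer, not the thread author's implementation.

    Per query: exact attention over the last `window` keys (O(n*W)),
    plus a decayed running sum of evicted values as a stand-in for a
    gated recurrent path over the distant past (O(n*D)).
    Shapes: q, k, v are (n, d).
    """
    n, d = q.shape
    out = np.zeros_like(v)
    state = np.zeros(d)                      # summary of tokens past the window
    for t in range(n):
        lo = max(0, t - window + 1)
        if lo > 0:
            # the token that just left the window feeds the recurrent state
            state = decay * state + (1 - decay) * v[lo - 1]
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        out[t] = softmax(scores) @ v[lo:t + 1] + state
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = hybrid_attention(q, k, v, window=4)
print(out.shape)  # → (8, 4)
```

      A real gated path would learn the decay per channel instead of using a fixed scalar, but the shape of the computation is the same.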