You want to minimize the real (monetary) and energy costs at the expense of time.
Assuming NPUs don't get pulled from consumer hardware altogether, the time/efficiency trade-off gap should, in theory, keep narrowing.
Or maybe I'm misinterpreting press releases, as evidently Notebookcheck.net lied to me years ago [1] :(
[1] https://www.notebookcheck.net/AMD-details-4-nm-Zen-4-Ryzen-7...
Like, I'm sitting here on the sidelines thinking that someone is going to implement this stuff before I even get a chance, which is why I never mention the blatantly obvious communication pattern breathing down your neck, the one the AI Engines are begging you to implement. Doing Flash Attention is slightly more difficult, but not meaningfully so.
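To back up the "not meaningfully so": beyond the two GEMMs, all FlashAttention adds is an online-softmax recurrence, a running max and running sum per query row. Here is a minimal NumPy sketch of just that recurrence (the block size, single-head shapes, and lazy final division are my own choices, nothing AIE-specific):

    import numpy as np

    def flash_attention(Q, K, V, block=64):
        # Attention computed one K/V block at a time; the full
        # N x N score matrix is never materialized.
        N, d = Q.shape
        scale = 1.0 / np.sqrt(d)
        out = np.zeros_like(Q)
        m = np.full(N, -np.inf)   # running row-wise max of the scores
        l = np.zeros(N)           # running softmax denominator
        for j in range(0, N, block):
            Kj, Vj = K[j:j+block], V[j:j+block]
            S = (Q @ Kj.T) * scale
            m_new = np.maximum(m, S.max(axis=1))
            p = np.exp(S - m_new[:, None])   # softmax numerators for this block
            fix = np.exp(m - m_new)          # rescale earlier partial sums
            l = l * fix + p.sum(axis=1)
            out = out * fix[:, None] + p @ Vj
            m = m_new
        return out / l[:, None]              # normalize once at the end

The only state carried between K/V blocks is m, l, and the partial output, which is exactly the kind of small per-row state that can stay resident in a tile's local memory while the blocks stream past.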
If you are using broadcasting to spread your A and B matrices, you're doing it wrong. The data movement that other processors do internally, you can do "outside", across the tile array. Once you understand that, you start to realize that this is actually the best possible architecture for dense GEMM and dense FlashAttention.
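I can't know which exact pattern is meant here, but the textbook no-broadcast alternative is Cannon's algorithm: every tile only ever trades blocks with its grid neighbors, west for A and north for B, which is the operand rotation a systolic array does in hardware. A minimal NumPy simulation of that dataflow (the grid size p and square shapes are simplifying assumptions; on real AI Engines the shifts would be tile-to-tile streams):

    import numpy as np

    def cannon_gemm(A, B, p=4):
        # C = A @ B on a p x p tile grid. Each step, A-blocks rotate one
        # hop west and B-blocks one hop north; nothing is broadcast.
        n = A.shape[0]
        assert A.shape == B.shape == (n, n) and n % p == 0
        t = n // p
        Ab = [[A[i*t:(i+1)*t, j*t:(j+1)*t] for j in range(p)] for i in range(p)]
        Bb = [[B[i*t:(i+1)*t, j*t:(j+1)*t] for j in range(p)] for i in range(p)]
        Cb = [[np.zeros((t, t)) for _ in range(p)] for _ in range(p)]
        # Initial skew: row i of A shifts left by i, column j of B up by j,
        # so tile (i, j) starts with a matching A/B block pair.
        Ab = [[Ab[i][(i + j) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
        for _ in range(p):
            for i in range(p):
                for j in range(p):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]   # local multiply-accumulate
            Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]  # shift west
            Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]  # shift north
        return np.block(Cb)

    # Sanity check against a plain matmul:
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 16)); B = rng.standard_normal((16, 16))
    assert np.allclose(cannon_gemm(A, B), A @ B)

Each A and B block is touched by exactly p tiles over p steps with only nearest-neighbor traffic, which is the "outside the processor" version of what a systolic MXU does inside.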