Those three concepts are enough to write a numerically stable, online, blockwise FlashAttention. Loop tiling, loop fusion, storage folding, and (critically) online reduction rewriting fall out as _predictable_ consequences of the lowering.
The hope is that this scheduling model makes it easier for people and search algorithms to find performant schedules.
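For readers unfamiliar with the online reduction rewrite, here is a minimal sketch in plain Python of what the rewritten computation looks like for a single query vector. It is hand-written for illustration and is not this project's front-end or its generated code; the function names (`naive_attention`, `online_blockwise_attention`) and the `block_size` parameter are hypothetical. The naive version needs the maximum over all scores before it can exponentiate, while the online version walks key/value blocks once and rescales its running sum and accumulator whenever the running maximum grows.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T) V for one query, using the global max."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # requires all scores up front
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def online_blockwise_attention(q, keys, values, block_size=2):
    """One pass over key/value blocks, carrying (max, sum, accumulator)."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they are relative to new_max.
        # On the first block this is exp(-inf) == 0.0, which is harmless
        # because running_sum and acc are still zero.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]

        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any `block_size`, including inputs with very large scores, because every exponent is taken relative to a maximum and is therefore at most zero.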
Current status: working proof-of-concept. Compiler/runtime in Rust, C backend, small Python front-end, zero dependencies. Not fast (yet).