Great read! I saw you mention the design is pretty GPU-overfit. How hard would it be to adapt this to target TPUs, for example? Would it be relatively easy to port but hard to squeeze out good performance, or would it require a redesign from the ground up to even work at a decent level?
Hm, I think it is moderately hard to get good performance. A couple of things come to mind that would need a change: 1. TPUs have no warps, so warp-level optimizations don't apply, and some of the optimizations found by Beam Search would need to be reformulated. 2. TPUs don't have a classic command queue the way GPUs do, so the runtime would differ.
But in general I think it is feasible.
I can only answer for the things I know about. For example, Tenstorrent hardware has explicit data-movement ops that the compiler has to schedule. The Rangeify phase in tinygrad wouldn't make sense for such an architecture, since you can't just ditch the data-movement ops and replace them with ranges.
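To make "replace movement ops with ranges" concrete, here is a minimal sketch (my own illustration, not tinygrad's actual Rangeify implementation): on hardware with flexible addressing, a movement op like permute never has to materialize data, because it can be folded into the index math of the loop ranges that consume it. On hardware where data movement is an explicit, compiler-scheduled op, that folding isn't available.

```python
# Sketch only (hypothetical helpers, not tinygrad internals): contrast an
# explicit-copy permute with a permute folded into loop-range indexing.

def permute_by_copy(data, shape, perm):
    # Explicit data movement: materialize the transposed tensor in a new buffer.
    rows, cols = shape
    assert perm == (1, 0)  # this sketch only handles a 2D transpose
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = data[i * cols + j]
    return out

def permute_by_ranges(data, shape, perm):
    # No data movement: the "permute" lives entirely in how the consuming
    # loop ranges address the original buffer.
    rows, cols = shape
    assert perm == (1, 0)
    return lambda i, j: data[j * cols + i]  # swapped index math, zero copies

data = list(range(6))  # a 2x3 tensor stored flat, row-major
copied = permute_by_copy(data, (2, 3), (1, 0))
view = permute_by_ranges(data, (2, 3), (1, 0))
# Both agree element-wise, but the second never moved a byte.
assert all(copied[i * 2 + j] == view(i, j)
           for i in range(3) for j in range(2))
```

The second form is only an option when loads can use arbitrary index expressions; with explicit movement ops, the copy in the first form has to stay a real op in the graph.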