Great read! I saw you mention the design is pretty GPU-overfit. How hard would it be to adapt this to target TPUs, for example? Would it be relatively easy to port but hard to squeeze out good performance, or would it require a redesign from the ground up to even work at a decent level?
Hm, I think it is moderately hard to get good performance. A couple of things come to mind that would need a change: 1. TPUs have no warps, so warp-level optimizations don't apply, and some of the optimizations found by Beam Search would need to be reformulated. 2. TPUs don't have a classic command queue the way GPUs do, so the runtime would differ.
But in general I think it is feasible.
I can only answer for the things I know about. For example, Tenstorrent hardware has explicit data-movement ops that the compiler has to schedule. The Rangeify phase in tinygrad wouldn't make sense for such an architecture, since you can't just ditch the data-movement ops and replace them with ranges.
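To make "replace movement ops with ranges" concrete, here is a minimal sketch (my own illustration, not tinygrad's actual Rangeify implementation): on hardware with flexible addressing, a movement op like permute never has to materialize data, because it can be folded into the index math of the loop ranges that consume it. On hardware where data movement is an explicit, compiler-scheduled op, that folding isn't available.

```python
# Sketch only (hypothetical helpers, not tinygrad internals): contrast an
# explicit-copy permute with a permute folded into loop-range indexing.

def permute_by_copy(data, shape, perm):
    # Explicit data movement: materialize the transposed tensor in a new buffer.
    rows, cols = shape
    assert perm == (1, 0)  # this sketch only handles a 2D transpose
    out = [0] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = data[i * cols + j]
    return out

def permute_by_ranges(data, shape, perm):
    # No data movement: the "permute" lives entirely in how the consuming
    # loop ranges address the original buffer.
    rows, cols = shape
    assert perm == (1, 0)
    return lambda i, j: data[j * cols + i]  # swapped index math, zero copies

data = list(range(6))  # a 2x3 tensor stored flat, row-major
copied = permute_by_copy(data, (2, 3), (1, 0))
view = permute_by_ranges(data, (2, 3), (1, 0))
# Both agree element-wise, but the second never moved a byte.
assert all(copied[i * 2 + j] == view(i, j)
           for i in range(3) for j in range(2))
```

The second form is only an option when loads can use arbitrary index expressions; with explicit movement ops, the copy in the first form has to stay a real op in the graph.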