[0] For example, a GEMM where the lhs is in fp8 e4m3 and the rhs is in bf16, with fp32 accumulation and the output written as bf16 after applying GELU.
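To pin down what that footnote is asking for, here is a rough CPU reference in plain Python. It is only a sketch of the numerics, not how CubeCL or any GPU library does it. Assumptions: the fp8 operand uses the common E4M3 "FN" layout (bias 7, no infinities, a single NaN pattern), GELU is the exact erf form, and the helper names (`decode_e4m3`, `to_bf16`, `gemm_gelu`) are made up for illustration.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16: keep the top 16 bits of the float32 encoding,
    rounding to nearest, ties to even. NaN/inf handling is omitted."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def decode_e4m3(byte: int) -> float:
    """Decode one fp8 E4M3 byte (assumed FN variant, exponent bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                      # the only NaN encoding
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gemm_gelu(lhs_fp8, rhs_bf16):
    """lhs_fp8: m x k list of E4M3 bytes; rhs_bf16: k x n list of floats.
    Accumulates in emulated fp32, applies GELU, stores emulated bf16."""
    m, k, n = len(lhs_fp8), len(rhs_bf16), len(rhs_bf16[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                a = decode_e4m3(lhs_fp8[i][p])
                b = to_bf16(rhs_bf16[p][j])
                acc = to_f32(acc + to_f32(a * b))  # fp32 accumulate
            out[i][j] = to_bf16(gelu(acc))         # epilogue, then downcast
    return out
```

The interesting question for a kernel language is whether each of those steps (per-operand decode, accumulator type, fused epilogue, output cast) can be expressed independently and still map onto the hardware's matrix units.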
The project feels very nice and it would be great to have more notes in the README on the excluded functionality to better scope its applicability in more advanced GPGPU scenarios.
In Halide, the concept was great, yet the hard problems in kernel development were just moved over to the "scheduling" side, i.e. determining the tiling/vectorization/parallelization for the kernel runs.
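For anyone who hasn't used Halide, the algorithm/schedule split looks roughly like this. This is plain Python, not Halide's API, and the function names and the `tile` parameter are invented for the illustration: the first function says what to compute, the second computes the same thing with a blocked loop nest whose tile size is the "schedule" knob.

```python
def matmul_reference(a, b):
    """The 'algorithm': what to compute, with no opinion on loop order."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile):
    """Same result, but the loops are blocked by a hypothetical 'tile' knob."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # tile over rows
        for j0 in range(0, n, tile):          # tile over columns
            for p0 in range(0, k, tile):      # tile over the reduction axis
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out
```

Every tile size gives the same answer, so correctness is settled by the first function. Picking the tile size (plus vector width, thread mapping, and so on) that is actually fast on a given device is the part that stays hard.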
Given that it can target WGPU I'm really wondering why OpenCL isn't included as a backend. One of my biggest complaints about GPGPU stuff is that so many of the solutions are GPU-only, and often only target the vendor compute APIs (CUDA, ROCm), which have much narrower ecosystem support (versus an older core Vulkan profile, for example).
It's desirable to be able to target CPU for compatibility, debugging, and also because it can be nice to have a single solution for parallelizing all your data-heavy work. The latter reduces mental overhead and permits more code reuse.
One thing I've never investigated is how performant OpenCL actually is on CPU. Do you happen to have any resources comparing it to a more native CPU implementation?
It isn't really related to your question but I think the FluidX3D benchmarks [2] illustrate that OpenCL is at least viable across a wide variety of hardware.
As far as targeting CPUs in a release build goes, it's not a particular backend that's important to me. The issue is at the source code level. Having single source is nice, but you're still stuck with these two very different approaches. It means that the code is still clearly segmented, and thus retargeting any given task (at least nontrivial ones) involves rewriting it to at least some extent.
Contrast that with a model like OpenMP, where the difference between CPU and GPU is marking the relevant segment for offload. Granted, you'll often need to change algorithms when switching in order to achieve reasonable performance, but it's still a really nice quality of life feature not to have to juggle more paradigms and libraries.
[0] https://github.com/pocl/pocl
Since we don't want to rewrite everything multiple times, it also has to be multi-platform and optimal, so the feature set must be per-device, not per-language. I'm not aware of a tool that does that, especially in Rust (which Burn is written in).
JAX? But then you're stuck in Python. SYCL?
But yeah, not for Rust. This project is filling a prominent hole IMO.
Also, whom do you have to thank that LLVM exists in the first place, and has not fizzled out as yet another university compiler research project?
Generally wgpu is open to supporting any Metal extensions you need. There's usually an analogous extension in one of the other backends (e.g., Vulkan, DX12) anyway.