Update: I found no evidence that it supports tensor cores, so it's going to be many times slower than implementations that do.
https://github.com/m4rs-mt/ILGPU/compare/master...lostmsu:IL...
Good article: https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-M...