2 pointsby melihelibol2 hours ago1 comment
  • melihelibolan hour ago
    Hello,

    I built cuTile Rust and just posted the paper preprint. Happy to answer questions.

    TL;DR: Rust gives you fearless concurrency on the CPU, but GPU kernel programming still requires unsafe code. cuTile Rust carries Rust's ownership model across the launch boundary and maintains it via a safe tile-based programming model that compiles to Tile IR. The host-side GPU work you write composes into synchronous launches, async pipelines, or a CUDA graph you capture once and replay.

    It works by letting you partition mutable output tensors into disjoint pieces on the host: Each tile program gets an exclusive `&mut` view of its piece and the inputs as shared `&` reads. Kernels are written with single-threaded semantics, and the compiler maps that to thread blocks and manages shared memory. Because the pieces are provably disjoint and ordering is threaded through mutable references, you get compile-time data-race freedom when using the safe surface API. As with any other Rust tool, safety remains extensible: Any functionality not yet exposed safely can be made available by writing your own safe abstractions over unsafe code where you supply the invariants yourself. The lower-level Tile IR operation surface (as of CUDA 13.3) is exposed through unsafe intrinsics.

    On the B200, an optimized safe GEMM kernel is competitive with cuBLAS: It's within 0.3% of a hand-written low-level (Tile IR) variant at ~92% of the GPU's dense f16 peak. Element-wise is ~7 TB/s. So on these kernels the safety is effectively free. We also worked with Hugging Face to evaluate Grout, a Qwen3 inference engine built on cuTile Rust. In batch-1 Qwen3 decode, Grout reaches 171 tokens/s for Qwen3-4B on an NVIDIA GeForce RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, showing strong performance on memory-bound inference tasks, consistent with our HBM roofline analysis. Benchmarks (harnesses + CSVs), hardware, and clock settings can be found in the repo. The full methodology is available in the paper: https://arxiv.org/abs/2606.15991

    This is an early-stage research release: the tensor API is young (a few patterns still need raw pointers), GEMM trails cuBLAS at some sizes, and the tile model gives up SIMT-level control (explicit warp primitives, manual shared memory) in exchange for the semantics that make safety checkable. The recent 0.2.0 release added low-precision support (FP4 packing and block-scaled MMA on CUDA 13.3). These only just landed, so we haven't benchmarked them yet, but we expect them to perform well. Also, while Tile IR is not portable across GPU vendors, it is portable across NVIDIA GPU architectures: What you write in cutile-rs will work on sm_80+ (Ampere and up) with CUDA 13.3, but hardware-specific features such as native FP4 require architectures that support them.

    The latest release is on crates.io. After setting up CUDA 13, you can `cargo add cutile` to pull it into your own project, or clone the repo and `cargo run -p cutile-examples --example hello_world` to try it out. Most of what's here has co-evolved with community feedback and contributions since we made the repo public. We read everything that comes in, so anything you raise will shape the direction of this project. If you have a cool feature idea, open an issue or a PR and let's discuss.