Rust’s Standard Library on the GPU (www.vectorware.com)

255 points by justaboutanyone 15 days ago, 13 comments

shihab11 days ago
To the author (or anyone from vectorware team), can you please give me, admittedly a skeptic, a motivating example of a "GPU-native" application?
That is, where does it truly make a difference to dispatch non-parallel/syscalls etc from GPU to CPU instead of dispatching parallel part of a code from CPU to GPU?
From the "Announcing VectorWare" page:
> Even after opting in, the CPU is in control and orchestrates work on the GPU.
Isn't it better to let CPUs be in control and orchestrate things as GPUs have much smaller, dumber cores?
> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.
Again, there's an obvious reason why people don't put branch-y code on GPU.
Genuinely curious what I'm missing.
- ukoki10 days ago
  Not OP but I'm currently making a city-builder computer game with a large procedurally-generated world. The terrain height at any point in the world is defined by a function that takes a small number of constant parameters, and the horizontal position in the world, to give the height of the terrain at that position.
  I need the heights on the GPU so I can modify the terrain meshes to fit the terrain. I need the heights on the CPU so I can know when the player is clicking the terrain and where to place things.
  Rather than generating a heightmap on the CPU and passing a large heightmap texture to the GPU I have implemented the identical height generating functions in Rust (CPU) and WebGL (GPU). As you might imagine, it's very easy for these to diverge and so I have to maintain a large set of tests that verify that generated heights are identical between implementations.
  Being able to write this implementation once and run it on the CPU and GPU would give me much better guarantees that the results will be the same. (Because of architecture differences and floating-point handling the results will never match perfectly, but I just need them to be within an acceptable tolerance.)
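  A minimal sketch of the write-once idea, in plain Rust that uses only arithmetic and `f32` math methods. `TerrainParams`, `terrain_height`, the sine-octave formula and the tolerance are all hypothetical, not taken from the game; the point is only that a single function plus a tolerance check could replace two hand-synced implementations.

  ```rust
  /// Hypothetical terrain parameters; the names and fields are assumptions.
  #[derive(Clone, Copy)]
  pub struct TerrainParams {
      pub amplitude: f32,
      pub frequency: f32,
      pub octaves: u32,
  }

  /// Height at horizontal position (x, z): a sum of sine/cosine octaves.
  /// Uses only arithmetic and `f32::sin` / `f32::cos`; no OS services.
  pub fn terrain_height(p: TerrainParams, x: f32, z: f32) -> f32 {
      let mut height: f32 = 0.0;
      let mut amp = p.amplitude;
      let mut freq = p.frequency;
      for _ in 0..p.octaves {
          height += amp * (x * freq).sin() * (z * freq).cos();
          amp *= 0.5; // each octave contributes half as much
          freq *= 2.0; // and varies twice as fast
      }
      height
  }

  /// Compare two height samples (for example CPU vs. GPU readback)
  /// within an absolute tolerance.
  pub fn heights_match(a: f32, b: f32, tolerance: f32) -> bool {
      (a - b).abs() <= tolerance
  }
  ```

  A cross-device test would then call `heights_match` on the CPU result and the value read back from the GPU, rather than asserting bit-for-bit equality.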
  - xmcqdpt210 days ago
    That's a good application but likely not one requiring a full standard library on the GPU? Procedurally generated data on GPU isn't uncommon AFAIK. It wasn't when I was dabbling in GPGPU stuff ~10 years ago.
    If you wrote it in OpenCL, or via Intel libraries, or via torch or arrayfire or whatever, you could dispatch it to both CPU and GPU at will.
  - moron4hire10 days ago
    There are GPU-based picking algorithms. You really should not have to maintain parallel data generation systems on both the GPU and CPU just to support picking. Maybe you have a different issue that would require it, but picking alone shouldn't be it.
- nicman2310 days ago
  In large sim systems: p2p comms, and not having to involve the CPU in any way, because the CPU is doing work as well and you do not want the CPU to sync every result if it is partial.
  One example is PME decomposition in GROMACS.
- storystarling11 days ago
  The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.
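  For concreteness, the per-token "sampling and control logic" in its simplest (greedy) form looks roughly like the sketch below. `toy_logits` is a made-up stand-in for the model's forward pass and the end-of-sequence id is an assumption; nothing here measures the PCIe or launch cost under discussion.

  ```rust
  /// Index of the largest logit (greedy sampling); None for empty input.
  /// Ties resolve to the earliest index.
  pub fn argmax(logits: &[f32]) -> Option<usize> {
      let mut best: Option<(usize, f32)> = None;
      for (i, &v) in logits.iter().enumerate() {
          match best {
              Some((_, b)) if v <= b => {} // keep the current best
              _ => best = Some((i, v)),
          }
      }
      best.map(|(i, _)| i)
  }

  /// Hypothetical stand-in for a forward pass: always favors token `prev + 1`.
  pub fn toy_logits(prev: usize, vocab: usize) -> Vec<f32> {
      (0..vocab)
          .map(|t| if t == (prev + 1) % vocab { 1.0 } else { 0.0 })
          .collect()
  }

  /// The autoregressive control loop: forward pass, sample, check for the
  /// end-of-sequence token, repeat.
  pub fn generate(start: usize, vocab: usize, eos: usize, max_tokens: usize) -> Vec<usize> {
      let mut out = Vec::new();
      let mut prev = start;
      for _ in 0..max_tokens {
          let logits = toy_logits(prev, vocab);
          let next = match argmax(&logits) {
              Some(t) => t,
              None => break, // empty vocabulary
          };
          out.push(next);
          if next == eos {
              break;
          }
          prev = next;
      }
      out
  }
  ```

  In a CPU-orchestrated setup, each iteration of the `for` loop in `generate` is where the host and device would hand off to each other.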
  - radarsat110 days ago
    I don't know what the pros are doing but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this, it's just some logic.
    storystarling10 days ago
    Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.
    radarsat110 days ago
    I was thinking mainly about the standard AR loop, yes I can see that grammars would make it a bit more complicated especially when considering batching.
  - tucnak10 days ago
    Turns out how? Where are the numbers?
    storystarling10 days ago
    It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1 you see the GPU spending a lot of time idle waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.
    tucnak10 days ago
    I'm not convinced. (A bit of advice: if you wish to make a statement about performance, always start by measuring things. Then when somebody asks you for proof/data, you would already have it.) If what you're saying were true, it would be a big deal, except unfortunately it isn't.
    Dispatch has overheads, but it's largely insignificant. Where it otherwise would be significant:
    1. Fused kernels exist
    2. CUDA graphs (and other forms of work-submission pipelining) exist
    saagarjha10 days ago
    CUDA graphs are pretty slow at synchronizing things.
pixelpoet11 days ago
GPUs aren't fast because they run standard CPU code with magic pixie dust, they're fast because they're specialised vector processors running specialised vector code.
Cuda can also do C++ new, delete and virtual functions and exception handling and all the rest. And if you use that stuff, you're basically making an aeroplane flap its wings, with all the performance implications that come with such an abomination.
inb4 these guys start running Python and Ruby on GPU for "speed", and while they're at it, they should send a fax to Intel and AMD saying "Hey guys, why do you keep forgetting to put the magic go-fast pixie dust into your CPUs, are you stupid?"
- zozbot23410 days ago
  Nope, the magic pixie dust language that was supposed to run Python-like code on GPU was Mojo /s
  this is really just leveraging Rust's existing, unique fit across HPC/numerics, embedded programming, low-level systems programming and even old retro-computing targets, and trying to expand that fit to the GPU by leveraging broad characteristics that are quite unique to Rust and are absolutely relevant among most/all of those areas.
  The real GPU pixie dust is called "lots of slow but efficient compute units", "barrel processing", "VRAM/HBM" and "non-flat address space(s) with explicit local memories". And of course "wide SIMD+SPMD[0]" which is the part you already mentioned and is in fact somewhat harder to target other than in special cases (though neural inference absolutely relies on it!). But never mind that. A lot of existing CPU code that's currently bottlenecked on memory access throughput will absolutely benefit from being seamlessly run on GPU.
  [0] SPMD is the proper established name for what people casually call SIMT
nu11ptr11 days ago
I feel like the title is a bit misleading. I think it should be something like "Using Rust's Standard Library from the GPU". The stdlib code doesn't execute on the GPU, it is just a remote function call, executed on the CPU, and then the response is returned. Very neat, but not the same as executing on the GPU itself as the title implies.
- mkj11 days ago
  > For example, std::time::Instant is implemented on the GPU using a device timer
  The code is running on the gpu there. It looks like remote calls are only for "IO", the compiled stdlib is generally running on gpu. (Going just from the post, haven't looked at any details)
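  To make the split concrete, here is ordinary `std` code of the kind being discussed. Going by the quoted post and this thread, the `Instant` calls would be served by a device timer while the final write would be forwarded to the host; that mapping is an assumption drawn from the discussion, not something this sketch verifies. `timed_sum` is a hypothetical name.

  ```rust
  use std::io::{self, Write};
  use std::time::{Duration, Instant};

  /// Time a pure computation, then report the result through any writer.
  pub fn timed_sum<W: Write>(n: u64, out: &mut W) -> io::Result<(u64, Duration)> {
      let start = Instant::now(); // timer read
      let sum: u64 = (1..=n).sum(); // pure computation, no OS involvement
      let elapsed = start.elapsed(); // timer read
      writeln!(out, "sum={}", sum)?; // I/O: the part that needs an OS underneath
      Ok((sum, elapsed))
  }
  ```

  Passing `&mut std::io::stdout()` as `out` gives the usual console output; a `Vec<u8>` works for testing.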
  - monocasa11 days ago
    Which is a generally valid implementation of IO. For instance on the Nintendo Wii, the support processor ran its own little microkernel OS and exposed an IO API that looked like a remote filesystem (including plan 9 esque network sockets as filesystem devices).
  - rao-v11 days ago
    I'm surprised this article doesn't provide a bigger list of calls that run on the gpu and further examples of what needs some cpu interop.
    LegNeato11 days ago
    Flip on the pedantic switch. We have std::fs, std::time, some of std::io, and std::net(!). While the `libc` calls go to the host, all the `std` code in-between runs on the GPU.
- kjuulh11 days ago
  I think it fits quite well. Kind of like the Rust standard lib runs on the CPU, this does partially run on the GPU. The post does say they fall back on syscalls, but for others there are native calls on the GPU itself, such as Instant. The same way the standard lib uses syscalls on the CPU instead of doing everything in process.
- LegNeato11 days ago
  Author here! Flip on the pedantic switch, we agree ;-)
koyote11 days ago
Are there any details around how the round-trip and exchange of data (CPU<->GPU) is implemented in order to not be a big (partially-hidden) performance hit?
e.g. this code seems like it would entirely run on the CPU?
```
    print!("Enter your name: ");
    let _ = std::io::stdout().flush();
    let mut name = String::new();
    std::io::stdin().read_line(&mut name).unwrap();
```
But what if we concatenated a number to the string that was calculated on the GPU or if we take a number:
```
    print!("Enter a number: ");
    [...] // string number has to be converted to a float and sent to the GPU
    // Some calculations with that number performed on the GPU
    print!("The result is: {}", the_result); // Number needs to be sent back to the CPU
```
Or maybe I am misunderstanding how this is supposed to work?
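For reference, a host-only version of the second snippet that compiles as ordinary Rust. `gpu_calculation` is a hypothetical placeholder for the device-side work, and the result is formatted with a `{}` placeholder. It says nothing about where each step would run under the article's scheme.

```rust
use std::num::ParseFloatError;

/// Hypothetical stand-in for work that would run on the GPU.
fn gpu_calculation(x: f64) -> f64 {
    x * x + 1.0
}

/// Parse the line a user typed, run the calculation, and format the reply.
pub fn respond(line: &str) -> Result<String, ParseFloatError> {
    // `read_line` keeps the trailing newline, so trim before parsing.
    let number: f64 = line.trim().parse()?;
    let the_result = gpu_calculation(number);
    Ok(format!("The result is: {}", the_result))
}
```

In a full program `line` would come from `std::io::stdin().read_line(&mut line)`, as in the first snippet, and the returned string would be printed.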
- kig11 days ago
  "We leverage APIs like CUDA streams to avoid blocking the GPU while the host processes requests.", so I'm guessing it would let the other GPU threads go about their lives while that one waits for the ACK from the CPU.
  I once wrote a prototype async IO runtime for GLSL (https://github.com/kig/glslscript), it used a shared memory buffer and spinlocks. The GPU would write "hey do this" into the IO buffer, then go about doing other stuff until it needed the results, and spinlock to wait for the results to arrive from the CPU. I remember this being a total pain, as you need to be aware of how PCIe DMA works on some level: having your spinlock int written to doesn't mean that the rest of the memory write has finished.
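  The ordering hazard described above has a CPU-side analogue that Rust's atomics make explicit: the flag is written with `Release` after the payload and read with `Acquire` before it. The sketch below uses two host threads and `std::sync::atomic`. It is an analogy only, since ordering between a GPU and the host across PCIe has its own rules that this does not model. `Mailbox` and its fields are hypothetical names.

  ```rust
  use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
  use std::sync::Arc;
  use std::thread;

  /// Hypothetical one-slot mailbox: a payload plus a "ready" flag.
  pub struct Mailbox {
      payload: AtomicU64,
      ready: AtomicBool,
  }

  impl Mailbox {
      pub fn new() -> Self {
          Mailbox {
              payload: AtomicU64::new(0),
              ready: AtomicBool::new(false),
          }
      }

      /// Producer: write the payload first, then set the flag with Release,
      /// so a consumer that observes the flag also observes the payload.
      pub fn publish(&self, value: u64) {
          self.payload.store(value, Ordering::Relaxed);
          self.ready.store(true, Ordering::Release);
      }

      /// Consumer: spin until the flag is observed with Acquire, then read
      /// the payload.
      pub fn wait(&self) -> u64 {
          while !self.ready.load(Ordering::Acquire) {
              std::hint::spin_loop();
          }
          self.payload.load(Ordering::Relaxed)
      }
  }

  /// Run one producer thread against the calling thread as consumer;
  /// returns what the consumer saw.
  pub fn round_trip(value: u64) -> u64 {
      let mailbox = Arc::new(Mailbox::new());
      let producer = {
          let m = Arc::clone(&mailbox);
          thread::spawn(move || m.publish(value))
      };
      let seen = mailbox.wait();
      producer.join().unwrap();
      seen
  }
  ```

  If both flag operations were `Relaxed`, the language would no longer guarantee that the consumer sees the payload once it sees the flag, which is the same class of bug as the spinlock int arriving before the rest of the write.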
- LegNeato11 days ago
  We use the cuda device allocator for allocations on the GPU via Rust's default allocator.
  - saagarjha10 days ago
    Have you considered “allocating” out of shared memory instead?
- zozbot23411 days ago
  Why are you assuming that this is intended to be performant, compared to code that properly segregates the CPU- and GPU-side? It seems clear to me that the latter will be a win.
  - koyote11 days ago
    I am not assuming it to be performant, but if you use this in earnest and the implementation is naive, you'll quickly have a bad time from all the data being copied back and forth.
    In the end, people program for GPUs not because it's more fun (it's not!), but because they can get more performance out of it for their specific task.
the__alchemist11 days ago
I'm confused about this: As the article outlines well, Std Rust (over core) buys you GPOS-provided things. For example:
```
  - file system
  - network interfaces
  - dates/times
  - Threads, e.g. for splitting across CPU cores
```
The main relevant one I can think which applies is an allocator.
I do a lot of GPU work with rust: Graphics in WGPU, and Cuda kernels + cuFFT mediated by Cudarc (A thin FFI lib). I guess, running Std lib on GPU isn't something I understand. What would be cool is the dream that's been building for decades about parallel computing abstractions where you write what looks like normal single-threaded CPU code, but it automagically works on SIMD instructions or GPU. I think this and CubeCL may be working towards that? (I'm using Burn as well on GPU, but that's abstracted over)
Of note: Rayon sort of is that dream for CPU thread pools!
- zozbot23411 days ago
  The GPU shader just calls back to the CPU which executes the OS-specific function and relays the answer to the GPU side. It might not make much sense on its own to have such strong coupling, but it gives you a default behavior that makes coding easier.
- rl311 days ago
  >What would be cool is the dream that's been building for decades about parallel computing abstractions where you write what looks like normal single-threaded CPU code, but it automagically works on SIMD instructions or GPU.
  I've had that same dream at various points over the years, and prior to AI my conclusion was that it was untenable barring a very large, world-class engineering team with truckloads of money.
  I'm guessing a much smaller (but obviously still world-class!) team now has a shot at it, and if that is indeed what they're going for, then I could understand them perhaps being a bit coy.
  It's one heck of a crazy hard problem to tackle. It really depends on what levels of abstraction are targeted, in addition to how much one cares about existing languages and supporting infra.
  It's really nice to see a Rust-only shop, though.
  Edit: Turns out it helps to RTFA in its entirety:
  >>Our approach differs in two key ways. First, we target Rust's std directly rather than introducing a new GPU-specific API surface. This preserves source compatibility with existing Rust code and libraries. Second, we treat host mediation as an implementation detail behind std, not as a visible programming model.
  In that sense, this work is less about inventing a new GPU runtime and more about extending Rust's existing abstraction boundary to span heterogeneous systems.
  That last sentence is interesting in combination with this:
  >>Technologies such as NVIDIA's GPUDirect Storage, GPUDirect RDMA, and ConnectX make it possible for GPUs to interact with disks and networks more directly in the datacenter.
  Perhaps their modified std could enable distributed compute just by virtue of running on the GPU, so long as the GPU hardware topology supports it.
  Exciting times if some of the hardware and software infra largely intended for disaggregated inference ends up as a runtime for [compiled] code originally intended for the CPU.
  - spease11 days ago
    There was a library for Rust called “faster” which worked similarly to Rayon, but for SIMD.
    The simpleminded way to do what you’re saying would be to have the compiler create separate PTX and native versions of a Rayon structure, and then choose which to invoke at runtime.
    the__alchemist10 days ago
    Why past tense? I would use that if it truly acted like Rayon! I.e minimal friction.
- shihab11 days ago
  I work with GPUs and I'm also trying to understand the motivations here.
  Side note & a hot take: that sort of abstraction never really existed for GPU, and it's going to be even harder now as Nvidia et al. race to put more & more specialized hardware bits inside GPUs.
codedokode11 days ago
I think it is possible to run CPU code on GPU (including the whole OS), because GPU has registers, memory, arithmetic and branch instructions, and that should be enough. However, it will be able to use only several cores from many thousands because GPU cores are effectively wide SIMD cores, grouped into clusters, and CPU-style code would use only a single SIMD lane. Am I wrong?
- bigyabai11 days ago
  Given enough time, we'll all loop back around to the Xeon Phi: https://en.wikipedia.org/wiki/Xeon_Phi
  - elromulous11 days ago
    It was ahead of its time!
    When I was in grad school I tried getting my hands on a phi, it seemed impossible.
    xmcqdpt210 days ago
    Xeon Phi was so cool. I wanted to use the ones we had so much... but couldn't find any applications that would benefit enough to make it worth the effort. I guess that's why it died lol.
- dancek11 days ago
  This seems correct to me. Of course you'd need to build a CPU emulator to run CPU code. A single GPU core is apparently about 100x slower than a single CPU core. With emulation a 1000x slowdown might be expected. So with a lot of handwaving, expect performance similar to a 4 MHz processor.
  Obviously code designed for a GPU is much faster. You could probably build a reasonable OS that runs on the GPU.
  - codedokode9 days ago
    You don't need an emulator, you can compile into GPU machine code.
- fulafel11 days ago
  GPUs having thousands of cores is just silly marketing newspeak.
  They rebranded SIMD lanes "cores". For example Nvidia 5000 series GPUs have 50-170 SMs which are the equivalent of cpu cores there. So more than desktops, less than bigger server CPUs. By this math each avx-512 cpu core has 16-64 "gpu cores".
  - zozbot23410 days ago
    170 compute units is still a crapload of em for a non-server platform with non-server platform requirements. so the broad "lots of cores" point is still true, just highly overstated as you said. plus those cores are running the equivalent of n-way SMT processing, which gives you an even higher crapload of logical threads. AND these logical threads can also access very wide SIMD when relevant, which even early Intel E-cores couldn't. All of that absolutely matters.
  - saagarjha10 days ago
    Each SM can typically schedule 4 warps so it’s more like 400 “cores” each with 1024-bit SIMD instructions. If you look at it this way, they clearly outclass CPU architectures.
    fulafel10 days ago
    This level corresponds to SMT in CPUs I gather. So you can argue your 192 core EPYC server cpu has 384 "vCPUs" since execution resources per core are overprovisioned and when execution blocks waiting for eg memory another thread can run in its place. As Intel and AMD only do 2-way SMT this doesn't make the numbers go up as much.
    The single GPU warp is both beefier and wimpier than the SMT thread: they're in-order barely superscalar, whereas on CPU side it's wide superscalar big-window OoO brainiac. But on the other hand the SM has wider SIMD execution resources and there's enough throughput for several warps without blocking.
    A major difference is how the execution resources are tuned to the expected workloads. CPU's run application code that likes big low latency caches and high single thread performance on branchy integer code, but it doesn't pay to put in execution resources for maximizing AVX-512 FP math instructions per cycle or increasing memory bandwidth indefinitely.
    saagarjha10 days ago
    Right, but the CPU does not have a matrix multiply unit or high bandwidth memory.
    fulafel10 days ago
    Yep. But from the point of view of running CPU-style code on GPUs (eg Rust std lib) and how the "thousands of cores" fiction relates those are less relevant.
    And for GenAI matrix math there's of course all the non-gpu acceleration features in various shapes and forms, like the on-chip edge tpu on G phones or the Intel and Apple features that are both called AMX.
- JonChesterfield10 days ago
  Merely misled by marketing. The x64 arch has 512-bit registers and a hundred or so cores. The gpu arch has 1024-bit registers and a few hundred SMs or CUs, being the thing equivalent to an x64 core.
  The software stacks running on them are very different but the silicon has been converging for years.
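  The register-width comparison reduces to simple arithmetic: lanes per register is the register width divided by the element width. A small sketch, assuming 32-bit elements (the element size is an assumption; the register widths are the ones quoted above):

  ```rust
  /// Number of SIMD lanes in a register of `register_bits` that holds
  /// values `element_bits` wide.
  pub fn simd_lanes(register_bits: u32, element_bits: u32) -> u32 {
      register_bits / element_bits
  }
  ```

  With 32-bit elements this gives 16 lanes for a 512-bit register and 32 lanes for a 1024-bit one.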
e12e10 days ago
There's indeed an actual toggle at the top to switch on pedantic mode, FYI.
But still a bit odd that the article doesn't show assembly/cuda/opencl output of the compiler - nor show an example of anything parallel - like maybe a vector search, Mandelbrot calculation or something like that?
Something like this 2019 article on cuda for Julia:
https://nextjournal.com/sdanisch/julia-gpu-programming
imtringued11 days ago
Considering that we live in the age of megakernels where the cost of CPU->GPU->CPU data transfer and kernel launch overhead are becoming ever bigger performance bottlenecks I would have expected more enthusiasm in this comment section.
Surely there is some value in the ability to test your code on the CPU for logic bugs with printf/logging, easy breakpoints, etc and then run it on the GPU for speed? [0]
Surely there is some value in being able to manage KV caches and perform continuous batching, prefix caching, etc, directly on the GPU through GPU side memory allocations?
Surely there is some value in being able to send out just the newly generated tokens from the GPU kernel via a quick network call instead of waiting for all sessions in the current batch to finish generating their tokens?
Surely there is some value in being able to load model parameters from the file system directly into the GPU?
You could argue that I am too optimistic, but seemingly everyone here is stuck on the idea of running existing CPU code without ever even attempting to optimize the bottlenecks rather than having GPU heavy code interspersed with less GPU heavy code. It's all or nothing to you guys.
[0] Assuming that people won't write GPU optimized code at all is bad faith because the argument I am presenting here is that you test your GPU-first code on the CPU rather than pushing CPU-first code on the GPU.
shmerl11 days ago
How different is it from rust-gpu effort?
UPDATE: Oh, that's a post from maintainers of rust-gpu.
jasfi11 days ago
Benchmarks would be nice to help understand the performance implications.
solaarphunk10 days ago
I've been building something similar (GPU-native OS research project) and wanted to share a mental model shift that unlocked things for me.
The question "why run CPU code on GPU when GPU cores are slower?" assumes you're running ONE program. But GPUs execute in SIMD wavefronts of 32 threads - and here's the trick: each of those 32 lanes can run a DIFFERENT process. Same instruction, different data. Calculator on lane 0, text editor on lane 1, file indexer on lane 2. No divergence, legal SIMD, full utilization. Suddenly you're not running "slow CPU code on GPU" - you're running 32 independent programs in parallel on hardware designed for exactly this pattern.
The win isn't throughput for compute-heavy code. It's eliminating CPU roundtrips for interactive stuff. Every kernel launch, every synchronization, every "GPU done, back to CPU, dispatch next thing" adds latency. A persistent kernel that polls for input, updates state, and renders - all without returning to CPU - changes the responsiveness equation entirely.
A few things to try at home if you're curious:
1. Write a Metal/CUDA kernel with while(true) and an atomic shutdown flag. See how long it runs. (Spoiler: indefinitely, if you do it right)
2. Put 32 different "process states" in a buffer and have each SIMD lane execute instructions for its own process. Watch all 32 make progress simultaneously.
3. Measure the latency from "input event" to "pixel on screen" with CPU orchestration vs GPU polling an input queue directly. The difference surprised me.
The persistent kernel thing has a nasty gotcha though - ALL 32 threads must participate in the while loop. If you do if (tid != 0) return; then while(true), it'll work for a few million iterations then hard-lock. Ask me how I know.
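Experiment 2 can be modelled on the CPU before touching a GPU. The sketch below is a host-side model, not a kernel: each of 32 "lanes" holds its own state and every step applies the same operation to all of them, which is the same-instruction, different-data pattern described above. `LaneState` and the add-a-stride "program" are invented for illustration.

```rust
/// Lane count of the modelled SIMD group (32, as in the text above).
pub const LANES: usize = 32;

/// Hypothetical per-process state: one copy per lane.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct LaneState {
    pub counter: u64,
    pub stride: u64,
}

/// One lockstep step: the same operation runs on every lane's own state.
pub fn step_all(states: &mut [LaneState; LANES]) {
    for s in states.iter_mut() {
        s.counter = s.counter.wrapping_add(s.stride);
    }
}

/// Give each lane a different stride (lane index + 1), then run `steps`
/// lockstep iterations and return the final states.
pub fn run_lockstep(steps: usize) -> [LaneState; LANES] {
    let mut states = [LaneState { counter: 0, stride: 0 }; LANES];
    for (lane, s) in states.iter_mut().enumerate() {
        s.stride = lane as u64 + 1;
    }
    for _ in 0..steps {
        step_all(&mut states);
    }
    states
}
```

All 32 states advance on every step because the operation is identical across lanes; the model deliberately contains no per-lane branching, which is where real divergence costs would show up.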
- zozbot23410 days ago
  If you're running vastly different processes in different ALU lanes, the single master "program" that comprises them all is effectively an interpreter. And then it's hard to have the exact same control flow lead to vastly different effects in different processes, especially once you account for branches. This works well for inference batches since those are essentially about straight-line processing, but not much else.
- JonChesterfield10 days ago
  It'll go much faster if you give each process a warp instead of a thread. That means each process has its own IP and set of vector registers, and when your editor takes a different branch to your browser, no cost.
cong-or10 days ago
What’s the latency on a hostcall? A PCIe round-trip for something like File::open is fine—that’s slow I/O anyway. But if a println! from GPU code blocks on a host round-trip every time, that completely changes how you’d use it for debugging.
Is there device-side buffering, or does each write actually wait for the host?
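Whatever the answer is for hostcalls, the cost model behind the question can be sketched on the host with `std::io::BufWriter`. Here `CountingSink` is a hypothetical sink that counts how many times it is written to, standing in for the number of round trips; it does not reflect how the article's runtime actually handles `println!`.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink: every `write` call stands in for one host round trip.
#[derive(Default)]
pub struct CountingSink {
    pub round_trips: usize,
    pub bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.round_trips += 1;
        self.bytes += buf.len();
        Ok(buf.len()) // accept the whole buffer in one call
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Write `lines` short messages directly: one sink write per message.
pub fn unbuffered(lines: usize) -> io::Result<CountingSink> {
    let mut sink = CountingSink::default();
    for _ in 0..lines {
        sink.write_all(b"tick\n")?;
    }
    Ok(sink)
}

/// Write the same messages through a 64 KiB buffer, flushed once at the end.
pub fn buffered(lines: usize) -> io::Result<CountingSink> {
    let mut writer = BufWriter::with_capacity(64 * 1024, CountingSink::default());
    for _ in 0..lines {
        writer.write_all(b"tick\n")?;
    }
    // `into_inner` flushes the buffer and hands the sink back.
    writer.into_inner().map_err(|e| e.into_error())
}
```

For 100 five-byte messages the unbuffered path reaches the sink 100 times and the buffered path once, with the same 500 bytes delivered either way.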
brcmthrowaway11 days ago
Can I execute FizzBuzz and DOOM on GPU?
- Cieric11 days ago
  Well you could already do doom for about 6 months now [1]. I haven't tested the nvidia side, but it ran okay on my RX 7700S in my framework laptop.
  [1] https://github.com/jhuber6/doomgeneric