But it's still a bit odd that the article doesn't show assembly/CUDA/OpenCL output of the compiler - nor an example of anything parallel - like maybe a vector search, a Mandelbrot calculation or something like that?
Something like this 2019 article on CUDA for Julia:
That is, where does it truly make a difference to dispatch the non-parallel parts/syscalls etc. from the GPU to the CPU, instead of dispatching the parallel parts of the code from the CPU to the GPU?
From the "Announcing VectorWare" page:
> Even after opting in, the CPU is in control and orchestrates work on the GPU.
Isn't it better to let CPUs be in control and orchestrate things as GPUs have much smaller, dumber cores?
> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.
Again, there's an obvious reason why people don't put branch-y code on a GPU.
Genuinely curious what I'm missing.
I need the heights on the GPU so I can modify the terrain meshes to fit the terrain. I need the heights on the CPU so I can know when the player is clicking the terrain and where to place things.
Rather than generating a heightmap on the CPU and passing a large heightmap texture to the GPU, I have implemented identical height-generating functions in Rust (CPU) and WebGL (GPU). As you might imagine, it's very easy for these to diverge, so I have to maintain a large set of tests that verify the generated heights are identical between implementations.
Being able to write this implementation once and run it on the CPU and GPU would give me much better guarantees that the results will be the same. (Although because of architecture differences and floating-point handling the results will never be perfectly identical, I just need them to be within an acceptable tolerance.)
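A rough sketch of what that could collapse into if the same Rust function really did run on both sides (names here are placeholders, not VectorWare's or rust-gpu's API): one shared height function, and a tolerance check against whatever buffer gets read back from the GPU.

    // Shared height function: compiled for the CPU path and, hypothetically,
    // for the GPU path too (e.g. via rust-gpu), instead of a second WebGL copy.
    fn terrain_height(x: f32, z: f32) -> f32 {
        // stand-in for the real terrain/noise function
        (x * 0.05).sin() * 10.0 + (z * 0.05).cos() * 10.0
    }

    // Cross-target check: floats won't be bit-identical across architectures,
    // so compare within a tolerance instead of asserting exact equality.
    fn heights_match(cpu: &[f32], gpu: &[f32], tol: f32) -> bool {
        cpu.len() == gpu.len()
            && cpu.iter().zip(gpu).all(|(a, b)| (a - b).abs() <= tol)
    }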
If you wrote it in OpenCL, or via Intel's libraries, or via torch or ArrayFire or whatever, you could dispatch it to both CPU and GPU at will.
Dispatch has overheads, but they're largely insignificant. Where they otherwise would be significant:
1. Fused kernels exist
2. CUDA graphs (and other forms of work-submission pipelining) exist
CUDA can also do C++ new, delete, virtual functions, exception handling and all the rest. And if you use that stuff, you're basically making an aeroplane flap its wings, with all the performance implications that come with such an abomination.
inb4 these guys start running Python and Ruby on GPU for "speed", and while they're at it, they should send a fax to Intel and AMD saying "Hey guys, why do you keep forgetting to put the magic go-fast pixie dust into your CPUs, are you stupid?"
This is really just leveraging Rust's existing, unique fit across HPC/numerics, embedded programming, low-level systems programming and even old retro-computing targets, and trying to expand that fit to the GPU; the broad characteristics that are quite unique to Rust are absolutely relevant in most/all of those areas.
The real GPU pixie dust is called "lots of slow but efficient compute units", "barrel processing", "VRAM/HBM" and "non-flat address space(s) with explicit local memories". And of course "wide SIMD+SPMD[0]", which is the part you already mentioned and is in fact somewhat harder to target other than in special cases (though neural inference absolutely relies on it!). But never mind that. A lot of existing CPU code that's currently bottlenecked on memory access throughput will absolutely benefit from being seamlessly run on a GPU.
[0] SPMD is the proper established name for what people casually call SIMT
The code is running on the GPU there. It looks like remote calls are only for "IO"; the compiled stdlib is generally running on the GPU. (Going just from the post, haven't looked at any details.)
Obviously code designed for a GPU is much faster. You could probably build a reasonable OS that runs on the GPU.
When I was in grad school I tried getting my hands on a Phi; it seemed impossible.
They rebranded SIMD lanes as "cores". For example, Nvidia 5000-series GPUs have 50-170 SMs, which are the equivalent of CPU cores there. So more than desktop CPUs, fewer than the bigger server CPUs. By this math each AVX-512 CPU core has 16-64 "gpu cores".
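If that 16-64 range is element-width math (that's my guess; it could also be FMA-port count times 16 f32 lanes), it's just this:

    // Assumption: the 16-64 figure counts lanes in one AVX-512 register
    // at different element widths.
    const AVX512_BITS: u32 = 512;
    const F32_LANES: u32 = AVX512_BITS / 32; // 16 lanes of f32
    const I8_LANES: u32 = AVX512_BITS / 8;   // 64 lanes of i8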
E.g. this code seems like it would run entirely on the CPU?
use std::io::Write; // needed for flush() below
print!("Enter your name: ");
let _ = std::io::stdout().flush();
let mut name = String::new();
std::io::stdin().read_line(&mut name).unwrap();
But what if we concatenated a number calculated on the GPU onto that string, or if we take a number as input:
print!("Enter a number: ");
[...] // the entered string has to be parsed to a float and sent to the GPU
// some calculations with that number performed on the GPU
print!("The result is: {}", the_result); // the result needs to be sent back to the CPU
Or maybe I am misunderstanding how this is supposed to work?

I once wrote a prototype async IO runtime for GLSL (https://github.com/kig/glslscript); it used a shared memory buffer and spinlocks. The GPU would write "hey, do this" into the IO buffer, then go about doing other stuff until it needed the results, and spinlock to wait for the results to arrive from the CPU. I remember this being a total pain, as you need to be aware of how PCIe DMA works on some level: having your spinlock int written to doesn't mean that the rest of the memory write has finished.
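The host side of that kind of protocol, sketched in Rust (just the ordering idea, not the actual glslscript code; over PCIe you'd additionally need the right fences/DMA completion guarantees, which plain Rust atomics don't model): publish the payload first, then flip the ready flag with release/acquire semantics, otherwise the reader can see the flag before the payload has landed.

    use std::sync::atomic::{AtomicU32, Ordering};

    // Hypothetical request/response slot. In the real thing this would live in
    // a buffer visible to both the CPU and the GPU.
    struct IoSlot {
        payload: [u32; 16], // result data written by the CPU
        ready: AtomicU32,   // 0 = pending, 1 = result published
    }

    fn cpu_complete_request(slot: &mut IoSlot, result: &[u32; 16]) {
        // 1. Write the payload first...
        slot.payload.copy_from_slice(result);
        // 2. ...then publish the flag with release ordering, so a reader that
        //    acquires `ready == 1` also sees the payload writes.
        slot.ready.store(1, Ordering::Release);
    }

    fn spin_until_ready(slot: &IoSlot) -> [u32; 16] {
        while slot.ready.load(Ordering::Acquire) == 0 {
            std::hint::spin_loop();
        }
        slot.payload
    }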
In the end, people program for GPUs not because it's more fun (it's not!), but because they can get more performance out of it for their specific task.
- file system
- network interfaces
- dates/times
- Threads, e.g. for splitting across CPU cores
The main relevant one I can think of which applies is an allocator.

I do a lot of GPU work with Rust: graphics in WGPU, and CUDA kernels + cuFFT mediated by Cudarc (a thin FFI lib). I guess running the std lib on the GPU isn't something I understand. What would be cool is the dream that's been building for decades about parallel computing abstractions, where you write what looks like normal single-threaded CPU code but it automagically works on SIMD instructions or the GPU. I think this and CubeCL may be working towards that? (I'm using Burn as well on GPU, but that's abstracted over.)
Of note: Rayon sort of is that dream for CPU thread pools!
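For anyone who hasn't seen it, the Rayon version of that dream really is just swapping the iterator (this is standard rayon par_iter usage, nothing speculative):

    use rayon::prelude::*;

    fn sum_of_squares(input: &[i64]) -> i64 {
        input.par_iter()      // was: input.iter()
             .map(|&x| x * x)
             .sum()
    }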
I've had that same dream at various points over the years, and prior to AI my conclusion was that it was untenable barring a very large, world-class engineering team with truckloads of money.
I'm guessing a much smaller (but obviously still world-class!) team now has a shot at it, and if that is indeed what they're going for, then I could understand them perhaps being a bit coy.
It's one heck of a crazy hard problem to tackle. It really depends on what levels of abstraction are targeted, in addition to how much one cares about existing languages and supporting infra.
It's really nice to see a Rust-only shop, though.
Edit: Turns out it helps to RTFA in its entirety:
>>Our approach differs in two key ways. First, we target Rust's std directly rather than introducing a new GPU-specific API surface. This preserves source compatibility with existing Rust code and libraries. Second, we treat host mediation as an implementation detail behind std, not as a visible programming model.
In that sense, this work is less about inventing a new GPU runtime and more about extending Rust's existing abstraction boundary to span heterogeneous systems.
That last sentence is interesting in combination with this:
>>Technologies such as NVIDIA's GPUDirect Storage, GPUDirect RDMA, and ConnectX make it possible for GPUs to interact with disks and networks more directly in the datacenter.
Perhaps their modified std could enable distributed compute just by virtue of running on the GPU, so long as the GPU hardware topology supports it.
Exciting times if some of the hardware and software infra largely intended for disaggregated inference ends up as a runtime for [compiled] code originally intended for the CPU.
The simpleminded way to do what you’re saying would be to have the compiler create separate PTX and native versions of a Rayon structure, and then choose which to invoke at runtime.
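Something with this shape, where the GPU branch is entirely hypothetical (a stand-in for whatever compiler-generated PTX entry point would exist) and the fallback is plain Rayon:

    use rayon::prelude::*;

    // Hypothetical stubs: only the shape of the runtime choice is the point.
    fn gpu_available() -> bool { false }
    fn saxpy_gpu(_a: f32, _x: &[f32], _y: &mut [f32]) {
        unimplemented!("stand-in for a compiler-generated PTX version")
    }

    fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
        if gpu_available() {
            saxpy_gpu(a, x, y); // PTX version
        } else {
            // native Rayon version
            y.par_iter_mut()
                .zip(x.par_iter())
                .for_each(|(yi, &xi)| *yi += a * xi);
        }
    }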
Side note & a hot take: that sort of abstraction never really existed for GPUs, and it's going to be even harder now as Nvidia et al. race to put more and more specialized hardware bits inside GPUs.
Is there device-side buffering, or does each write actually wait for the host?
UPDATE: Oh, that's a post from the maintainers of rust-gpu.
Surely there is some value in the ability to test your code on the CPU for logic bugs with printf/logging, easy breakpoints, etc and then run it on the GPU for speed? [0]
Surely there is some value in being able to manage KV caches and perform continuous batching, prefix caching, etc, directly on the GPU through GPU side memory allocations?
Surely there is some value in being able to send out just the newly generated tokens from the GPU kernel via a quick network call instead of waiting for all sessions in the current batch to finish generating their tokens?
Surely there is some value in being able to load model parameters from the file system directly into the GPU?
You could argue that I am too optimistic, but seemingly everyone here is stuck on the idea of running existing CPU code without ever even attempting to optimize the bottlenecks, rather than having GPU-heavy code interspersed with less GPU-heavy code. It's all or nothing to you guys.
[0] Assuming that people won't write GPU optimized code at all is bad faith because the argument I am presenting here is that you test your GPU-first code on the CPU rather than pushing CPU-first code on the GPU.
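Concretely, the workflow I'm arguing for might look like keeping the interesting logic in plain functions, so the CPU build gets the tests, printlns and breakpoints, and the GPU entry point (whatever form that takes in rust-gpu/VectorWare, which I'm not showing here) stays a thin wrapper:

    // GPU-first logic kept as an ordinary function: the kernel entry point
    // would call this, and the CPU test below exercises the same code.
    fn logits_to_token(logits: &[f32]) -> usize {
        let mut best = 0;
        for (i, &l) in logits.iter().enumerate() {
            if l > logits[best] {
                best = i;
            }
        }
        best
    }

    #[test]
    fn picks_the_argmax_on_cpu() {
        // Debug here with println!/breakpoints, then run the same function
        // in the GPU build for speed.
        assert_eq!(logits_to_token(&[0.1, 2.5, 0.3]), 1);
    }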