Shaders: the shader compiler does go to great lengths to avoid register bank conflicts of all kinds.
Hardware: you generally don't want to have that much complexity in a GPU pipeline, especially not in the front end -- these are very simple machines, no OoO (out-of-order) execution, very little branch prediction, and so their power comes from width, not depth. If you make each core a little more complex, you lose out because you now need to spend area that would otherwise have been spent on more compute units.
This exists because the GPU assembly programmer apparently wishes to provide a performance guarantee of some kind.
I believe they are using this to store the BVH traversal stack in registers, with allocation size depending on traversal depth. Eventually all the rays in the wave will hit or miss and free those registers.
Busy waiting is only the naive solution. I bet they went with the software-based solution so they can experiment with other approaches. Such as dynamically switching back to the previous approach of a stack in shared memory when registers run out. Or maybe they can sort the rays, moving the long-running rays out and grouping them together into waves dedicated to deep traversal.
AMD's deep dive [1] says register use is low during BVH traversal, so they are still using shared memory for the stack. According to the ISA doc [2] they have a special mode for the stack that overwrites older entries instead of overflowing and a fallback stackless mode.
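That "overwrite older entries" mode is essentially a small ring buffer: pushes past capacity silently evict the oldest entry, and when a pop underflows you drop to the stackless fallback. A minimal C-style model of the idea (my sketch, not AMD's actual hardware logic; sizes illustrative):

    #define SHORT_STACK_SIZE 8

    typedef struct {
        unsigned nodes[SHORT_STACK_SIZE];
        int top;    // next push slot (mod SHORT_STACK_SIZE)
        int count;  // valid entries, saturates at SHORT_STACK_SIZE
    } ShortStack;

    void push(ShortStack *s, unsigned node) {
        s->nodes[s->top] = node;  // overwrites the oldest entry if full
        s->top = (s->top + 1) % SHORT_STACK_SIZE;
        if (s->count < SHORT_STACK_SIZE) s->count++;
    }

    // Returns 0 on underflow, at which point the caller falls back to
    // stackless traversal (e.g., restarting from the root of the BVH).
    int pop(ShortStack *s, unsigned *node) {
        if (s->count == 0) return 0;
        s->top = (s->top - 1 + SHORT_STACK_SIZE) % SHORT_STACK_SIZE;
        *node = s->nodes[s->top];
        s->count--;
        return 1;
    }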
So the register allocation is actually increased when invoking the hit shader, allowing each hit shader to use a different number of VGPRs. Spinning is probably the most logical approach. They probably don't even need the deadlock-avoidance mode with their current implementation, which seems to allocate the full set of registers when calling the hit shader and only free them at the end.
[1] https://www.techpowerup.com/review/amd-radeon-rx-9070-series...
[2] See 12.5.3. DS Stack Operations for Ray Tracing - https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
If the next assembly instruction says to use register #200 (but the allocation failed because you only have 160 registers), what's the processor supposed to do?
The only hope is to call for more registers and block until those registers are available. There's no way to make progress otherwise.
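In pseudocode it's just a retry loop, with a hypothetical try_alloc_vgprs() standing in for the allocate-and-check instruction:

    int try_alloc_vgprs(int n);  // hypothetical: 1 on success, 0 on failure

    // On failure there is nothing useful to do except retry: forward
    // progress depends on some other wave finishing and freeing its
    // registers. (This is also where a deadlock-avoidance mode matters:
    // if every resident wave spins here, nobody ever frees anything.)
    while (!try_alloc_vgprs(needed)) {
        /* spin */
    }
    // from here on, 'needed' VGPRs are guaranteed to be allocated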
I'm having trouble with the tradeoff between compiler complexity and the perceived value of userspace scheduling on GPUs, so a means of moving work from the compiler into the runtime helps lower the barrier to landing the implementation.
Oh, or with the work hat on: function pointers are a nuisance with kernel-allocated resources, and a jump table keyed off the result of a register-file-size query would be a partial solution to that.
You might get similar effects with work graphs, CUDA dynamic parallelism or whatnot. But most coroutine patterns have significant branch divergence that probably is grossly inefficient in GPU land.
The GPU equivalent to coroutines would be to save off a continuation structure to an append buffer, and then start up a new GPU call with that buffer turned into a consume buffer (which is how raytracing circa 2018-ish worked in Blender, before dedicated raytracing hardware was a thing).
You need to consolidate all of the divergent threads and coalesce them into converged threads to benefit from SIMD architecture.
-------
EDIT: Remember that a 1920 x 1080 screen has 2,073,600 pixels, which becomes 2,073,600 threads for a typical 1-ray-per-pixel GPU implementation.
With so many threads, it doesn't make sense to "immediately" call coroutines. Instead, it makes far more sense to consolidate and coalesce threads together (ex: group up "hit" raytraces vs "miss" raytraces, which are handled differently).
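As a rough CUDA-style sketch of that binning step (Ray, traverse_bvh, and the queue layout are all illustrative, not any real API):

    struct Ray { float origin[3], dir[3]; };     // illustrative payload
    __device__ bool traverse_bvh(const Ray &r);  // hypothetical traversal

    // Pass 1: each thread traces one ray, then appends its ray index to
    // a "hit" or "miss" queue via an atomic counter. Pass 2 (not shown)
    // launches one kernel per queue, so every warp runs a single,
    // converged code path.
    __global__ void trace_and_bin(const Ray *rays, int n,
                                  int *hit_queue,  int *hit_count,
                                  int *miss_queue, int *miss_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (traverse_bvh(rays[i]))
            hit_queue[atomicAdd(hit_count, 1)] = i;
        else
            miss_queue[atomicAdd(miss_count, 1)] = i;
    }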
Coroutines aren't currently a thing on amdgpu but I think they should be.
Even on NVidia, you're "allowed" to diverge and converge. But it's not efficient.
Optimal NVidia coding will force more convergence than divergence. That's innate to GPU architecture. It's more efficient to run 32-at-a-time per NVidia warp than a diverged 8-at-a-time warp.
Yes, NVidia _CAN_ diverge and properly execute a subwarp of 8-at-a-time per clocktick... including with complex atomics and all that. But running a full 32-at-a-time warp is 400% the speed, because it's ALWAYS better to do more per clock tick than less per clock tick.
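A toy illustration (expensive_path and cheap_path are placeholders): if 8 of 32 lanes take the 'if' and 24 take the 'else', the warp executes both sides back to back with only part of the warp active each time, so the total time is roughly the sum of the two paths.

    __device__ float expensive_path(float x);  // placeholder
    __device__ float cheap_path(float x);      // placeholder

    __global__ void divergent(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 4 == 0)
            data[i] = expensive_path(data[i]);  // 8 of 32 lanes active
        else
            data[i] = cheap_path(data[i]);      // 24 of 32 lanes active
    }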
Of course, we are on a GPU here. Skimming through the RDNA 4 ISA document [1], it looks like they expect waves using the dynamic VGPR feature to keep track of how many VGPRs they have requested in a scalar register (SGPR). Each wave always has 106 SGPRs allocated (32-bit each), so the overhead of a single SGPR isn't an issue. You wouldn't even need the jump table, as AMD allows you to index into the VGPR register file. A small loop would allow you to push all active VGPRs onto the stack.
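Modeled in C-like pseudocode (a sketch of the shape only -- the real thing would be a few ISA instructions, and read_vgpr() is made up to stand for an indexed register move):

    // Spill every live VGPR to a stack in memory. Because 'vgpr_count'
    // sits in an SGPR, the same loop handles any dynamic allocation
    // size -- no per-size jump table required.
    for (int r = 0; r < vgpr_count; ++r)
        stack[sp++] = read_vgpr(r);  // indexed read of VGPR number r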
Shame they don't also allow you to index into the SGPR registers too.
But GPUs aren't really designed with coroutines or context switching in mind. Modern GPUs can do it, but GPUs have deep historical assumptions that every single wave on every single core will be running the exact same code, and that 99.99% of waves will run to completion without needing a context switch. When preemptive context switches are needed (no more than once every few ms, and only with long-running compute shaders), they can probably get away with simply DMAing the whole register file to memory.
Though, part of the reason this feature was implemented is that the historic assumptions are no longer true. DXR allows each material to register a unique hit shader, so modern GPUs need to dynamically dispatch based on the result of ray hits.
[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
That's not how it works in practice. Even with hardware accelerated raytracers (like Intel Arc).
AMD systems push the hit/miss onto various buffers and pass them around.
Intel systems push the entire call stack and shuffle the stacks around.
Let's say your 256-thread-group chunk has 30% "metallic hits", 15% "diffuse hits", and the remaining 55% are misses. You cannot "build up" a new group with just one thread-group (!!!!).
To efficiently run things, you'll need ~4 thread-groups (aka: 1024 rays) before you can run a full 256-thread group again for metallic hits, and you'll need ~2 thread-groups (aka: 512 rays) before you get a full 256-thread group again for misses. And finally, you'll need ~7 thread-groups (aka: 1792 rays passing through) before you have the 256 diffuse hits needed to fill up a SIMD unit.
In all cases, you need to dynamically grow a buffer and "build up" enough parallelism before running the recursive hit (or miss) handlers. The devil is in the details.
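The group counts above are just ceilings: you need enough input groups that rate x groups x 256 >= 256, i.e. groups = ceil(1 / rate). A quick sanity check of the example's numbers (the percentages are the example's assumptions):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        printf("metallic (30%%): %g groups\n", ceil(1.0 / 0.30));  // 4
        printf("miss     (55%%): %g groups\n", ceil(1.0 / 0.55));  // 2
        printf("diffuse  (15%%): %g groups\n", ceil(1.0 / 0.15));  // 7
    }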
Intel has very dedicated and specialized accelerators that move the stacks around (!!!!) so that all groups remain fully utilized. I believe AMD's implementation is "just" an append buffer followed by a consume buffer, simple enough really. Probably inside of shared memory, but who knows what the full implementation details are. (The AMD systems have documented ISAs, so we know what instructions are available. AMD's "raytracers" are BVH tree traversal accelerators, but they don't seem to have stack-manipulation or stack-movement operations like Intel's raytracing implementation.)
Register numbers aren't dynamically indexed; they're baked into the instruction stream. So there's not really any new information you can't get from the instruction pointer and a static (compiler-generated) lookup table?
On Nvidia you have to be writing in the PTX language to have access to something similar.
> Allocation requests don’t always succeed. s_alloc_vgpr sets the Scalar Condition Code (SCC) to indicate success, or clears it on failure.
So, you can do a binary search:
- set result to zero.
- try to allocate 512. If that succeeds, add 512 to result
- try to allocate 256. If that succeeds, add 256 to result
- try to allocate 128. If that succeeds, add 128 to result
...
- try to allocate 1. If that succeeds, add 1 to result (you’ll want to stop at the granularity at which registers are allocated)
- free ’result’ registers
(This isn’t race-free)
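Sketched out, with a hypothetical try_alloc(n) wrapping the allocate-and-test-SCC step (and stopping at an assumed 32-register granularity -- see the reply below):

    int try_alloc(int n);        // hypothetical: grow allocation by n
    void free_registers(int n);  // hypothetical: shrink allocation by n

    #define GRANULARITY 32       // assumed allocation block size

    // Probe the number of allocatable registers by binary search.
    // Not race-free: another wave can allocate or free between probes.
    int probe_free_registers(void) {
        int result = 0;
        for (int step = 512; step >= GRANULARITY; step /= 2)
            if (try_alloc(step))
                result += step;
        free_registers(result);  // give it all back; result is the answer
        return result;
    }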
1. Registers are not allocated individually, but in clusters of 32, and...
2. The max registers a single thread can access is 8 clusters (256 registers). The minimum is 1 cluster.
You don't really need much of a binary search for this.
Whatever plan you have needs to be far faster than L1 cache to even make sense. I can imagine some very heavy register-pressure situations where registers are much better than cache accesses.