Edit: learned a bunch, but the "uniform" registers and 64-bit (memory) performance are some easy standouts.
GPUs can certainly do bulk integer arithmetic but most use cases prefer FP. Maybe for DSP fixed-point is ideal.
For GEMM you need to visit each row/column n times, so there's a lot of data reuse going on, which isn't ideal for GPUs since you can't keep all of that data so close to your processing units. And while the tensor cores kinda implement this, I think they don't quite scale up to a full-sized systolic array, which is what you'd want for larger matrix multiplications.
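To make the reuse point concrete, here's a rough sketch of the usual tiled GEMM in plain CUDA (my own toy example, assuming square matrices with N divisible by the tile size, no tensor cores): each tile of A and B gets pulled from global memory once into shared memory and is then reused TILE times, which is exactly the reuse pattern a full systolic array would keep streaming through its PEs instead.

    #define TILE 16

    __global__ void gemm_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // one global-memory load per tile element...
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            // ...then TILE reads of each value out of shared memory
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }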
Also, just a simpler view: on GPUs most of the silicon is NOT tensor cores, so just from that you know it's not optimal, I guess.
Just quoting a FLOP/s number doesn't really mean much nowadays with tensor cores and sparsity.
In my eyes the big win of GPUs is that they're not only pretty good at GEMMs but also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^
I try to use tensor cores for non-obvious things every now and then. The most promising so far seems to be for linear arithmetic in Datalog, but that's just matrix-vector/GEMV.
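For anyone curious what that looks like, my guess at the shape of it: a rule like path(x,z) :- edge(x,y), path(y,z) turns into a boolean matrix-vector product over the adjacency matrix. Hand-wavy sketch of one step in plain CUDA (no tensor cores here; the tensor-core version would pack this into MMA tiles, and whether this matches the actual Datalog encoding meant above is my assumption):

    // one reachability step: y[i] = OR_j adj[i][j] AND x[j]
    __global__ void bool_gemv(const unsigned char* adj, const unsigned char* x,
                              unsigned char* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned char acc = 0;
        for (int j = 0; j < n; ++j)
            acc |= adj[i * n + j] & x[j];   // AND/OR semiring instead of * and +
        y[i] = acc;
    }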
Other forms of sparsity are heavily used at training time now, like block compression in DeepSeek.
"uniform registers" exist for about 20 years now.
Maybe some vendors have had an equivalent to uniform registers for 20 years, but per the article's references they are new in Nvidia GPUs as of Turing (2018).
I don't know what Nvidia did in 2018, maybe they opened up access to the uniform registers to CUDA code.
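For what it's worth, the place you'd expect them to show up from CUDA is warp-invariant scalars: kernel parameters, base pointers, anything that's the same for every thread in the warp. Toy example (whether `scale` and the pointers actually land in UR registers is up to the compiler, so you'd have to dump the SASS with cuobjdump to check):

    __global__ void scale_add(const float* in, float* out, float scale, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread, so a regular register
        if (i < n)
            out[i] = in[i] * scale + 1.0f;  // scale and the base pointers are warp-invariant
    }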
I made Grok research this topic:
> In conclusion, research strongly suggests that the "uniform" keyword in GLSL is implemented in hardware using NVIDIA's "uniform registers," as evidenced by NVIDIA's own documentation on the Turing architecture and historical practices of mapping uniforms to constant registers. While explicit links can be limited due to proprietary details, the combination of technical presentations, community discussions, and historical context supports this connection. The uniform register file, with its capacity and usage in shader instructions, aligns with GLSL's uniform functionality, ensuring efficient data access during shader execution.
https://grok.com/share/c2hhcmQtMg%3D%3D_358362f3-21e2-4fe0-a...
Unfortunately that's already two generations behind the latest GPU. After the A6000 you'd have these: the RTX 6000 Ada, then the RTX Pro 6000.
I bet I can do more CUDA with my lame GeForce MX 150 from 2017 than most people can do with whatever they can reach for to run ROCm, and that is how Nvidia keeps being ahead.
Because that is part of my point, that is a laptop GPU.
A6000 was released in 2020: https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686
I bet there are plenty of papers out there claiming to have used an RTX 6000 when they actually mean an RTX 6000 Ada generation.
To understand this, consider these names in the order of release time: Quadro RTX 6000, RTX A6000, RTX 6000 Ada, RTX Pro 6000, RTX Pro 6000 Max-Q.
New architectures rely on the compiler to handle register data dependencies and to control the register file cache allocation policy.
You are probably thinking of VLIW, like Intel's Itanium and Transmeta. Those architectures required a really smart compiler for scheduling, and it was a bust.
Nvidia GPUs also need a smart compiler, and it works because the task is limited to optimizing numerical pipelines that are 99% matrix multiplications and dot products. The data movement is more predictable: the compiler knows how the data will be used and knows how to schedule it.
Doing the work in the compiler may produce less optimal scheduling than what is theoretically possible, but with the number of "cores" in a GPU you would spend a lot of power doing it in hardware for each one.
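The classic example of what the compiler gets to do instead of out-of-order hardware: unroll a reduction into independent accumulator chains so the loads can be issued well ahead of the FMAs that consume them. Rough sketch (my own; assumes *out is zeroed beforehand):

    __global__ void dot_ilp(const float* a, const float* b, float* out, int n) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        // four independent chains: the compiler can interleave their loads and
        // FMAs at compile time to hide memory latency, no runtime scheduler needed
        for (; i + 3 * stride < n; i += 4 * stride) {
            acc0 += a[i] * b[i];
            acc1 += a[i + stride] * b[i + stride];
            acc2 += a[i + 2 * stride] * b[i + 2 * stride];
            acc3 += a[i + 3 * stride] * b[i + 3 * stride];
        }
        for (; i < n; i += stride)
            acc0 += a[i] * b[i];
        atomicAdd(out, acc0 + acc1 + acc2 + acc3);
    }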
> "GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution."
I don't think GPU utilization is a real bottleneck in most cases.