So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well it "depends" based on resource usage, divergence, etc. Basically run it, profile, switch the threadblock size, profile again, etc. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.
People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.
Two articles on the AMD side of things:
https://gpuopen.com/learn/occupancy-explained
https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).
You were taught wrong...
First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".
Sometimes, more than 432 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:
Amount of shared memory used per warp
* Makeup of the instructions to be executed
* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).
I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?
This is true, but after they've been issued, it still takes a while for the execution to conclude.
> it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once
Just replace "most" with "some". It really depends on what kind of kernel you're writing.
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming...
IIUC, even CuBLAS basically just uses a bunch of heuristics that are mostly derived from benchmarking to decide with kernels to use.
Optimization is very often like that. Making things generic, uniform and simple typically has a performance penalty - and you use your GPU because you care about that stuff.
Someone correct me if I'm wrong, maybe drivers don't do this anymore.
Right now NNs and their workloads are changing quickly enough that people tend to prefer runtime optimization (like the dynamic/JIT compilation provided by Torch's compiler), but when you're confident you understand the workload and have the know-how, you can do static compilation (e.g. with ONNX, TensorRT).
I work on a serverless infrastructure product that gets used for NN inference on GPUs, so we're very interested in ways to amortize as much of that compilation and configuration work as possible. Maybe someday we'll even have something like what Redshift has in their query engine -- pre-compiled binaries cached across users.
Should we go back to FORTRAN?
There are dozens of scientific papers and active research is still being done [1].
I've worked on automatic parallel runtime optimizations and adaptive compilers since 1981. We make reconfigurable hardware (chips and wafers) that also optimises at runtime.
Truffle/GraalVM is very rigid and overly complicated [6].
With a meta compiler like Ometa or Ohm we can give any programming language the runtime adaptive compilation for GPUs [3][4].
I'm currently adapting my adaptive compiler to Apple Silicon M4 GPU and neural engine to unlock the trillions of operations per second these chips can do.
I can adapt them to more NVIDIA GPUs with the information of the website in the title. Thank you very much charles_irl! I would love to be able to save the whole website in a single PDF.
I can optimise your GPU software a lot with my adaptive compilers. It will cost less than 100K in labour to speed up your GPU code by a factor 4-8 at least, sometimes I see 30-50 times speedup.
[1] https://www.youtube.com/watch?v=wDhnjEQyuDk
[2] https://www.youtube.com/watch?v=CfYnzVxdwZE
[3] https://tinlizzie.org/~ohshima/shadama2/
[4] https://github.com/yoshikiohshima/Shadama
I have a bit of a chip on my shoulder here, since I've been trying to pitch my Modern C++ API wrappers to them for years, and even though I've gotten some appreciative comments from individuals, they have shown zero interest.
https://github.com/eyalroz/cuda-api-wrappers/
There is also their driver, which is supposedly "open source", but actually none of the logic is exposed to you. Their runtime library is closed too, their management utility (nvidia-smi), their LLVM-based compiler, their profilers, their OpenCL stack :-(
I must say they do have relatively extensive documentation, even if it doesn't cover everything.
Then 3Dfx was smashed from the inside and bought out by nVidia. Source code to 3D accellerator hardware drivers never to be seen again.
Why? Because if just anybody could port 3D graphics hardware and drivers to any custom hardware and OS platform, then Microsoft, Apple, etc would rapidly be killed by something with a MUCH better GUI (3D) appearing on the market.
The powers that be do NOT want capable, unchained computing systems to upset their carefully constructed 'enslavement via enshitification' schemes.
What? Mesa supports plenty of hardware.
Thanks for sharing it.
[1] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...
For example, "compute capability" sounds like it'd be what you need, but it's actually more of a software versioning index :(
Was thinking of splitting the difference by collecting up the quoted arithmetic (FLOP/s) and memory bandwidths from the manufacturer datasheets. But there's caveats there too, e.g. the dreaded "With sparsity" asterisk on the Tensor Core FLOP/s of recent generations.
Any chance you could just make it a single long webpage (as opposed to making me click through one page at a time)?
For some reason on my iPad the links don’t always work the first time I click them.
"These groups of threads, known as warps , are switched out on a per clock cycle basis — roughly one nanosecond. CPU thread context switches, on the other hand, take few hundred to a few thousand clock cycles"
I would note that intels SMT does do something very similar (2 hw threads). Other like the xeon phi would round robin 4 threads on a single core.
SMT allows for concurrent execution of both threads (thus independent front-end for fetch, decode especially) and certain core resources are statically partitioned unlike a warp being scheduled on SM.
I'm not a graphics expert but warps seem closer to run-time/dynamic VLIW than SMT.
This mapping is so close that translation from GPU to CPU relatively easy and performant.
> intels SMT does do something very similar (2 hw threads)
Yeah that's a good point. One thing I learned from looking at both hardware stacks more closely was that they aren't as different as they seem at first -- lots of the same ideas or techniques get are used, but in different ways.
Now that so many more people are running workloads, including critical ones, on GPUs, it feels much more important that a base level of knowledge and intuition is broadly disseminated -- kinda like how most engineers basically grok database index management, even if they couldn't write a high-performance B+ tree from scratch. Hope this document helps that along!
Do you have any script blockers, browser cache settings, or extensions that might mess with navigation?
> would also like to see a PDF that has all the text in one place, presented linearly
Yeah, good idea! I think a PDF with links so that it's still easy to cross-reference terms would get the best of both worlds.
Edit: I just checked again and it didn't load at all... I also see this is on the front page again, at 5:30pm Eastern US time :) Probably HN hug of death.