2 pointsby matt_d4 hours ago2 comments

matt_d2 hours ago
Some highlights (by Stuart Sul):
> Tensor core and memory pipelining: it turns out some tensor core instructions are implicitly pipelined, without proper documentation. Identifying these implicit semantics and the resulting pipelining tactics can boost your throughput by up to 10%.
> Hinting the PTX assembler properly: even logically identical PTX code can compile into meaningfully different SASS instructions, depending on how you write it. Signaling the assembler with the right instruction patterns is significant for minimizing latency.
> Occupancy: with all the modern GPU features, it gets tricky, and it is (again) poorly documented. Distributed shared memory doesn’t behave identically across all SMs, and 5th-generation tensor core instructions silently cap occupancy.
maralom3 hours ago
> We had read the PTX documentation countless times, but only at this point did we realize that tcgen05.copy was a typo of tcgen05.cp; the hypothetical tcgen05.copy instruction never appears again anywhere in the document.
Wow