Compared to the weird, lumpy lego set of avx1/2, avx512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.
Personal example: The double width byte shuffles (_mm512_permutex2var_epi8) that takes 128 bytes as input in two registers. I had a critical inner loop that uses a 256 byte lookup table; running an upper/lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on zen5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.
Compared to Huff0 [1] (used by Zstd), my AVX512 code is currently ~40% faster at both compression and decompression. This requires using 32 data streams instead of the 4 used by Huff0.
For decode, do you use AVX512 to speed up decoding by caching the decode of small codewords?
Do you decode serially, or do you use the self-synchronizing nature of Huffman codes to decode the stream from multiple offsets in parallel? I haven't seen the latter done in SIMD before.
Are there any new SIMD instructions you'd like to see in future ISA extensions?
OpenPower has proposed a scalar instruction to speed up prefix-code decoding: https://libre-soc.org/openpower/prefix_codes/
Maybe you're remapping RGB values [0..255] with a tone curve in graphics, or doing a mapping lookup of IDs to indexes in a set, or applying a permutation table, or... well, there are a lot of use cases, right? This is essentially an arbitrary function lookup where the domain and range are both bytes.
It looks like this in scalar code:
void transform_lut(byte* dest, const byte* src, int size, const byte* lut) {
    for (int i = 0; i < size; i++) {
        dest[i] = lut[src[i]];
    }
}
The function above is basically load/store limited - it's doing negligible arithmetic, just loading a byte from the source, using that to index a load into the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 per cycle there)
Here's a snippet of the AVX512 version.
You load the lookup table into 4 registers outside the loop:
__m512i p0, p1, p2, p3;
p0 = _mm512_loadu_epi8(lut);          // table bytes   0..63
p1 = _mm512_loadu_epi8(lut + 64);     // table bytes  64..127
p2 = _mm512_loadu_epi8(lut + 128);    // table bytes 128..191
p3 = _mm512_loadu_epi8(lut + 192);    // table bytes 192..255
Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since one shuffle can only address 128 bytes of table, we DO have to do it twice, once for the lower half and again for the upper half, and use a mask to choose between them appropriately on a per-element basis.
auto tLow = _mm512_permutex2var_epi8(p0, x, p1);
auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
You can use _mm512_movepi8_mask to load the mask register. That instruction marks a lane as active if the high bit of its byte is set, which perfectly matches our table split: indices 128..255 are exactly the ones with the high bit set. You could use the mask register directly on the second shuffle instruction or in a later blend instruction; it doesn't really matter.
For every 64 bytes, the AVX512 version does one load, one store, and two permutes, which Zen5 can execute at 2 per cycle. So 64 elements per cycle.
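To make that concrete, here's a minimal sketch of how one 64-byte iteration could be finished with a blend (my own variable names and loop framing, assuming size is a multiple of 64; the author's actual code may differ):
for (int i = 0; i < size; i += 64) {
    __m512i x      = _mm512_loadu_si512(src + i);            // 64 input bytes
    __m512i tLow   = _mm512_permutex2var_epi8(p0, x, p1);    // lookups into lut[0..127]
    __m512i tHigh  = _mm512_permutex2var_epi8(p2, x, p3);    // lookups into lut[128..255]
    __mmask64 hi   = _mm512_movepi8_mask(x);                 // bit set where index >= 128
    __m512i result = _mm512_mask_blend_epi8(hi, tLow, tHigh);
    _mm512_storeu_si512(dest + i, result);
}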
So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but the 16-byte table pshufb can address is too small to be really useful for this. Being able to do an arbitrary, super-fast byte-to-byte transform is incredibly useful.
// Pseudocode for the intrinsic: bit 6 of each index byte selects between the
// two source registers a and b; the low 6 bits select a byte within it.
func _mm512_permutex2var_epi8(a, idx, b [64]uint8) [64]uint8 {
    var dst [64]uint8
    for j := 0; j < 64; j++ {
        i := idx[j]
        src := a
        if i&0b0100_0000 != 0 {
            src = b
        }
        dst[j] = src[i&0b0011_1111]
    }
    return dst
}
Basically, for a lookup table of 8-bit values, you need only 1 instruction to perform up to 64 lookups simultaneously, for each 128 bytes of table.
If you want to make 4 at a time though, you have to keep the thing fed. You need your ingredients in the cache, or you are just going to waste time finding them.
Quick reminder that a 20x boost is better than going from O(n) to O(log n) for up to a million items. And log n algorithms are often simply not possible for many problems.
I think he was speaking from personal experience.
Anyway, I do not think such a statement can be remotely true, even "typically". It is 2 orders of magnitude off (20 vs 5000).
But maybe the author simply made it up.
x = 20·log2(x) at x ≈ 143; i.e., with equal constant factors, a flat 20x speedup beats going from O(n) to O(log n) only up to about 143 items, not a million.
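For what it's worth, a quick sanity check of that crossover (my own toy snippet; it assumes both approaches have equal unit costs, which real constant factors would shift):
#include <cmath>
#include <cstdio>

int main() {
    // Compare an O(n) algorithm sped up 20x (cost ~ n/20) against an
    // O(log n) algorithm (cost ~ log2(n)), both with unit constant factors.
    for (int n : {16, 64, 143, 144, 1024, 1000000}) {
        double flat20x = n / 20.0;
        double logn    = std::log2((double)n);
        std::printf("n=%7d  n/20=%9.2f  log2(n)=%6.2f  -> %s wins\n",
                    n, flat20x, logn, flat20x < logn ? "20x" : "log n");
    }
}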
The width of the SIMD instructions is not visible when programming with NVIDIA CUDA or with the similar compilers for Intel CPUs (ispc or oneAPI with SYCL targeting CPUs or OpenMP with appropriate pragmas), but only because the compiler takes care of that.
There wasn't much appetite for any of it on Emscripten.
https://github.com/WebAssembly/wasi-libc/pulls?q=is%3Apr+opt...
My own fuzzing doesn't report any inconsistencies. But fuzzing is always necessarily incomplete.
PSHUFB wins in the case of unpredictable access patterns, though I don't remember by how much it typically wins.
PMOVMSKB can replace several conditionals (up to 16 in SSE2 for byte operands) with only one, winning in terms of branch prediction.
PMADDWD is in SSE2, and does eight 16-bit multiplies, not 4. SSE4.1 has FP rounding that doesn't require changing the rounding mode, etc. The weird string functions in SSE4.2. Non-temporal moves and prefetching in some cases.
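To illustrate the PMOVMSKB point above, here's a minimal sketch (a hypothetical helper of my own, SSE2 only) that checks 16 bytes for a zero with a single conditional instead of sixteen:
#include <emmintrin.h>   // SSE2

// Returns true if any of the 16 bytes at p is zero.
bool has_zero_byte_16(const unsigned char* p) {
    __m128i v    = _mm_loadu_si128((const __m128i*)p);
    __m128i eq   = _mm_cmpeq_epi8(v, _mm_setzero_si128());  // 0xFF in lanes equal to zero
    int     mask = _mm_movemask_epi8(eq);                   // PMOVMSKB: one bit per byte lane
    return mask != 0;                                        // one branch instead of sixteen
}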
The cool thing with SIMD is that it puts a lot less stress on the CPU's access prediction and branch prediction, not only the ALUs. So when you optimize, it will help unrelated parts of your code go faster.
They were notable for several reasons, although they are no longer included in modern silicon.
Of course, memory bandwidth should increase proportionally; otherwise the cores might have no data to process.
But 128-bit is just ancient. If you're going to go to significant trouble to rewrite your code in SIMD, you want to at least get a decent perf return on investment!
That is just a software view provided by the CUDA compiler and the NVIDIA device driver, which take care of distributing the computation over SIMD lanes and real hardware threads (called warps in NVIDIA's obfuscating jargon).
A GPU is just a processor with SIMD and FGMT (fine-grained multi-threading). There is nothing special about it from this point of view.
For any CPU you can write a compiler that provides the "SIMT" software model, and there are such compilers for the Intel/AMD CPUs.
SIMT is just one of many useless synonyms created by NVIDIA for traditional terms that had been used for decades in computing publications. It is the same as what Hoare called an "array of processes" (in August 1978, later implemented in the language Occam), and essentially the same as OpenMP's "parallel for", just with a slightly different syntax (i.e. CUDA separates the header and the body of the parallel FOR in the source file).
It's less a difference in instruction set capability and more a difference in mentality.
Like, for SIMD, you have to say "ok, we're working in vector land now" and start doing vector loads into vector registers to do vector ops on them. Otherwise, the standard variables your program uses are scalars and you get less parallelism. On a GPU this is flipped: the regular registers are vector, and the scalar ones (if you have any) are the weird ones. And because of this the code you write is (more or less) scalar code where everything so happens to magically get done sixteen times at once.
As you can imagine, this isn't foolproof and there's a lot of other things that have to change on GPU programming in order to be viable. Like, conditional branching has to be scalar, since the instruction pointer register is still scalar. But you can have vectors of condition flags (aka "predicates"), and make all the operations take a predicate register to tell which specific lanes should and shouldn't execute. Any scalar conditional can be compiled into predicates, so long as you're OK with having to chew through all instructions on both branches[0].
[0] A sufficiently smart shader compiler could check if the predicate is all-false or all-true and do a scalar jump over the instructions that won't execute. Whether or not that's a good idea is another question.
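As a rough illustration of the predication idea, here's how the same pattern looks with AVX-512 mask registers on a CPU (my own example, not GPU code; the function and variable names are made up):
#include <immintrin.h>

// Branch-free form of: for each lane, out[i] = (a[i] < 0) ? -a[i] : a[i].
// The compare produces a predicate (mask); the masked subtract only writes
// lanes whose predicate bit is set, the rest keep their value from 'va'.
void abs16(float* out, const float* a) {
    __m512    va  = _mm512_loadu_ps(a);
    __mmask16 neg = _mm512_cmp_ps_mask(va, _mm512_setzero_ps(), _CMP_LT_OQ);  // lanes where a < 0
    __m512    res = _mm512_mask_sub_ps(va, neg, _mm512_setzero_ps(), va);     // res = neg ? (0 - a) : a
    _mm512_storeu_ps(out, res);
}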
Now whether this is actually how the hardware operates, or whether the compiler in the GPU driver turns the SIMT code into something like SIMD code for the actual HW, is another question.
Why do we even need SIMD instructions? - https://news.ycombinator.com/item?id=44850991 - Aug 2025 (8 comments)
>Yet this new family of instructions was able to describe a great deal more work per instruction – the key benefit of SIMD.
It doesn't really explain why it is a benefit. From an ALU perspective, each SIMD lane could in theory execute a different operation. So why would simultaneous identical operations, as in SIMD, ever be a thing? You could just keep cranking up the ILP (instruction-level parallelism) by adding more ALUs, naturally extending to MIMD. Each SIMD lane would be equivalent to a full CPU core, so why is everyone stopping short of unlocking even more performance?
Because instruction memory (SRAM) is incredibly expensive.
By using SIMD you are reducing the number of instructions needed to describe repeating calculations. If you work with 32-bit floats and 512-bit instructions representing 16 lanes, you've made a 16-fold reduction in the quantity of instruction memory required, and in subsequent fetches from that memory, to express the same parallel computation compared to using ILP.
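As a toy illustration of that ratio (my example, not the commenter's): the same 16 float additions expressed as one 512-bit instruction versus sixteen scalar adds the front end would otherwise have to fetch and decode:
#include <immintrin.h>

// One 512-bit add covers 16 float lanes.
void add16_simd(float* dst, const float* a, const float* b) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    _mm512_storeu_ps(dst, _mm512_add_ps(va, vb));   // a single add instruction
}

// The scalar equivalent needs ~16 separate add instructions (plus loads/stores).
void add16_scalar(float* dst, const float* a, const float* b) {
    for (int i = 0; i < 16; ++i)
        dst[i] = a[i] + b[i];
}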
Modern compilers are sometimes able to vectorize regular code, but this happens only occasionally, since compilers often can't prove that read/write operations will access valid memory regions. So one still needs to write code in such a way that the compiler can vectorize it, but that approach isn't reliable, and it's better to use SIMD instructions directly to be sure.
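One common way to help the compiler (my own sketch; whether it actually vectorizes depends on the compiler and flags) is restrict-qualified pointers, which let it assume the accesses don't overlap:
// __restrict is a widespread compiler extension in C++ (restrict in C99).
// It promises dst and src don't alias, which removes one common obstacle
// to auto-vectorization of this loop (e.g. at -O3 on GCC/Clang).
void scale(float* __restrict dst, const float* __restrict src, int n, float k) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}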
They have been in CPUs for so long that I expected them to be so inseparable that people wouldn't even remember they were ever a separate thing.