The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 shipped with Haswell back in 2013.
You can treat both SVE and RVV as a regular fixed-width SIMD ISA.
"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.
If you want to emulate AVX2 with SVE or RVV, you can require that the hardware has a native vector length >=256 bits and then always mask off everything beyond 256 bits; the same code then works on any native vector length >=256.
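A minimal sketch of that idea with the SVE ACLE intrinsics (assumes hardware with a native vector length >= 256 bits and compiling with SVE enabled, e.g. -march=armv8-a+sve): svwhilelt pins the predicate to the first 8 x 32-bit lanes, i.e. 256 bits, so the same binary behaves identically on any wider implementation.

    #include <arm_sve.h>

    /* Process exactly 8 floats (256 bits) per call, regardless of the
     * machine's native SVE vector length, by capping the predicate. */
    void add256(float *dst, const float *a, const float *b) {
        svbool_t p = svwhilelt_b32_s32(0, 8);  /* lanes 0..7 active only */
        svfloat32_t va = svld1_f32(p, a);
        svfloat32_t vb = svld1_f32(p, b);
        svst1_f32(p, dst, svadd_f32_x(p, va, vb));
    }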
Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.
ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.
True, you can't in the general case, which is annoying, but you can if you compile for a specific vector length.
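Concretely, here's what the fixed-length escape hatch looks like (a sketch; the struct and function names are made up). With GCC/Clang's -msve-vector-bits=256, the ACLE's arm_sve_vector_bits attribute gives a sizeless SVE type a compile-time size, so it can go on the stack, into structs, and be passed by value:

    #include <arm_sve.h>

    /* Only valid when compiled with -msve-vector-bits=256; the attribute
     * freezes the sizeless svfloat32_t into a 256-bit fixed-length type. */
    typedef svfloat32_t vec256 __attribute__((arm_sve_vector_bits(256)));

    struct particle {
        vec256 pos;  /* legal: vec256 now has a known size */
        vec256 vel;
    };

    vec256 step(vec256 p, vec256 v) {  /* by-value args and return */
        return svadd_f32_x(svptrue_b32(), p, v);
    }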
This is mostly a library structure problem. E.g. simdjson has a generic backend that assumes a fixed vector length, and I've written fixed-width RVV support for it. A vector-length-agnostic backend is also possible, but requires writing a full new backend. I'm planning to write it in the future (I already have a few json::minify implementations), but it will be more work. If the generic backend used a SIMD abstraction that supports scalable vectors, like Highway, this wouldn't be a problem.
Toolchain support should also be improved: e.g. you could make all vregs take 512 bits on the stack, but have the codegen only use the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and all 512 bits if you have >=512-bit vregs.
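For a flavor of the vector-length-agnostic style, here's a simplified minify-like kernel with the RVV 1.0 intrinsics (a sketch only: it drops spaces but ignores string state, which a real json::minify has to track):

    #include <riscv_vector.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy all non-space bytes from src to dst, working in chunks of
     * whatever vector length the hardware provides. Returns output size. */
    size_t drop_spaces(uint8_t *dst, const uint8_t *src, size_t n) {
        size_t out = 0;
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e8m8(n);
            vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl);
            vbool1_t keep = __riscv_vmsne_vx_u8m8_b1(v, ' ', vl);
            vuint8m8_t packed = __riscv_vcompress_vm_u8m8(v, keep, vl);
            size_t kept = __riscv_vcpop_m_b1(keep, vl);
            __riscv_vse8_v_u8m8(dst + out, packed, kept);
            out += kept;
            src += vl;
            n -= vl;
        }
        return out;
    }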
SVE theoretically supports hardware up to 2048-bit, so conservatively reserving the worst-case size at compile time would be pretty wasteful. That's 16x overhead in the base case of 128-bit hardware.
https://ashvardanian.com/posts/aws-graviton-checksums-on-neo...
Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.
Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.
The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.
See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...
But yes, RVV already has more diverse vector width hardware than SVE.
Very wide SIMD instructions require a lot of die space and a lot of power.
The AVX-512 implementation in Intel's Knights Landing took up 40% of the die area (source: https://chipsandcheese.com/p/knights-landing-atom-with-avx-5..., an excellent site for architectural analysis).
Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.
Most ARM server parts are designed to have very high core counts, which requires keeping individual cores small. Adding very wide SIMD support would grow each core's die area considerably and reduce the number that could go into a single package.
Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.
Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores as a tradeoff to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases of high throughput SIMD code are just moving to GPUs anyway.
Buy new chips next year! Haha :)
In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.
Also, it doesn't just speed up explicit vector math. Compilers that know about these extensions can auto-vectorize your code, so it has the potential to speed up every for-loop you write.
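For example, a plain loop like this one (nothing SIMD in the source) gets compiled to 256-bit vector instructions by GCC and Clang at -O2 -mavx2; easy to verify on godbolt.org or with -fopt-info-vec:

    /* Auto-vectorizes: restrict tells the compiler the arrays don't
     * alias, so it can emit 256-bit adds instead of scalar ones. */
    void add_arrays(int *restrict dst, const int *restrict a,
                    const int *restrict b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }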
So operations that are not performance critical and are needed once or twice every hour? Are you sure you don't want to include a dedicated cluster of RTX 6090 Ti GPUs to speed them up?
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
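A minimal example of the lookup-table point: vpshufb treats each 128-bit lane as a 16-entry byte table, so a 16-entry LUT can be applied to 32 bytes per instruction. Here it maps values 0-15 to ASCII hex digits:

    #include <immintrin.h>

    /* _mm256_shuffle_epi8 indexes the table by the low 4 bits of each
     * element (per 128-bit lane), so both lanes carry the same 16 entries. */
    __m256i nibbles_to_hex(__m256i nibbles) {
        const __m256i lut = _mm256_setr_epi8(
            '0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f',
            '0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f');
        return _mm256_shuffle_epi8(lut, nibbles);
    }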
So I'd bet the issue is either newness of the codebase, as the article suggests, or perhaps that it's harder to schedule the work in 256-bit chunks than in 128-bit ones. It's got to be easier when you've got more than enough NEON q registers to handle the xmms, and harder when you've got only exactly enough to pair up for handling the ymms?
That would be plain AVX; AVX2 has shuffles across the 128-bit boundary. To me that seems like the main hurdle for emulation with 128-bit vectors: in my experience compilers are very eager to emit shuffle instructions if allowed, and emulating a 256-bit shuffle with 128-bit operations would require two shuffles and a blend for each half of the emulated register.
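For what it's worth, AArch64 has a partial shortcut here: TBL accepts a two-register (32-byte) table, so a fully general 256-bit byte shuffle can be done in one TBL per half instead of two shuffles plus a blend. A sketch (the v256 struct is made up):

    #include <arm_neon.h>

    typedef struct { uint8x16_t lo, hi; } v256;  /* emulated 256-bit reg */

    /* Each output half does one table lookup across all 32 source bytes;
     * out-of-range indices (>= 32) produce zero, much like vpshufb's
     * high-bit zeroing. */
    static v256 shuffle256(v256 v, v256 idx) {
        uint8x16x2_t table = { { v.lo, v.hi } };
        v256 r;
        r.lo = vqtbl2q_u8(table, idx.lo);
        r.hi = vqtbl2q_u8(table, idx.hi);
        return r;
    }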
The way the vector registers were extended to 256 bits causes problems when legacy 128-bit and 256-bit ops are mixed. Doing so puts the CPU into a mode where all legacy 128-bit ops are forced to blend in the high half, which can reduce the throughput of existing SSE2-based library routines to as little as 1/4. For this reason, AVX code has to aggressively use the VZEROUPPER instruction to ensure that the CPU is not left in AVX 256-bit vector mode before possibly returning to any library or external code that uses SSE2. VZEROUPPER sets a flag to zero the high half of all 256-bit registers, so it's cheap on modern x86 CPUs but can be expensive to emulate without hardware support.
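In intrinsics form the pattern looks like this (a sketch; compilers building with -mavx generally insert the vzeroupper at function boundaries themselves, it's written out here to make the state transition visible):

    #include <immintrin.h>
    #include <stddef.h>

    void scale8(float *p, size_t n, float k) {
        __m256 vk = _mm256_set1_ps(k);
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(p + i);
            _mm256_storeu_ps(p + i, _mm256_mul_ps(v, vk));
        }
        /* Leave the "clean upper" state so a caller running legacy
         * SSE2 code doesn't eat blend/false-dependency penalties. */
        _mm256_zeroupper();
    }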
The other problem is that only the low 128 bits of vector registers are preserved across function calls due to the Windows x64 calling convention and the VZEROUPPER issue. This means that practically any call to external code forces the compiler to spill all AVX vectors to memory. Ideally 256-bit vector usage is concentrated in leaf routines so this isn't an issue, but where used in non-leaf routines, it can result in a lot of memory traffic.
That set matches the x86-64-v2 microarchitecture level. Most of the article uses 'v2', 'v3', or 'x86-64-v2', but I thought more people would be familiar with the names of the instruction sets than with the fact that x86-64 is versioned. The version levels only appeared quite recently (2020) and are rather retroactive.
> Yes, it is absolutely key to build your app as ARM, not to rely on Windows ARM emulation.
Using AVX2 and using an emulator have contradictory goals. Of course there can be a better emulator, or hardware actually designed to match (both Apple and Microsoft exploit the similar register structure between ARM64 and x86_64), but that means increased complexity and reduced reliability/predictability.
I put a spoiler at the top too, to avoid making people read the whole thing. The real bit is that chart, which I think is quite an amazing result.
You're right re building. We're a compiler vendor, so we have a natural interest in what people should be targeting. But even for us the results here were not what we expected ahead of time.
Most of the world lives on $300 per month.
Other Settings > AVX2 > 95.11% supported (+0.30% this month)
Maybe you're thinking of AVX-512 or AVX10?
AVX512 is so kludgy that it usually ends up hurting performance, because the extreme power requirements trigger thermal throttling.
So, in order to make use of users new fancy hardware without abandoning other users old and busted hardware, you have to support multiple back-ends. Same as it ever was.
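And the classic answer is still runtime dispatch. A minimal sketch with the GCC/Clang builtins (sum_avx2/sum_sse2 are hypothetical backends, each compiled in its own translation unit with the matching -m flags):

    /* Pick the widest backend the machine actually supports, once. */
    void sum_avx2(const float *a, int n);
    void sum_sse2(const float *a, int n);

    static void (*sum_impl)(const float *, int);

    void init_dispatch(void) {
        __builtin_cpu_init();  /* needed if this runs before main / in a ctor */
        sum_impl = __builtin_cpu_supports("avx2") ? sum_avx2 : sum_sse2;
    }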
Actually, it's a lot easier today than it ever was. Doom 3 famously required Carmack to reimplement the renderer six times to get the same results out of six different styles of GPU that were popular at the time:
ARB (basic fallback, R100): multi-pass, minimal effects, no specular
NV10 (GeForce 2 / 4 MX): 5 passes, register combiners
NV20 (GeForce 3 / 4 Ti): 2-3 passes, vertex programs + combiners
R200 (Radeon 8500-9200): 1 pass, ATI_fragment_shader
NV30 (GeForce FX series): 1 pass, precision optimizations (FP16)
ARB2 (Radeon 9500+ / GeForce 6+): 1 pass, standard high-end GLSL-like assembly
But there's plenty in AVX512 that really helps real algorithms outside the 512-bit registers. I think it would have been perceived very differently if it had initially been just the new instructions on the same 256-bit registers (i.e. AVX10), then extended to 512 bits as transistor/power budgets allowed. AVX512 simply tied too many things together too early instead of being an incremental extension.
AVX512 leading to thermal throttling is a common myth that, from what I can tell, traces its origins to a blog post about clock throttling on a particular set of low-TDP SKUs from the first generation of Xeon CPUs that supported it (Skylake-X), released back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...
The results were debated shortly afterwards by well-known SIMD authors who were unable to duplicate them: https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...
In practice, this has not been an issue for a long time, if ever; clock frequency scaling for AVX modes has been continually improved in subsequent Intel CPU generations (and even more so in AMD Zen 4/5 once AVX512 support was added).
All AMD Zen 4 and Zen 5 CPUs, and all Intel CPUs since Ice Lake that support AVX-512, benefit greatly from using it in any application.
Moreover, the AMD Zen CPUs have demonstrated clearly that for vector operations the instruction-set architecture really matters a lot. Unlike the Intel CPUs, the AMD CPUs use exactly the same execution units regardless of whether they execute AVX2 or AVX-512 instructions. Despite this, their speed increases a lot when executing programs compiled for AVX-512 (partly from eliminating bottlenecks in instruction fetching and decoding, and partly because the AVX-512 instruction set is better designed, not only wider).
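One concrete instance of "better designed, not only wider": AVX-512's mask registers also work on 256-bit vectors via AVX-512VL, so a loop can predicate its tail instead of falling back to a scalar epilogue, something plain AVX2 can't express directly. A sketch:

    #include <immintrin.h>

    /* Requires AVX-512F + AVX-512VL. The final partial iteration is
     * handled by the mask; no scalar tail loop is needed. */
    void add_f32(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 8) {
            __mmask8 m = (n - i >= 8) ? (__mmask8)0xFF
                                      : (__mmask8)((1u << (n - i)) - 1);
            __m256 va = _mm256_maskz_loadu_ps(m, a + i);
            __m256 vb = _mm256_maskz_loadu_ps(m, b + i);
            _mm256_mask_storeu_ps(dst + i, m, _mm256_add_ps(va, vb));
        }
    }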
Edit: furthermore, I think that none of these (pre-2020) low-budget CPUs supported AVX2 until Tiger Lake released in 2020.
Edit 2: that's wrong, Jasper Lake from 2021 also came without AVX support.