> The other drawback of this method is that the optimizer won’t even touch anything involving floats (f32 and f64 types). It’s not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it’s currently nightly-only).
Ah - this makes a lot of sense. I've had zero trouble getting excellent performance out of Julia using autovectorization (from LLVM) so I was wondering why this was such a "thing" in Rust. I wonder if that nightly feature is a per-crate setting or what?
The reason it’s a thing comes from LLVM, and I’m not sure you can “language design” your way out of this problem, as it seems intrinsic to IEEE 754.
a+b+c != c+b+a
That’s why you need techniques like Kahan summation.
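For reference, a minimal sketch of Kahan (compensated) summation in Rust; the function name and the sample data are just for illustration:

    // Kahan summation: carry a running compensation term that captures the
    // low-order bits lost when adding a small value to a large accumulator.
    fn kahan_sum(xs: &[f64]) -> f64 {
        let mut sum = 0.0;
        let mut c = 0.0; // running compensation for lost low-order bits
        for &x in xs {
            let y = x - c;     // subtract the previously lost error
            let t = sum + y;   // big + small: low-order bits of y are lost
            c = (t - sum) - y; // recover exactly what was lost
            sum = t;
        }
        sum
    }

    fn main() {
        let xs = [1e16, 1.0, 1.0, -1e16];
        println!("naive: {}", xs.iter().sum::<f64>()); // prints 0
        println!("kahan: {}", kahan_sum(&xs));         // prints 2
    }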
+ is a binary operation, and a+b+c can’t be interpreted without knowing whether one treats + as left-associative or right-associative. Let’s assume the former: a+b+c really means (a+b)+c.
If + is commutative, you can turn (a+b)+c into (b+a)+c or c+(a+b) or (commuting twice) c+(b+a).
But that last expression is not the same thing as (c+b)+a. Getting there requires associativity, and floating point addition is not associative.
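A two-line Rust check of exactly that, with values picked only for illustration:

    fn main() {
        let (a, b, c) = (0.1f64, 0.2f64, 0.3f64);
        // Same operands, different grouping, different rounding:
        println!("{}", (a + b) + c == a + (b + c)); // false
        println!("{:.17} vs {:.17}", (a + b) + c, a + (b + c));
    }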
IEEE 754 operations are nonassociative, but they are commutative (at least if you ignore the effect of NaN payloads).
Commutativity says that a*b = b*a, but that's not enough to allow arbitrary reordering. When you write a*b*c, then depending on whether * is left- or right-associative, that means either a*(b*c) or (a*b)*c. If those are equal, we say the operation is associative. You need both properties to allow arbitrary reordering. If an operation is only commutative, you can turn a*(b*c) into a*(c*b) or (b*c)*a, but there is no way to put a in the middle.
Edit: Wikipedia actually says associativity is definitionally about changing parens[0]. Mostly amounts to the same thing for standard arithmetic operators, but it’s an interesting distinction.
Rounding and eventual underflow in IEEE 754 mean an expression X•Y, for any algebraic operation •, produces, if finite, a result (X•Y)·(1 + ß) + µ, where |µ| cannot exceed half the smallest gap between numbers in the destination's format, |ß| < 2^-N, and ß·µ = 0 (µ ≠ 0 only when underflow occurs).
And yes, that is a binary operation only.
a•b•c is really (a•b)•c, assuming left associativity of the operator; associativity is one of the properties that IEEE 754 arithmetic doesn't have.
But remember that commutativity applies to the operations (+, ×), which are binary operations: a+b = b+a and ab = ba. You can still get accumulated rounding errors on iterated forms of those binary operations.
https://doc.rust-lang.org/std/intrinsics/index.html
Specifically the *_fast intrinsics.
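For reference, a minimal nightly-only sketch of what that looks like, assuming the core_intrinsics feature gate and the current fadd_fast name (both are unstable, so the exact path may differ on your toolchain):

    #![feature(core_intrinsics)]
    #![allow(internal_features)]

    fn sum_fast(xs: &[f64]) -> f64 {
        let mut acc = 0.0;
        for &x in xs {
            // fadd_fast lets LLVM reassociate this addition, which is what
            // allows vectorizing the reduction; unsafe because LLVM may
            // assume the inputs and results are finite.
            acc = unsafe { std::intrinsics::fadd_fast(acc, x) };
        }
        acc
    }

    fn main() {
        println!("{}", sum_fast(&[1.0, 2.0, 3.0, 4.0]));
    }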
I wonder if it could autovec the simd-ordered code.
For loops without such dependencies Rust should autovectorize just fine as with any other element type.
Edit: go to godbolt and load the Rust aligned-sum example and play around with types. If you see addps, that's the packed single-precision SIMD add instruction. The more you get packed, the higher your score! You'll need to pass some extra arguments they don't list to get AVX-512-sized registers instead of the xmm or ymm ones. And not all the instances they use support AVX-512, so sometimes you have to try a couple of times.
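For example, an elementwise loop like this one (my own toy example, not the exact godbolt snippet) should show addps/vaddps at -C opt-level=3, and something like -C target-feature=+avx512f or -C target-cpu=x86-64-v4 is the kind of extra argument needed to see the 512-bit forms:

    // Elementwise add: no cross-iteration float dependency, so LLVM is free
    // to emit packed adds (addps / vaddps) for the loop body.
    pub fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
        for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
            *o = x + y;
        }
    }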
I thought Rust was getting financial support from Google, Microsoft and Mozilla? Or was the Rust Foundation just a convenient way for Mozilla to fire a large number of developers, and are we actually rapidly approaching the OpenSSL Heartbleed state, where everyone is happily building on a secure foundation that is maintained by a half-dead intern when he isn't busy begging for scraps on the street?
Since then, several Rust project developers did find full-time jobs at companies like Amazon, Meta, etc. But corporate interest ebbs and flows. For example, Huawei employed a couple of engineers to improve the compiler for several years, but they lost interest a couple of months ago.
The Rust Foundation is powered by donations, but a lot of its spending has gone to infrastructure, security, and legal expenses. This problem of funding the maintainers of the project is on their radar, though. Yesterday they started an initiative to fundraise for the Maintainers Fund, with the money going straight to maintainers who aren't being paid by their employer to work on the project full time. (https://rustfoundation.org/media/announcing-the-rust-foundat...)
Of the largest corporations, it's the one that contributes the most to open source, so it's one of my favourites because of that.
they were also one of the first of the large corps to show interest in Rust
Another example: note how these implementations, one in unsafe C# and another in safe F#, have almost identical performance: https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
I’m sure more people than ever are working on the compiler. What’s going on?
The structure is unlike a traditional company. In a traditional company, the managers decide the priorities and direct the employees on what to work on while facilitating that work. While there are people in more managerial-type positions working on the Rust compiler, their job is not to tell the volunteers what to work on (they cannot), but instead to help the volunteers accomplish whatever it is they want to do.
I don't know about std::simd specifically, but for many features, it's simply a case of "none of the very small number of people working on the rust compiler have prioritized it".
I do wish there was a bounty system, where people could say "I really want std::simd so I'll pay $5,000 to the rust foundation if it gets stabilized". If enough people did that I'm sure they could find a way to make it happen. But I think realistically, very few people would be willing to put up even a cent for the features they want. I hear a lot of people wishing for better const generics, but only 27 people have set up a donation to boxy (lead of the const generics group https://github.com/sponsors/BoxyUwU ).
Seems smart to put the language as a requirement for compiling the Linux kernel and a bunch of other core projects then!
Weren't a bunch of modules deprecated recently as a consequence of intel layoffs?
Corporate-sponsored contributions are probably the majority, but I don't think true volunteers are super-rare. In both cases, though, they're a "volunteer" from the perspective of the Linux leadership: they're contributing the changes that they want to make, not acting as employees of Linux who can be directed to work on the things that leadership thinks are important.
(And conversely it's the same with Rust - a lot of those volunteer contributors work for employers who want Rust to have some specific functionality, so they employ someone to work on that)
Leaving aside any specific blockers:
- It's a massive hard problem, to build a portable abstraction layer over the SIMD capabilities of various CPUs.
- It's a massive balance between performance and usability, and people care deeply about both.
- It's subject to Rust's stability guarantee for the standard library: once we ship it, we can't fix any API issues.
- There are already portable SIMD libraries in the ecosystem, which aren't subject to that stability guarantee as they can ship new semver-major versions. (One of these days, I hope we have ways to do that for the standard library.)
- Many people already use non-portable SIMD for the 1-3 targets they care about, instead.
This is something a lot of people (myself included) have gotten tripped up by. Non-portable SIMD intrinsics have been stable under std::arch for a long time. Obviously they aren't nearly as nice to work with, but if you're in a place where you need explicit SIMD speed-ups, that probably isn't a killer.
The thing that isn't stable in the standard library is the portable abstraction layer atop those. But several of those exist in the community.
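For concreteness, the stable, non-portable route on x86-64 looks roughly like this; the function names are mine, while the intrinsics, #[target_feature], and is_x86_feature_detected! are the stable std::arch pieces:

    // Runtime-dispatching wrapper: uses AVX when available, otherwise the
    // plain scalar loop (which is also the only path on non-x86 targets).
    fn add_f32(a: &[f32], b: &[f32], out: &mut [f32]) {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx") {
                // Safe only because we just checked the feature at runtime.
                unsafe { add_f32_avx(a, b, out) };
                return;
            }
        }
        for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
            *o = x + y;
        }
    }

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    unsafe fn add_f32_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
        use std::arch::x86_64::*;
        let n = a.len().min(b.len()).min(out.len());
        let mut i = 0;
        while i + 8 <= n {
            // 8 f32 lanes per 256-bit ymm register.
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
            i += 8;
        }
        while i < n {
            out[i] = a[i] + b[i]; // scalar tail
            i += 1;
        }
    }

    fn main() {
        let a = [1.0f32; 11];
        let b = [2.0f32; 11];
        let mut out = [0.0f32; 11];
        add_f32(&a, &b, &mut out);
        println!("{:?}", out); // all 3.0
    }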
Can’t APIs be fixed between editions?
1. A high bar for quality in std
2. Dependencies on other unstable features
3. Known bugs
4. Conflicts with other unstable features
It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.
I think there is also some sampling bias. Tons of features get stabilized, but you are much more likely to notice a nightly feature that is unstable for a long time and complex enough to be excited about.
Yep, and this is why many features die or linger on forever. Getting trait solving working correctly across types and soundly across lifetimes is complicated enough to have killed several features previously (like specialization/min_specialization). It was the reason async trait took so long and why GATs were so important.
AFAIK that’s not a blocker for Rust - the std library is allowed to use unstable at all times.
Too many promises have been made.
Rust needs more unsafe opt-outs. Ironically, SIMD has this, so it does not bother me.
However, like other commenters I assume it’s because it’s hard, not all that many users of Rust really need it, and the compiler team is small and only consists of volunteers.
Most interesting performance optimizations from vector ISAs can't be done by the compiler.
And since Rust basically sells itself on high standards (zero-cost abstractions, etc.) the devs go back and forth until it feels like the solution is handed down from the heavens.
There is nothing blocking high quality SIMD libraries on stable in Rust today. The bar for inclusion in std is just much higher than the rest of the ecosystem.
1. It is highly dependent on LLVM intrinsics, which themselves can change quite a lot. Sometimes an intrinsic would even fail to instantiate and crash the entire compilation. I, for example, hit chronic ICEs for the same code across different nightly Rust versions. Then I realized it was because the SIMD operation was too complicated and I needed to simplify it, and sometimes stop recursing and expanding so much, to prevent stack spilling and exhausting register allocation.
This happens from time to time, especially when using std::simd with embedded targets where registers are scarce.
2. Some hardware design decisions make SIMD itself unergonomic and hard to generalize, and this is reflected in the design of std::simd as well.
Recall that SIMD techniques stem from vector processors in supercomputers from the likes of Cray and IBM, back in the 70s, when computation and hardware design were primitive and simple, so they had fixed vector sizes.
That ancient design is very stable and has been kept to this day, even in the likes of AVX2, AVX-512, VFP and NEON, and it influenced the design of things like the lane count (https://doc.rust-lang.org/std/simd/struct.LaneCount.html).
But plot twist: as time goes on, it turns out that modern SIMD is capable of variable sizes; RISC-V's vector extension is one such implementation, for example.
So now we come to a dilemma: keep the existing fixed-lane-count design, or allow it to extend further. If we extend it to cater for things like variable SIMD vector lengths, then we need to wait for generic_const_exprs to be stable, and right now it is not only unstable but incomplete too (https://github.com/rust-lang/portable-simd/issues/416).
This is a hard philosophical design change and is not easy to deal with. Time will tell.
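To make the lane-count point concrete: generic code over std::simd today has to pin the width to a compile-time constant drawn from a fixed set of supported values, which is exactly what a runtime-sized RVV-style vector doesn't fit into. A rough nightly sketch (APIs move around between nightlies):

    #![feature(portable_simd)]
    use std::simd::{LaneCount, Simd, SupportedLaneCount};

    // Generic over a compile-time lane count: N must be one of the fixed
    // supported widths (currently up to 64 lanes), chosen at compile time.
    fn add_all<const N: usize>(a: Simd<f32, N>, b: Simd<f32, N>) -> Simd<f32, N>
    where
        LaneCount<N>: SupportedLaneCount,
    {
        a + b
    }

    fn main() {
        let x = Simd::<f32, 8>::splat(1.0);
        let y = Simd::<f32, 8>::splat(2.0);
        println!("{:?}", add_all(x, y));
    }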
3. As an extension to #2, thinking in SIMD is hard in the very first place, and to use it in production you even have to think about different deployment situations. This comes in the form of dynamic dispatch, and it is a pain to deal with; although we have great helpers such as multiversion... it is still very hard to design an interface that scales. Take Google's Highway (https://github.com/google/highway/blob/master/g3doc/quick_re...) for example: it is the library for writing portable SIMD code with dynamic dispatch in C++, but in an esoteric and not so ergonomic way. How we could do better with std::simd is still an open question. How do you abstract the idea of a scatter-gather operation? What the heck is a swizzle? Why do we call it shuffle and not permutation? Lots of stuff to learn, and that means lots of pain to go through.
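For anyone wondering what those terms even mean in std::simd, a rough nightly sketch (exact macro and trait paths have shifted between nightlies, so treat the imports as approximate):

    #![feature(portable_simd)]
    use std::simd::{simd_swizzle, Simd};

    fn main() {
        let v = Simd::from_array([10.0f32, 11.0, 12.0, 13.0]);

        // "Swizzle"/"shuffle"/"permute": rearrange lanes by compile-time index.
        let reversed = simd_swizzle!(v, [3, 2, 1, 0]);
        println!("{:?}", reversed); // [13.0, 12.0, 11.0, 10.0]

        // "Gather": load lanes from arbitrary (possibly out-of-bounds) indices,
        // falling back to a default value where the index is invalid.
        let data = [1.0f32, 2.0, 3.0, 4.0, 5.0];
        let idx = Simd::from_array([4usize, 0, 2, 9]);
        let gathered = Simd::gather_or(&data, idx, Simd::splat(0.0));
        println!("{:?}", gathered); // [5.0, 1.0, 3.0, 0.0]
    }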
4. Plus, when you think in SIMD, there can be multiple instructions and multiple ways to do the same thing, one of which may be more efficient than the others.
For example, as I have to touch some finite field stuff in GF(2^8), there are a few ways to do finite field multiplication:
a. Precomputed table lookup
b. Russian Peasant Multiplication (basically carryless Karatsuba multiplication, though it often reduces to a form of table lookups as well; it can also be seen as a ripple counter with modulo arithmetic, except the carry has to be delivered in a different way; see the sketch after this list)
c. Do an inner product and then do Barrett reduction (https://www.esat.kuleuven.be/cosic/publications/article-1115...)
d. Or just treat it as multiplication over a polynomial power series, but this essentially means we treat it as a finite field convolution, which I suspect is highly related to the Fourier transform. (https://arxiv.org/pdf/1102.4772)
e. Use the somewhat new GF2P8AFFINEQB (https://www.felixcloutier.com/x86/gf2p8affineqb) from GFNI which, contrary to what most people think, is not AVX-512-only but is actually available for SSE/AVX/AVX2 as well (this is called GFNI-SSE in gcc), so it works on my 13600KF too (except I obviously cannot use ZMM registers, or I just get an illegal instruction for anything that touches ZMM or uses the EVEX encoding). I have an internal implementation of finite field multiplication using just that, but I need to use the polynomial 0x11D rather than 0x11B, so GF2P8MULB (https://www.felixcloutier.com/x86/gf2p8mulb) is out of the question (it is supposed to be the fastest in the world, theoretically, if we could use an arbitrary polynomial), but this is rather hard to understand and explain in the first place. (By the way, I used SIMDe for that: https://github.com/simd-everywhere/simde)
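Since (b) is the easiest to show in a few lines, here is a scalar sketch of that shift-and-XOR multiply in GF(2^8) with the 0x11D polynomial mentioned in (e); the function name and test values are mine, and the SIMD variants apply the same idea lane-wise:

    // Multiply two GF(2^8) elements with the shift-and-XOR ("Russian Peasant")
    // method, reducing modulo the field polynomial whenever the degree
    // overflows 8 bits. 0x11D is x^8 + x^4 + x^3 + x^2 + 1.
    fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
        const POLY: u8 = 0x1D; // low 8 bits of 0x11D
        let mut acc = 0u8;
        while b != 0 {
            if b & 1 != 0 {
                acc ^= a; // "add" (XOR) the current shifted copy of a
            }
            let carry = a & 0x80 != 0;
            a <<= 1; // multiply a by x
            if carry {
                a ^= POLY; // reduce modulo the field polynomial
            }
            b >>= 1;
        }
        acc
    }

    fn main() {
        // Sanity check: x * x = x^2, i.e. 2 * 2 = 4 in GF(2^8).
        assert_eq!(gf256_mul(2, 2), 4);
        println!("{:#04x}", gf256_mul(0x53, 0xCA));
    }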
All of these can be done in SIMD, but each of these methods has its pros and cons. Table lookup may be fast and seemingly O(1), but you actually need to keep the table in cache, meaning we trade space for time, and SIMD would amplify the cache thrashing from multiple accesses. This could slow down the CPU pipeline, although modern CPUs are clever enough about cache management. If you want to do Russian Peasant Multiplication, then you need a bunch of loops to go through the divisions and XORs chunk by chunk.
If you want Barrett reduction, then you need an efficient carryless multiplication such as PCLMULQDQ (https://www.felixcloutier.com/x86/pclmulqdq) to do the inner product and reduce the polynomial. Or, a more primitive way: find ways to do finite field Horner's method in SIMD...
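As a tiny illustration of the carryless-multiply building block (just the raw product of two bytes, not the full Barrett reduction), this is roughly how PCLMULQDQ gets invoked on x86-64; the helper name is mine and this is a sketch, not a tuned kernel:

    // Carryless (XOR-carry) multiply of two GF(2) polynomials held in bytes.
    // The product of two degree-7 polynomials fits in the low 15 bits.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "pclmulqdq")]
    unsafe fn clmul_u8(a: u8, b: u8) -> u16 {
        use std::arch::x86_64::*;
        let va = _mm_cvtsi32_si128(a as i32);
        let vb = _mm_cvtsi32_si128(b as i32);
        // 64x64 -> 128-bit carryless multiply of the low quadwords.
        let prod = _mm_clmulepi64_si128(va, vb, 0x00);
        _mm_cvtsi128_si32(prod) as u16
    }

    #[cfg(target_arch = "x86_64")]
    fn main() {
        if is_x86_feature_detected!("pclmulqdq") {
            // 0x53 and 0xCA as polynomials, multiplied without reduction.
            println!("{:#06x}", unsafe { clmul_u8(0x53, 0xCA) });
        }
    }

    #[cfg(not(target_arch = "x86_64"))]
    fn main() {}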
How to think in SIMD is already hard as said in #3. How to balance in SIMD like this is even harder.
Unless you want to have a certain edge, or want to shatter a benchmark, I would say SIMD is not a good investment. You need to use SIMD in the right scenario at the right time. SIMD is useful, but also kind of niche, and modern CPUs are optimized well enough that the performance of general solutions without SIMD is good enough too, since everything eventually gets dumped down to uops anyway, with the deep pipeline, branch predictor, superscalar and speculative execution doing their magic together. And most of the time, if you do want SIMD, the easiest SIMD methods are generally enough.
*: I myself used std::simd intensively in my own project. Well, the paper got rejected, on the grounds that it was severely lacking in literature review and that I shouldn't have used an LLM so much to generate it.
However, the code is here (https://github.com/stevefan1999-personal/sigmah). Now I have a new approach to this problem, derived from my current work with finite fields, error correction, divide and conquer, and polynomial multiplication, and I plan to resubmit the paper once I have time to clean it up, with a more careful approach next time too. Although, the problem of string matching with don't-cares can be seen as a convolution, and I suspect my approach will end up as something like that... which would make the paper still unworthy of acceptance.
A quick comment on this one point (personal opinion): from a hyperscaler's perspective, scalar code is most certainly not enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual operation it performs. It is important to amortize that cost over many elements (i.e. SIMD).
I also added packing and unpacking helpers that assist with handling final lane 0 values, etc. But there is still some subtlety compared to using Rayon or non-SIMD CPU code, as the article pointed out, related to packing and unpacking. E.g. you should try to keep things in their SIMD form throughout the whole pipeline, think about how you pair them with non-SIMD values (like you might pair [T; 8] with f32x8), etc.
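In case it helps, the basic pack/compute/unpack shape with std::simd looks roughly like this (nightly; the trait providing reduce_sum has lived at both std::simd::SimdFloat and std::simd::num::SimdFloat depending on the nightly, so adjust the import):

    #![feature(portable_simd)]
    use std::simd::f32x8;
    use std::simd::num::SimdFloat; // provides reduce_sum (path varies by nightly)

    // Process full 8-lane chunks in SIMD form and only drop back to scalar
    // code for the remainder, instead of packing/unpacking per element.
    fn sum(xs: &[f32]) -> f32 {
        let chunks = xs.chunks_exact(8);
        let tail = chunks.remainder();
        let mut acc = f32x8::splat(0.0);
        for chunk in chunks {
            acc += f32x8::from_slice(chunk);
        }
        acc.reduce_sum() + tail.iter().sum::<f32>()
    }

    fn main() {
        let xs: Vec<f32> = (0..19).map(|i| i as f32).collect();
        println!("{}", sum(&xs)); // 171
    }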
Can't you just make a local copy of the existing package and use that? Did you need to re-implement?
I thought the intrinsics specifically were available in plain safe Rust, and the alignment-requiring intrinsics were allowed in unsafe Rust. I'm not sure I understand this "direct to LLVM dispatch" argument or how that isn't accessible to stable Rust today.
e.g. all core::simd addition ends up invoking a single function [1] which is then handled directly by rustc. But these architecture-agnostic intrinsics are unstable [2] (as they're only there as a building block for core::simd), and you can't manually use #[rustc_intrinsic] & co in stable Rust either.
[1]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...
[2]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...
Zig has @Vector. This is a builtin, so it gets resolved at comptime. Is the problem with Rust here too much abstraction?
This matches the conclusion we reached for Chromium. We were okay with nightly, so we're using `std::simd` but trying to avoid the least stable APIs. More details: https://docs.google.com/document/d/1lh9x43gtqXFh5bP1LeYevWj0...
How many people are writing somewhat bog-standard Rust/C and expect optimal assembly to be created?
In the end, the only bits where I resorted to assembly were the ones where it wouldn't make any sense to write stuff in C. Bootloaders, for instance: when all you have to work with is 512 bytes, the space/speed constraints are much more on the space side, and that's where I find I still have a (slight) edge. Which I guess means that 'optimal' is context dependent and that the typical 'optimal' defaults to 'speed'.
I've literally had debates with people that thought a CSV file would be smaller than holding the same data in memory. Senior level developers at both startups and established companies. My hunch is they had only ever done this using object oriented modeling of the data. Worse, usually in something like python, where everything is default boxed to hell and back.
> How many people are writing somewhat bog-standard Rust/C and expect optimal assembly to be created?
As for:
> I don't necessarily think "low level" has to be "writing assembly." I do think it means, "knows the full layout of the data in memory." Something a surprising number of developers do not know.
Agreed. But then again, there are ton of things a surprising number of developers don't know, this is just another one of those.
Similar stuff:
- computers always work
- the CPU is executing my instructions one after the other in the same order in which they are written on the line(s)
- the CPU is executing my instructions only one at the time
- if I have a switch (or whatever that construct is called in $language) I don't need to check for values I do not expect because that will never happen
- the data I just read in is as I expect it to be
You can probably extend that list forever.
Your CSV example is an interesting one; I can think of cases where both could be true, depending on the kind of character encoding used and the way the language deals with that encoding. For instance, in a language where, upon reading that file, all of the data is turned into UTF-16, then indeed the in-memory representation of a plain ASCII CSV file could well be larger than the input file. Conversely, if the file contained newlines and carriage returns, the in-memory representation could omit the CRs and so be smaller. If you turn the whole thing into a data structure, then it could be larger or smaller, depending on how clever the data structure is and whether or not it efficiently encodes the values in the CSV.
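To make that concrete in Rust terms (the record layout is just an illustration; sizes are for a typical 64-bit target):

    use std::mem::size_of;

    // A CSV line like "1234,5678\n" is 10 bytes of text. The same record as a
    // plain struct is 8 bytes; parsed into two heap-allocated Strings it costs
    // 2 x 24 bytes of String headers before counting the character data.
    struct Row {
        id: u32,
        value: u32,
    }

    fn main() {
        println!("Row:    {} bytes", size_of::<Row>());    // 8
        println!("String: {} bytes", size_of::<String>()); // 24 on 64-bit
    }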
> My hunch is they had only ever done this using object oriented modeling of the data.
Yes, that would be my guess as well.
> Worse, usually in something like python, where everything is default boxed to hell and back.
And you often have multiple representations of the same data because not every library uses the same conventions.
Usually one only bothers with the intrinsic level stuff for the use cases you're saying. E.g. video encoders/decoders needing hyper-optimized, per architecture loops for the heavy lifting where relying on the high level SIMD abstractions can leave cycles on the table over directly targeting specific architectures. If you're just processing a lot of data in bulk with no real time requirements, high level portable SIMD is usually more than good enough.
To that end, even "calling math functions" is something that a surprising number of developers don't do. Certainly not with the standard high-level data types that people often try to write their software in terms of. No?
By "calling math functions" I mean things like:
let x = 5.0f64;
let result = x.sqrt();
Where most CPUs have a sqrt instruction, but the program will automatically compile with a (good) software substitution for targets that don't. It's very similar with portable SIMD: the high-level call gets mapped to whatever the target best supports, automatically. Neither SIMD nor these kinds of math functions work automatically with custom high-level data types. The only way to play for those is to write the object to have custom methods which break it down to the basic types, so the compiler knows what you want the complex type's behavior to be. If you can't code that, then there isn't much you can do with the object, regardless of SIMD. With intrinsics you need to go a step further beyond all that and directly tell the compiler what specific CPU instructions should be used for each step (and make sure that is done safely, for the remaining unsafe operations). Is that not the case?
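A toy example of that last point; the Measurement type and its method are hypothetical, but they show why the compiler can only pick the hardware sqrt (or its software fallback) once the custom type's method bottoms out in a primitive float operation:

    // Hypothetical wrapper type: the hardware sqrt instruction (or the
    // software fallback on targets without one) is only reachable because
    // the method reduces to f64::sqrt on the primitive field.
    #[derive(Clone, Copy)]
    struct Measurement(f64);

    impl Measurement {
        fn sqrt(self) -> f64 {
            self.0.sqrt()
        }
    }

    fn main() {
        let m = Measurement(25.0);
        println!("{}", m.sqrt()); // 5
    }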
(That said, autovectorization should work, and fixed-width SIMD should map to RVV as well as possible, though of course it misses out on perf if run on wider-than-minimum hardware without a native build.)
Also RISC-V, where you can't even probe for extension support in user space unfortunately.
[1]: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s...