Note: this is not the recommended way to use Rust feature flags. Features are additive, so the correct way to make a `no_std`-compatible crate is to have a `std` feature flag that conditionally enables use of the standard library.
Referring to Effective Rust:
> Note that there's a trap for the unwary here: don't have a no_std feature that disables functionality requiring std (or a no_alloc feature similarly). As explained in Item 26, features need to be additive, and there's no way to combine two users of the crate where one configures no_std and one doesn't—the former will trigger the removal of code that the latter relies on. - https://www.lurklurk.org/effective-rust/no-std.html
You can with Rust features. They allow a library to conditionally compile certain parts of its code, and users of the library can decide which features to enable.
If a library only uses the standard library in code gated by #[cfg(feature = "std")], users can disable that feature and use the library in no_std contexts.
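For example, a minimal sketch of such gating (the crate attribute goes in lib.rs; the function names here are made up for illustration):

    #![cfg_attr(not(feature = "std"), no_std)] // opt out of std unless the feature is enabled

    // Only compiled when the user enables the "std" feature.
    #[cfg(feature = "std")]
    pub fn load(path: &std::path::Path) -> std::io::Result<String> {
        std::fs::read_to_string(path)
    }

    // Core functionality that works with or without std.
    pub fn checksum(data: &[u8]) -> u8 {
        data.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
    }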
But maybe I should first learn Rust before asking.
Consider these common libraries you might use in either a `std` project (PC application, web microservice) or `no_std` project (embedded microcontroller firmware, bootloader, Linux kernel module, blockchain smart contract):
- data encoding (https://crates.io/crates/base64 for instance),
- hashing (SHA2 https://github.com/RustCrypto/hashes/tree/master/sha2),
- data structures (https://github.com/Lokathor/tinyvec)
- time/date manipulation (https://docs.rs/chrono/latest/chrono/)
https://docs.rs/jiff/latest/jiff/tz/index.html#core-only-env...
I think with sloppy/complex code it could start to resemble #ifdef PLATFORM complexity if you do a lot inline, but cargo workspaces are a good way to reduce the blast radius.
This is not in any way the problem.
The reason for having `std` be the feature, rather than `no_std`, is that if one dependent crate disables the default features (because it doesn't rely on the `std` functionality) while another dependent crate does rely on the `std` feature, everything still works: the second crate simply enables the feature again.
If `no_std` were the feature instead, a dependency that sets `no_std` (because it doesn't need the `std` functionality) would break any other crate that does need that functionality, because that other crate has no way to unset a feature that was enabled elsewhere in the dependency graph.
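Concretely, the additive pattern looks something like this ("somelib" and the version are hypothetical). Cargo takes the union of requested features across the dependency graph, so a consumer that keeps the defaults simply turns `std` back on, even if another consumer disabled the defaults:

    # In the library's Cargo.toml: std is an additive, on-by-default feature.
    [features]
    default = ["std"]
    std = []

    # Consumer A (say, firmware) opts out of the defaults in its own Cargo.toml:
    #   [dependencies]
    #   somelib = { version = "1", default-features = false }
    #
    # Consumer B keeps the defaults:
    #   [dependencies]
    #   somelib = "1"
    #
    # When A and B end up in the same build, Cargo unions the requested features,
    # so somelib is compiled with "std" enabled and both consumers keep working.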
As with anything in that regard: profile, profile, profile. Run valgrind, check cache misses, and profile. Calculate the theoretical throughput of the CPU you're working on, i.e. the actual bandwidth of reading/writing RAM with and without caching; that's your high-water mark, and the goal is to get as close as possible to those limits. If you want to start with that, you can do just that: simple reads/writes, profile, then introduce the real functions and structures and try to reclaim as much speed as possible. Graphs over profiling runs always help, and graphs over profiling per commit or PR are even better, so you can tell how you're progressing. But that's just, like, my opinion, man. No right/wrong way; profiling always tells the truth in the end.
tl;dr: read ops for the CPU; profile.
If I'm shopping for a vector library, this is one of the pieces of information that makes the decision easier.
* nalgebra uses fixed-size arrays (so a Vec4 is like [[f32; 4]; 1])
* this library seems to use fields (so a Vec4 is a struct with x,y,z,w fields)
* glam uses SIMD types for some types (so a Vec4 is a __m128)
I think glam might win for some operations, but if you want performance, people usually apply SIMD in the other direction when possible, like:
struct Vec4 { x: __m128, y: __m128, z: __m128, w: __m128 }
According to mathbench-rs[0] (which I looked at after typing this comment...) it looks like nalgebra and ultraviolet have such types. The benchmarks have "N/A" for many of the "wide" nalgebra entries though, which might indicate that nalgebra hasn't implemented many of those functions for "wide" types.

Hint for language designers: when you design a new language, put this stuff, and multidimensional arrays, in the standard library. Multiple incompatible versions of such types are as bad for number-crunching as multiple incompatible string types would be for string manipulation. You want your standard numeric libraries to work on the standard types.
This is part of why Matlab is so successful. You don't have to worry about this stuff.
I think it's from a CS education, which treats the "naturals" as fundamental, vs. an engineering background, where the "reals" are fundamental and matrix math is _essential_, and people live on one side of this fence.
- Floating point operations used to be slow. On early PCs, you didn't even have a floating point unit. AutoCAD on DOS required an FPU, and this was controversial at the time.
- Using the FPU inside system code was a no-no for a long time. Floating point usage inside the Linux kernel is still strongly discouraged.[1] System programmers tended not to think in terms of floating point.
- Attempts to put multidimensional arrays in modern languages tend to result in bikeshedding. If a language has array slices, some people want multidimensional slices. That requires "stride" fields on slices, which slows down slice indexing. Now there are two factions arguing. Rust and Go both churned on this in the early days, and neither came out with a good language-level solution. It's embarrassing that FORTRAN has better multidimensional arrays.
Now that the AI world, the GPU world, and the graphics world all run on floating point arrays, it's time to get past that.
[1] https://www.kernel.org/doc/html/next/core-api/floating-point...
NaN doesn't have this optimization because the optimization isn't generic across all possible representations. Trying to make it generic gets quite complex, and floats might have many such representations (e.g., you want NaN to be optimized, someone else needs NaN and thinks infinity works better, etc.). In other words:
NonZero is primarily for the size optimization of Option<number>. If you want sentinels, then write your own wrapper; it's not hard.
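To make that concrete, a small sketch (MaybeF32 is just an illustrative wrapper, not a standard type):

    use core::mem::size_of;
    use core::num::NonZeroU32;

    fn main() {
        // The niche optimization: Option<NonZeroU32> is the same size as u32,
        // because the all-zero bit pattern is free to represent None.
        assert_eq!(size_of::<Option<NonZeroU32>>(), size_of::<u32>());

        // A plain Option<u32> needs extra space for the discriminant.
        assert!(size_of::<Option<u32>>() > size_of::<u32>());

        // A hand-rolled sentinel wrapper for floats, as suggested above:
        // here NaN is (arbitrarily) chosen as the "no value" representation.
        #[derive(Clone, Copy)]
        struct MaybeF32(f32);

        impl MaybeF32 {
            const NONE: MaybeF32 = MaybeF32(f32::NAN);
            fn get(self) -> Option<f32> {
                if self.0.is_nan() { None } else { Some(self.0) }
            }
        }

        // Same size as a bare f32, since the sentinel lives in the value itself.
        assert_eq!(size_of::<MaybeF32>(), size_of::<f32>());
        assert!(MaybeF32::NONE.get().is_none());
    }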
If your vector is generic (using C++ syntax here), vec<3, float>, then you can just plug in vec<3, float4> and solve 4 vector math problems at a time.
It helps tremendously if your interfaces already take N inputs at a time, so then instead of iterating one at a time you do 4 at a time.
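Something like this sketch, using nightly std::simd's f32x4 as a stand-in for the float4 above (the Vec3 type here is ad hoc, not from any particular library):

    #![feature(portable_simd)]
    use std::simd::f32x4;

    // A vector type generic over its scalar, so the "scalar" can itself be a
    // 4-wide SIMD value: each lane is an independent problem.
    #[derive(Clone, Copy)]
    struct Vec3<T> {
        x: T,
        y: T,
        z: T,
    }

    impl<T: Copy + std::ops::Add<Output = T> + std::ops::Mul<Output = T>> Vec3<T> {
        fn dot(self, other: Self) -> T {
            self.x * other.x + self.y * other.y + self.z * other.z
        }
    }

    fn main() {
        // Four independent dot products computed at once, one per lane.
        let a = Vec3 {
            x: f32x4::from_array([1.0, 2.0, 3.0, 4.0]),
            y: f32x4::splat(1.0),
            z: f32x4::splat(2.0),
        };
        let b = Vec3 {
            x: f32x4::splat(3.0),
            y: f32x4::splat(1.0),
            z: f32x4::splat(0.5),
        };
        println!("{:?}", a.dot(b).to_array());
    }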
> If your vector is generic (using C++ syntax here), vec<3, float>, then you can just plug in vec<3, float4> and solve 4 vector math problems at a time.
Yeah, that's the idea, but for anyone reading, the main complication is when you need to branch. There are usually multiple ways to handle branching (e.g., sometimes it's worth adding a "fast path" for when all the branches are true, and sometimes it isn't; sometimes you should turn a branch into branchless code and sometimes you shouldn't) and AVX-512 adds even more ways to do it.
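For instance, a per-element branch can become a mask-and-select over whole lanes; a rough sketch with nightly std::simd (the clamp-to-zero condition is just an example):

    #![feature(portable_simd)]
    use std::simd::{cmp::SimdPartialOrd, f32x8};

    // Scalar version: one branch per element.
    fn clamp_scalar(xs: &mut [f32]) {
        for x in xs {
            if *x < 0.0 {
                *x = 0.0;
            }
        }
    }

    // Branchless version: build a lane mask and select, no per-element branch.
    fn clamp_simd(xs: &mut [f32]) {
        let mut chunks = xs.chunks_exact_mut(8);
        for chunk in &mut chunks {
            let v = f32x8::from_slice(chunk);
            let below = v.simd_lt(f32x8::splat(0.0));
            below.select(f32x8::splat(0.0), v).copy_to_slice(chunk);
        }
        clamp_scalar(chunks.into_remainder()); // scalar path for the tail
    }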
I did a small experiment comparing 6 possible implementations of the n-body [0] update loop: https://godbolt.org/z/sfehEfPGT
The implementations are:
* AOS: a simple scalar implementation with coordinates stored in an array of structs
* SOA: a simple scalar implementation with coordinates stored as a struct of arrays
* float3: uses a struct of three floats as a vector type
* float4: uses a struct of four floats as a vector type, ignores the last element
* vec4: like float4, but using a generic SIMD abstraction (so basically what glam does)
* floats3: attempts to do SOA with nice syntax. The floats3 type has three arrays of floats, and there are operations to extract and store a float3 from a given index.
Since these abstractions are often used in games, I'll start off by looking at what the compiler produces when targeting Zen5 with -O3 -ffast-math:
* Zen5 O3 ffast-math:
AOS: gcc: 11119 ~SSE clang: 3688 AVX512, but quite messy
SOA: gcc: 1283 AVX512 clang: 1202 AVX512
float3: gcc: 11050 ~SSE clang: 10894 ~SSE
float4: gcc: 8646 ~SSE clang: 10815 ~SSE
vec4: gcc: 7913 ~SSE clang: 8196 ~SSE
floats3: gcc: 1284 AVX512 clang: 13351 ~SSE
The numbers next to the compilers are the cycle estimates from the llvm-mca model of Zen5 for processing 1024 elements.
AVX512 indicates that the compiler was able to vectorize the loop with AVX512, while ~SSE means it only achieved partial vectorization with SSE.

Now let's also look at a different ISA, this time the RISC-V Vector extension:
* P670 2xVLEN O3 ffast-math:
AOS: gcc: 17445 clang: 3357 RVV
SOA: gcc: 3355 RVV clang: 3334 RVV
float3: gcc: 17445 clang: 17449
float4: gcc: 25668 RVV128 clang: 17470 RVV128
vec4: gcc: 45091 RVV128 clang: 23111 RVV128
floats3: gcc: 3333 RVV clang: 17446
This time the llvm-mca model for the SiFive-P670 was used, but I pretended it has 256-bit vectors instead of 128-bit ones, as the vector length is transparent to the codegen and this amplifies the effect I'd like to show.
RVV means the loop could be fully vectorized, while RVV128 is similar to ~SSE and means it could only take advantage of the lower 128 bits of the vector registers.

So if you are using such vector types to do computations in loops, you are likely to end up preventing your compiler from optimizing them for modern hardware. In general, writing simple SOA scalar code seems to vectorize best, as long as you make sure the compiler isn't confused by aliasing. Even the plain old AOS scalar code can be vectorized by modern clang (though not by gcc), but sadly not the float3/float4 implementations, which should be very similar. Modern ISAs like NEON/SVE/RVV have more complex vector loads/stores that let you retrieve data efficiently even from a traditionally bad data layout like AOS. You can dress up the SOA code to make it a bit nicer; unfortunately, my attempt with floats3 currently only works properly with gcc.
Below are the results when compiling without -ffast-math:
* Zen5 O3:
AOS: gcc: 11819 ~SSE clang: 10788 ~SSE
SOA: gcc: 4146 AVX512 clang: 13734 AVX512
float3: gcc: 11826 ~SSE clang: 11499 ~SSE
float4: gcc: 8662 ~SSE clang: 11810 ~SSE
vec4: gcc: 8575 ~SSE clang: 7451 ~SSE
floats3: gcc: 4148 AVX512 clang: 14367 ~SSE
* P670 2xVLEN O3:
AOS: gcc: 17464 RVV64 clang: 6122 RVV
SOA: gcc: 7140 RVV clang: 6118 RVV
float3: gcc: 17445 clang: 17464 RVV64
float4: gcc: 25665 RVV128 clang: 19184 RVV128
vec4: gcc: 17463 RVV128 clang: 56868 RVV128
floats3: gcc: 7140 RVV clang: 17444
Weirdly, clang seems to be struggling with SOA here, and overall vec4 looks like the best performance tradeoff for x86.
Still, with proper SOA (and I bet you could coax clang into generating it as well), you can get a 2x performance improvement.
Additionally, vec4 performs horribly with current compilers for VLA SIMD ISAs.

I'll try to experiment with some real-world code, if I can find some that is bottlenecked by such types.
[0] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
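For readers who want the AOS/SOA layouts from the discussion above spelled out, here is a rough Rust rendition (the actual experiment is in the godbolt link; names here are illustrative):

    // AOS: array of structs; each body's coordinates sit next to each other.
    struct BodyAos {
        x: f32,
        y: f32,
        z: f32,
    }

    fn scale_aos(bodies: &mut [BodyAos], k: f32) {
        for b in bodies {
            b.x *= k;
            b.y *= k;
            b.z *= k;
        }
    }

    // SOA: struct of arrays; all x's are contiguous, which is what wide vector
    // units want to load, so loops like these tend to auto-vectorize well.
    struct BodiesSoa {
        x: Vec<f32>,
        y: Vec<f32>,
        z: Vec<f32>,
    }

    fn scale_soa(bodies: &mut BodiesSoa, k: f32) {
        for x in &mut bodies.x { *x *= k; }
        for y in &mut bodies.y { *y *= k; }
        for z in &mut bodies.z { *z *= k; }
    }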
Yes, I don't think anyone using them is depending on autovectorization.
What was/is not pretty is adding some generics so the same code can work with f32 and f64. I did manage to get something that works but it's so ugly that I didn't want to release it. I'm sure it could be improved but what I got works well enough and I haven't needed to touch it in a few years.
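For context, the ugliness tends to come from bounds like these; a sketch with nightly std::simd (not the code referred to above):

    #![feature(portable_simd)]
    use std::ops::{Add, Mul};
    use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

    // Generic over both the element type (f32 or f64) and the lane count,
    // which drags in a pile of where-clauses just to allow `a * b + c`.
    fn mul_add<T, const N: usize>(a: Simd<T, N>, b: Simd<T, N>, c: Simd<T, N>) -> Simd<T, N>
    where
        T: SimdElement,
        LaneCount<N>: SupportedLaneCount,
        Simd<T, N>: Add<Output = Simd<T, N>> + Mul<Output = Simd<T, N>>,
    {
        a * b + c
    }

    fn main() {
        let x = mul_add(Simd::<f32, 4>::splat(1.0), Simd::splat(2.0), Simd::splat(3.0));
        let y = mul_add(Simd::<f64, 2>::splat(1.0), Simd::splat(2.0), Simd::splat(3.0));
        println!("{:?} {:?}", x.to_array(), y.to_array());
    }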
std::simd is quite pleasant to work with and, most importantly, it allows a zero-cost fallback to CPU-specific intrinsics when needed.
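The shape is roughly: write the portable version with std::simd and drop to arch intrinsics only on hot paths for specific targets. A sketch (nightly std::simd; function names are illustrative):

    #![feature(portable_simd)]
    use std::simd::{f32x4, num::SimdFloat};

    // Portable path: works on any target std::simd supports.
    fn dot4_portable(a: [f32; 4], b: [f32; 4]) -> f32 {
        (f32x4::from_array(a) * f32x4::from_array(b)).reduce_sum()
    }

    // Hand-written fallback for one specific architecture, used where it matters.
    #[cfg(target_arch = "x86_64")]
    fn dot4_sse(a: [f32; 4], b: [f32; 4]) -> f32 {
        use std::arch::x86_64::{_mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps};
        unsafe {
            let prod = _mm_mul_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
            let mut out = [0.0f32; 4];
            _mm_storeu_ps(out.as_mut_ptr(), prod);
            out.iter().sum()
        }
    }

    fn main() {
        let (a, b) = ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]);
        println!("{}", dot4_portable(a, b));
        #[cfg(target_arch = "x86_64")]
        println!("{}", dot4_sse(a, b));
    }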
let mut d = a.dot(b);
d.normalize();
I think I found a bug in your readme. A dot product should return a scalar. I don't know Rust at all, but I've never met a language that has a normalize method for scalars.

Why not use #[rustfmt::skip]?