1. Write a function implementing the happy-path SIMD
2. (eventually) Write a function implementing the cleanup code
3. (eventually) Implement the final result by appropriately calling into (1) and (2)
That gives you a few benefits. Notably, every problem in TFA (other than the inlining limit) magically goes away. Also, in the vast majority of cases your cleanup code is itself a valid, easy-to-understand solution to the whole problem, so for free you get a correct reference implementation to fuzz your fast implementation against (a testing pattern I highly recommend for all optimized code).
Most importantly though, 99 times out of 100 I actually don't need (2) or (3). You can push the padding and alignment constraints all the way to the top level of your program, and the rest of your code can work on slices of vectors instead of slices of floats (or whatever other safe representation you want to choose). The resulting code is much easier to maintain, and the compiled binary is usually smaller and faster.
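To make the shape of that concrete, here's a rough sketch (the sum-of-squares kernel and all the names are made up for illustration; the fixed 8-wide inner loop stands in for whatever real SIMD you'd write for (1)):

    // (2) Cleanup / reference path: trivially correct for any length.
    fn sum_sq_scalar(xs: &[f32]) -> f32 {
        xs.iter().map(|x| x * x).sum()
    }

    // (1) Happy path: only handles full 8-wide chunks; the indexed loop with
    // a fixed trip count is a stand-in for real intrinsics / std::simd.
    fn sum_sq_wide(chunks: std::slice::ChunksExact<'_, f32>) -> f32 {
        let mut acc = [0.0f32; 8];
        for c in chunks {
            for i in 0..8 {
                acc[i] += c[i] * c[i];
            }
        }
        acc.iter().sum()
    }

    // (3) Glue: the bulk goes to (1), the tail goes to (2).
    fn sum_sq(xs: &[f32]) -> f32 {
        let chunks = xs.chunks_exact(8);
        let tail = chunks.remainder();
        sum_sq_wide(chunks) + sum_sq_scalar(tail)
    }

    #[test]
    fn fast_matches_reference() {
        // Stand-in for a real fuzzer: compare the fast path against the
        // obviously correct one on a few awkward lengths. Small integer
        // inputs keep the float sums exact, so direct equality is fine here.
        for n in [0, 1, 7, 8, 9, 100] {
            let xs: Vec<f32> = (0..n).map(|i| i as f32).collect();
            assert_eq!(sum_sq(&xs), sum_sq_scalar(&xs));
        }
    }

Swap the comparison loop for a proper fuzzer or property test once the kernel gets interesting; the structure stays the same.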
You cannot work on SIMD micro-optimizations without assessing the final compiler codegen. Compilers can easily undo your optimization effort once you move your code from an isolated test case into the actual binary that gets released.
Otherwise, profilers don't really help much if you don't know what you are looking for. And to know what to look for, you really need to understand how the CPU internals work. There's no way around it. Switch to another CPU and you may find that your optimization scales completely differently.
Most existing CPUs can execute at least 2 conditional jumps a.k.a. branches per clock cycle, if they are predicted as not taken.
Such CPUs can execute only one conditional jump per clock cycle when it is predicted as taken, which also means that in a given clock cycle they can speculatively execute instructions past at most one conditional jump that is predicted as taken.
So in most CPUs at least 2 conditional jumps can be predicted per clock cycle, but whether they can also be executed depends on the result of the prediction.
As mentioned in an endnote of the article, the latest generation of CPUs from multiple companies, including AMD Zen 5, has gained the ability to execute 2 taken-predicted branches in a single clock cycle (and with it the ability to speculatively execute instructions past the second taken branch).
Because of this difference between taken and not-taken branches, it is always good for a compiler or for a human programmer to generate code such that all conditional jumps (except those that terminate loops) are statically predicted as not taken. Both "if ... then ..." and "if ... then ... else ..." statements have alternative implementations that ensure this, and for "switch" a.k.a. "select" a.k.a. "case" statements it is always possible to reorder the tested cases so that all the conditional jumps are statically predicted as not taken.
Unfortunately, in most programming languages it is not possible to give the compiler a hint about whether a tested condition is expected to be true or false. Such a hint could easily be provided in a programming language that has both "if" and "unless" conditional statements or expressions, where "if" could mean that the conditional body is expected not to execute and "unless" could mean that it is expected to execute (for the "else" or "otherwise" branch the expectation would be the opposite).
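For what it's worth, a few compilers do expose this kind of hint as an extension (GCC/Clang's __builtin_expect, C++20's [[likely]]/[[unlikely]]), and in Rust the closest widely available knob is the #[cold] attribute on a function. A rough sketch, with the function and the condition made up for illustration:

    // Pulling the rare arm into a #[cold] function tells the compiler the call
    // is not expected to happen, which usually nudges it to lay out the common
    // case as the fall-through path and move the rare code out of line.
    #[cold]
    #[inline(never)]
    fn slow_rare_case(x: u64) -> u64 {
        // hypothetical rare/slow handling
        x % 1_000_000_007
    }

    fn step(x: u64) -> u64 {
        if x > u64::MAX / 2 {
            slow_rare_case(x) // expected not to be taken on the hot path
        } else {
            x * 2
        }
    }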
How do you track these optimization attempts, given that you don't commit many dead-end changes? A CS major often doesn't teach you how to experiment, which is a bummer, as there are supposed to be established ways to do it in other scientific disciplines. Maybe you can use something like a Jupyter notebook? But what if the code is in, say, Rust and lives outside the Jupyter kernel?
It might be an hour of additional effort at the beginning of the project, but it will prove insightful in the end.
Benchmarking is the only thing you can really do in some cases.
I think the complete opposite! Benchmarking is very difficult to get right, and in some situations you can't really construct a benchmark that tests what you're interested in improving. Reading assembly can be objective if you know or look up the latency/throughput of each instruction (or have a tool provide it), and you can also use loop throughput analyzers (as the other commenter mentioned) that will try to predict the typical throughput of a given loop.
IMO if you get used to looking at assembly it becomes obvious in the majority of cases whether there's performance left on the table.
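One low-friction way to make that a habit, at least in Rust (just a sketch; the function is a made-up example), is to keep the hot loop in its own non-inlined function so its assembly is easy to find:

    // #[inline(never)] + pub keeps this as a separate, findable symbol in the
    // output of, e.g., cargo rustc --release -- --emit asm, or when pasted
    // into the Compiler Explorer.
    #[inline(never)]
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }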
And "simplicity" does not equal "faster". Although for cold code, reducing its impact on i$ and d$ of the rest of the system is probably smart, so sometimes speed is not the only factor.
If the only metric you look at is the number of instructions, then yes, but that is the wrong way of looking at this type of problem. One can write 50 LoC of heavily optimized SIMD code that outperforms the compiler-generated 10 LoC by 5x.
And once you're done, you can do another round with tools that do more detailed analysis, like https://uica.uops.info
I do this all the time when writing performance-sensitive C# (it's very easy to quickly get compiler disassembly) and it saves me a lot of time.
They're cool tools; I encourage anyone to mess with them.