“This is why Linux now provides rseq() which is a much more enlightened solution. With restartable sequences, you actually can get rid of both the mutex and atomics, while the OS continues to fully abstract scheduling. The way it works is you advise the kernel whenever your program enters a critical section of code that you don't want interrupted. It's probably going to be maybe 10 assembly instructions tops. The first assembly opcode should be a move instruction that sets the rseq_cs field. The last instruction needs to be the thing that makes the modification to your global data structure. Think of it sort of like a really tiny database transaction. What makes it go fast, is that the bidirectional communication with the kernel happens via shared memory.”
But agreeing with you, I've done big optimization stuff for multicore servers (not as many cores, but same kind of work) and my workstation was something small with not even the same os. I don't need the big machine on my desk to understand the concepts. I just need the big machine to check my work. For me, that's always been a production machine, sometimes a production machine taken out of rotation for pre-validation before running on production load. I guess I should mention, I work on applications specifically, and libraries and kernels as it relates to making whatever my application is work better. I also don't have a problem with pinning threads to cpus... but my applications are usually one big program that fills the system. Someone writing a general purpose library has a harder time.
Of course, if you want to do this kind of work and you don't have your own production load, you're going to have to borrow, rent, or buy a big machine. It doesn't need to be your workstation though. I hate working with cloud nonsense, but if your tests are short, and you do the upfront work to make your images start fast, you can probably save a lot of money by renting spot instances when testing ... I don't know if you can do spot instances of bare metal though, so you're probably stuck with vm overhead.
https://github.com/compudj/librseq
This has helpers for common use cases like counters and linked lists. You shouldn't need to write assembly at all to use rseq in most applications.
The key insight is that the preempter can introspect the program counter of the code being preempted (which is now stable since it was preempted) and act accordingly. The simplest mechanism is to reset their program counter if in a critical section. The more generic mechanism is to jump them to a supplied address. This allows you to do things like hard abort and more.
You can further remove the need for the preempter to understand the preempted code by having the preempted code create a self-introspection code snippet and supplying that with the program counter at preemption. So the preempter just vectors them to their own code which knows how to interpret its own state at any preemption point.