The author should read https://webkit.org/blog/6161/locking-in-webkit/ so that they understand what they are talking about.
WebKit does it right in the sense that:
- It has an optimal amount of spinning
- Threads wait (instead of spinning) if the lock is not available immediately-ish
And we know that the algorithms are optimal based on rigorous experiments.
> - It has an optimal amount of spinning
No it isn't: it has a fixed number of yields, which takes a very different amount of time on different CPUs.
> Threads wait (instead of spinning) if the lock is not available immediately-ish
They use parking lots, which is one way to do futexes (in fact, WaitOnAddress is implemented similarly). And no, if you read the code, they do spin. Worse, they actually yield the thread before properly parking.
You say this with zero data.
I know that yielding 40 times is optimal for WebKit because I measured it. In fact it was re-measured many times because folks like you would doubt that it could be optimal, suggest something different, and then again the 40 yields would be shown to be optimal.
> And no, if you read the code, they do spin. Worse, they actually yield the thread before properly parking.
Threads wait if the lock is not available immediately-ish.
Yes, they spin by yielding. Spinning by pausing or doing anything else results in worse performance. We measured this countless times.
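To be concrete about the shape being argued over, here is a rough sketch of a lock that spins by yielding a bounded number of times and then waits in the kernel. This is a Linux-flavoured illustration that uses a raw futex as the parking fallback, not WebKit's actual WTF::Lock code; the 40 is the constant from the measurements mentioned above, and everything else is illustrative.

    // Sketch only: spin by yielding, then sleep in the kernel (Linux-specific).
    #include <stdatomic.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    typedef struct { atomic_int state; } lock_t;   // 0 unlocked, 1 locked, 2 locked with waiters

    static void lock_acquire(lock_t *l) {
        for (int i = 0; i < 40; i++) {              // spin by yielding, not by pausing
            int expected = 0;
            if (atomic_compare_exchange_strong(&l->state, &expected, 1))
                return;
            sched_yield();
        }
        // Give up spinning: mark the lock contended and sleep until the owner wakes us.
        while (atomic_exchange(&l->state, 2) != 0)
            syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
    }

    static void lock_release(lock_t *l) {
        if (atomic_exchange(&l->state, 0) == 2)     // a waiter may be asleep in the kernel
            syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }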
I think the mistake you’re making is that you’re imagining how locks work. Whereas what I am doing is running rigorous experiments that involve putting WebKit through large-scale tests.
> You say this with zero data.
Wouldn't the null hypothesis be that the same program behaves differently on different CPUs? Is "different people require different amounts of time to run 100m" a statement that requires data?
Or so you assume
> Spinning by pausing or doing anything else results in worse performance. We measured this countless times.
And I've seen the issue in hundreds of captures using a profiler. I suppose we just have different definitions of what "worse performance" is.
> Whereas what I am doing is running rigorous experiments that involve putting WebKit through large-scale tests
Or perhaps the fish was drowned in the stats, or, again, we're looking at different metrics.
You're not including data in your discussion of this topic. Your post included zero data.
My post on WTF locks has tons of data.
So, I'm not assuming; I'm observing.
> And I've seen the issue in hundreds of captures using a profiler. I suppose we just have different definitions of what "worse performance" is.
Nobody cares what you saw in the profiler.
What matters is the performance users experience.
By any metric of observable performance, yielding is the optimal way of spinning.
The direct link to Intel 404s.
https://github.com/golang/go/blob/2bd7f15dd7423b6817939b199c...
https://github.com/golang/go/blob/2bd7f15dd7423b6817939b199c...
For the yield part, I already linked to the part that shows that. Yes, it doesn't call yield if it sees others are parked, but with quick lock/unlock cycles it happens that it sees nobody parked, fails the acquire, and yields directly to the OS. This is not frequent, but frequent enough that it can introduce delay issues.
The only time I've manually written my own spin lock was when I had to coordinate between two different threads, one of which was running 16-bit code, so using any library was out of the question, and even relying on syscalls was sketchy because making sure the 16-bit code is in the right state to make a syscall is itself tricky. Although in this case, since I didn't need to care about things like fairness (only two threads are involved), the spinlock core ended up being simple:

    "thunk_spin:",
    "xchg cx, es:[{in_rv}]",   // atomically swap CX with the shared word
    "test cx, cx",             // did we pull out a non-zero value?
    "jnz thunk_has_data",      // non-zero: the other thread has handed us data
    "pause",                   // spin-wait hint to the CPU
    "jmp thunk_spin",          // zero: keep spinning
    "thunk_has_data:",

The following is an interesting talk where the author used a custom spinlock to significantly speed up a real-time physics solver.
Dennis Gustafsson – Parallelizing the physics solver – BSC 2025 https://www.youtube.com/watch?v=Kvsvd67XUKw
Yes, but sadly not all implementations... The point remains that you should prefer OS primitives when you can: profile first, reduce contention, and only then, maybe, if you really know what you're doing, on a system you mostly know and control, perhaps start doing it yourself. And if you do, the fallback under contention must be the OS primitive.
If you want scheduling, then the scheduler needs to be aware of task dependencies and you must accept that your task will be interrupted.
When a lock is acquired on resource A by the first thread, the second thread that tries to acquire A will have a dependency on the release of A, meaning that it can only be scheduled after the first thread has left the critical section. With a spinlock, the scheduler is not informed of the dependency and thinks that the spinlock is performing real work, which is why it will reschedule waiting threads even if resource A has not been released yet.
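To make the distinction concrete, here is an illustrative Linux-flavoured fragment; lock_word and the function names are mine, not from any particular lock implementation.

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    atomic_int lock_word;   // hypothetical shared lock word, 1 = held, 0 = free

    void wait_spinning(void) {
        // The scheduler just sees a runnable thread burning CPU; it has no idea
        // this thread cannot make progress until the owner writes 0.
        while (atomic_load(&lock_word) != 0)
            ;
    }

    void wait_blocking(void) {
        // The kernel is told "I am blocked until this word changes", so the thread
        // is descheduled and only made runnable again by the owner's FUTEX_WAKE.
        while (atomic_load(&lock_word) != 0)
            syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
    }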
If you do thread pinning and ensure there are fewer threads than CPU cores, but still have other threads be scheduled on those cores, it might still work, but the latency benefits are most likely gone.
Real Windows and Linux don't have this problem; only Wine's "malloc" in a DLL does.
Bug reports resulted in finger-pointing and denial.[1] "Unconfirmed", despite showing debugger output.
One area where I found spinlocks to be useful is in multithreaded audio applications. Audio threads are not supposed to be preempted by other user space threads because otherwise they may not complete in time, leading to audio glitches. The threads have a very high priority (or have a special scheduling policy) and may be pinned to different CPU cores.
For example, multiple audio threads might read from the same sample buffer, whose content is occasionally modified. In that case, you could use a reader-writer-spinlock where multiple readers would be able to progress in parallel without blocking each other. Only a writer would block other threads.
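As a rough sketch of what I mean (the names are mine, and a real implementation would want explicit memory ordering and probably writer priority):

    #include <stdatomic.h>
    #include <immintrin.h>             // _mm_pause (x86)

    // count >= 0: number of active readers; count == -1: a writer holds the lock
    typedef struct { atomic_int count; } rw_spin_t;

    static void read_lock(rw_spin_t *l) {
        for (;;) {
            int c = atomic_load(&l->count);
            if (c >= 0 && atomic_compare_exchange_weak(&l->count, &c, c + 1))
                return;                // joined the readers; nobody ever yields here
            _mm_pause();
        }
    }
    static void read_unlock(rw_spin_t *l) { atomic_fetch_sub(&l->count, 1); }

    static void write_lock(rw_spin_t *l) {
        for (;;) {
            int expected = 0;          // only acquire when no readers are active
            if (atomic_compare_exchange_weak(&l->count, &expected, -1))
                return;
            _mm_pause();
        }
    }
    static void write_unlock(rw_spin_t *l) { atomic_store(&l->count, 0); }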
What would be the potential problems in that scenario?
My example above was rather about the DSP graphs themselves, which are computed in parallel. These require access to shared resources like audio buffers, but under no circumstances should they give up their timeslice and yield back to the scheduler. That's why we're using reader-writer spinlocks to synchronize access to these resources. I really don't see any other practical alternative... Any ideas?
Edit: Maybe I'm confusing terminology. What I'm doing is looping until other threads have returned memory, but I'm also doing a short sleep during each loop iteration.
rdtsc may execute out of order, so sometimes an lfence (previously cpuid) is used before it; there is also rdtscp.
See https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
And just because rdtsc is constant doesn't mean the processor clock is constant; that could be fluctuating.
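For reference, the usual fencing patterns look roughly like this (x86-specific sketch; the function names are mine):

    #include <x86intrin.h>             // __rdtsc, __rdtscp, _mm_lfence

    // lfence before rdtsc: earlier instructions must finish executing before we read the counter.
    static inline unsigned long long tsc_start(void) {
        _mm_lfence();
        return __rdtsc();
    }

    // rdtscp waits for prior instructions to complete; the trailing lfence stops
    // later instructions from starting before the counter has been read.
    static inline unsigned long long tsc_stop(void) {
        unsigned int aux;
        unsigned long long t = __rdtscp(&aux);
        _mm_lfence();
        return t;
    }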
I found somewhere (https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpu...) that the pause instruction has this wild cycle-count difference between different CPUs and it caused some grief; I had no idea. I stopped doing low-level coding a while back.
The issue isn’t just tearing but also memory order. On some architectures you can read a valid but out of date value in Thread A after Thread B has updated that value. (Memory order is mentioned later in the article, to be fair.)
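A small illustration of that point with C11 atomics (hypothetical names):

    #include <stdatomic.h>

    int payload;                       // ordinary data written before the flag
    atomic_int ready;                  // publication flag

    void thread_b_publish(int v) {
        payload = v;
        // release: the payload write cannot be reordered past this store
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int thread_a_consume(void) {
        // acquire: once we observe ready == 1, we are guaranteed to see the new payload.
        // With a relaxed load here, Thread A could see the flag set yet still read a
        // stale payload on weakly ordered CPUs (no tearing, just an out-of-date value).
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;
        return payload;
    }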
Stealing is an example of an unfairness which can significantly improve overall performance.
I suppose someplace someone is running an embedded system without an OS on such a processor - but I'd expect they are still using extra cores and so have all of the above tricks someplace.
Yes, it matters on MS-DOS-like OS designs, like some embedded deployments, and that is about it.
It is even impossible to guarantee a process doesn't get rescheduled onto another CPU, with the performance impact that entails, unless the process explicitly sets its CPU affinity.
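Setting the affinity itself is simple enough; a Linux-specific sketch, with an arbitrary core number:

    #define _GNU_SOURCE
    #include <sched.h>

    // Restrict the calling thread to a single core so the kernel cannot migrate it.
    // Returns 0 on success, -1 on error (Linux-specific).
    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }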
The way it is measured is mostly ideal, assuming that threads run to completion without any of those side effects taking place.
Turns out the first scenario is rare outside of embedded or OS development. The second scenario defeats the purpose because you're doing the same thing a mutex would be doing. It's not like mutexes were made slow on purpose to bully people. They're actually pretty fast.
It's a weak signal but the reasoning is sound.
Code has a minimum complexity to solve the problem
You may have some simple chemical life needs, and life may have some other simple chemical it can use to get the needed one, but the processing steps are complex and limited by physics itself. Evolution almost always finds the path of minimum activation energy to let these reactions occur. Trying to make the process simpler just doesn't get you what you need.
Everyone's computers hang or get slow some of the time. Probably all of our locks have bugs in them, but good luck getting to the bottom of that; right now the industry is barely capable of picking a sorting algorithm that actually works.