Also, that parallel Fibonacci strikes me as a very bad example. Usually we use parallelism because we want to make something run faster: you can probably add 100,000 pairs of numbers in the time a context switch costs, so spawning a task per addition certainly takes longer. More complicated, likely buggier, slower: what's to like about it?
You can get very consistent results with something that works like the Executor in Java, provided your work can be batched into tasks that take longer to process than a context switch and you have a good heuristic for picking the thread count. Usually there is a wide range over which you get decent performance, say 15-100 threads, but I've often seen work-stealing systems use only 2 or 3 threads when 16 cores are available.
On the utilization and context-switch concerns, I feel they typically stem from centralized queues or allocator contention in child-stealing systems (like TBB). Cactus uses continuation-stealing instead: on FORK, the child runs immediately as a normal function call, and the parent's continuation (just 24 bytes: RBP, RSP, and return address) goes onto a per-worker deque.
If nobody steals it, the parent reclaims it at JOIN with an atomic decrement. No allocation, no stack switch, no context switch on the fast path. A stack switch only happens when a thief actually steals work and grabs a recycled slab from the pool.
The other scalability killer is usually lock-based synchronization at join points. I tried to avoid this by using atomic counters for the worker/thief handoff instead of mutexes. You can test scaling yourself: CACTUS_NPROCS=1 build/cc/nqueens 14 vs CACTUS_NPROCS=N build/cc/nqueens 14.
That said, I totally agree an Executor model is the right tool for flat workloads. I built this specifically for recursive divide-and-conquer (game trees, mergesort, mesh refinement) where you'd otherwise have to manually flatten the recursion or risk deadlocking a fixed-size pool.