Also, that parallel Fibonacci strikes me as a very bad example. Usually we use parallelism because we want to make something run faster: you can probably add 100,000 pairs of numbers in the time a context switch costs, so spawning a task per addition certainly takes longer. More complicated, likely buggier, slower: what's to like about it?
You can get very consistent results with something that works like the Executor in Java, provided your work can be batched into tasks that take longer to process than a context switch and you have a good heuristic for picking the thread count. Usually there is a wide range over which you get decent performance, say 15-100 threads, but I've often seen work-stealing systems use only 2 or 3 threads when 16 cores are available.
On the utilization and context-switch concerns, I feel they typically stem from centralized queues or allocator contention in child-stealing systems (like TBB). Cactus uses continuation-stealing instead: on FORK, the child runs immediately as a normal function call, and the parent's continuation (just 24 bytes: RBP, RSP, and return address) goes onto a per-worker deque.
If nobody steals it, the parent reclaims it at JOIN with an atomic decrement. No allocation, no stack switch, no context switch on the fast path. A stack switch only happens when a thief actually steals work and grabs a recycled slab from the pool.
The other scalability killer is usually lock-based synchronization at join points. I tried to avoid this by using atomic counters for the worker/thief handoff instead of mutexes. You can test scaling yourself: CACTUS_NPROCS=1 build/cc/nqueens 14 vs CACTUS_NPROCS=N build/cc/nqueens 14.
That said, I totally agree an Executor model is the right tool for flat workloads. I built this specifically for recursive divide-and-conquer (game trees, mergesort, mesh refinement) where you'd otherwise have to manually flatten the recursion or risk deadlocking a fixed-size pool.