Observed roughly 5x wall-clock improvement for implementation work. What took Claude 3–4 minutes finished in under a minute. Not a controlled benchmark — just consistent observation across a dozen tasks today.
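If you want to sanity-check numbers like that on your own tasks, a minimal timing harness is all it takes. Rough sketch, assuming hypothetical claude/spark callables that stand in for however you actually drive each model:

    import time

    def time_run(run, prompt):
        # Wall-clock a single task. `run` is whatever drives the model:
        # a CLI wrapper, an API call, an editor agent.
        start = time.perf_counter()
        run(prompt)
        return time.perf_counter() - start

    # Per-task speedup; compare medians rather than means, since one
    # slow outlier skews the mean badly.
    # speedups = [time_run(claude, p) / time_run(spark, p) for p in tasks]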
That the 5x sits between the 1.37x (SWE-Bench at matched accuracy) and the headline 15x makes sense to me. Benchmarks isolate problem-solving. Real coding sessions are dominated by "here's a clear spec, write it" work, where raw throughput matters more than reasoning depth. That's exactly where Spark pulls ahead the most.
To solarkraft's point: what surprised me too was that quality didn't noticeably degrade. Same speed gain with comparable accuracy, at least for well-defined tasks. I expected a tradeoff and didn't find one for this class of work.
The gap closes fast for anything requiring architectural reasoning across the full codebase. When the problem is ambiguous, Claude still wins clearly.
What I'd recommend: let Claude handle all the reasoning (architecture, tradeoff analysis, spec writing), then hand Spark a fully defined task and force it to just execute.
Don't let Spark think. Let it build. The combined throughput is significantly higher than either model alone.
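In practice that split is just a two-stage pipeline. A minimal sketch, with call_model as a hypothetical stand-in for whatever API or CLI actually fronts each model:

    def call_model(model: str, prompt: str) -> str:
        # Hypothetical stand-in: wire this up to whatever actually
        # fronts each model (API client, CLI, editor agent).
        raise NotImplementedError

    def build_feature(request: str) -> str:
        # Stage 1: Claude does the open-ended reasoning and emits a
        # fully specified, unambiguous implementation spec.
        spec = call_model(
            "claude",
            "Design the approach and write a complete implementation "
            "spec for: " + request,
        )
        # Stage 2: Spark gets zero latitude: execute the spec as written.
        return call_model(
            "spark",
            "Implement exactly this spec. Do not redesign anything:\n" + spec,
        )

The hard boundary between the two prompts is the point: any ambiguity gets resolved in stage 1, where the slower model's reasoning is worth the wall-clock cost.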
This has been the case for people who buy into the hype without actually using the products; people who do use them seem pretty disillusioned by all the claims. The only somewhat reliable method is to test these things against your own use case.
That said, I always expected Spark's tradeoff to be accuracy vs. speed. That it's still significantly faster at the same accuracy is wild; I never expected that.