Observed roughly 5x wall-clock improvement for implementation work. What took Claude 3–4 minutes finished in under a minute. Not a controlled benchmark — just consistent observation across a dozen tasks today.
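If you want to sanity-check numbers like that on your own tasks, a minimal timing harness is all it takes. Rough sketch, assuming hypothetical claude/spark callables that stand in for however you actually drive each model:

    import time

    def time_run(run, prompt):
        # Wall-clock a single task. `run` is whatever drives the model:
        # a CLI wrapper, an API call, an editor agent.
        start = time.perf_counter()
        run(prompt)
        return time.perf_counter() - start

    # Per-task speedup; compare medians rather than means, since one
    # slow outlier skews the mean badly.
    # speedups = [time_run(claude, p) / time_run(spark, p) for p in tasks]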
That the 5x sits between the 1.37x (SWE-Bench at matched accuracy) and the headline 15x makes sense to me. Benchmarks isolate problem-solving. Real coding sessions are dominated by "here's a clear spec, write it" work, where raw throughput matters more than reasoning depth. That's exactly where Spark pulls ahead the most.
To solarkraft's point: what surprised me too was that quality didn't noticeably degrade. Same speed gain with comparable accuracy, at least for well-defined tasks. I expected a tradeoff and didn't find one for this class of work.
The gap closes fast for anything requiring architectural reasoning across the full codebase. When the problem is ambiguous, Claude still wins clearly.
What I'd recommend: let Claude handle all the reasoning (architecture, tradeoff analysis, spec writing), then hand Spark a fully defined task and force it to just execute.
Don't let Spark think. Let it build. The combined throughput is significantly higher than either model alone.
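In practice that split is just a two-stage pipeline. A minimal sketch, with call_model as a hypothetical stand-in for whatever API or CLI actually fronts each model:

    def call_model(model: str, prompt: str) -> str:
        # Hypothetical stand-in: wire this up to whatever actually
        # fronts each model (API client, CLI, editor agent).
        raise NotImplementedError

    def build_feature(request: str) -> str:
        # Stage 1: Claude does the open-ended reasoning and emits a
        # fully specified, unambiguous implementation spec.
        spec = call_model(
            "claude",
            "Design the approach and write a complete implementation "
            "spec for: " + request,
        )
        # Stage 2: Spark gets zero latitude: execute the spec as written.
        return call_model(
            "spark",
            "Implement exactly this spec. Do not redesign anything:\n" + spec,
        )

The hard boundary between the two prompts is the point: any ambiguity gets resolved in stage 1, where the slower model's reasoning is worth the wall-clock cost.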
This has been the case for people who buy into the hype without actually using the products; people who do use them seem pretty disillusioned by all the claims. The only somewhat reliable method is to test these things against your own use case.
That said, I always expected Spark's tradeoff to be accuracy vs. speed. That it's still significantly faster at the same accuracy is wild; I never expected that.