The setup is parallel, distill, and refine: you start with several parallel trajectories instead of one, distill from them, and refine the result into an answer. Instead of taking all trajectories to completion, they distill quickly and refine, so the model gives output fast and yet smarter than a single run.
- paper came out in nov 2025
- three months is a good research-to-production pipeline
- one of the authors is at anthropic
- this approach will definitely burn more tokens than a usual simple run.
- > Anthropic explicitly warns that time to first token might still be slow (or even slower)
Contrary to what people are saying, speculative decoding won't make it smarter or make any difference. Batching could be faster, though not as costly.
Gemini Deepthink and gpt-5.2-pro use the same underlying parallel test-time compute, but they take each trajectory to completion before distilling and refining for the user.
You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.
So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but the number of such chips that you can chain together and still hit the 1000 tokens per second target. Given that Cerebras offers models much larger than 40B at faster speeds https://www.cerebras.ai/pricing#exploration GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size. (≈355B total parameters, 32B active)
This fact really should have given the author pause. It’s hard to take any of his claims seriously in the face of it.
Importantly, Cerebras is offering many models that can't possibly fit on just a single chip, so they have to use some kind of sharding to get them to work at all. You could imagine an even bigger chip that can fit the entire model and run it even faster, but they have to work with what can be manufactured with current technology.
No, it only increases the latency, and does not affect the throughput.
For one thing, going chip-to-chip is not a faultless process and does not operate at the same speed as on-chip communication. So, yes, throughput can be reduced by splitting a computation across two chips of otherwise equal speed.
Edit: I see you’re doing this further down; thumbs up.
With the exception of diffusion language models, which don't work this way but are very niche, language models are autoregressive, which means you do indeed need to process tokens in order.
And that's why model speed is such a big deal: you can't just throw more hardware at the problem, because the problem is latency, not compute.
Chaining chips does not decrease token throughput. In theory, you could run models of any size on Cerebras chips. See for example Groq's (not to be confused with Grok) chips, which only have 230 MB SRAM, yet manage to run Kimi K2.
If a layer completely fits in SRAM (as is probably the case for Cerebras), you only have to communicate the hidden states between chips for each token. The hidden states are very small (7168 floats for DeepSeek-V3.2 https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/c... ), which won't be a bottleneck.
Things get more complicated if a layer does not fit in SRAM, but it still works out fine in the end.
It's completely different during training because of the backward pass and weight update, which put a lot of strain on the inter-chip communication, but during inference even x4 PCIe4.0 is enough to connect GPUs together and not lose speed.
The writer has not heard of continuous batching. This is no longer an issue; it's what makes Claude Code so affordable. https://huggingface.co/blog/continuous_batching
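A toy illustration of the difference (made-up request lengths, not a real scheduler): static batching waits for the slowest request in each wave, while continuous batching refills a slot the moment a request finishes.

```python
# Each request needs `length` decode steps; the batch has `slots` slots.

def static_batching_steps(lengths, slots):
    # Fixed waves: each wave runs as long as its longest request.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    # Refill a slot as soon as its request finishes.
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < slots:
            active.append(pending.pop(0))
        steps += 1
        active = [n - 1 for n in active if n > 1]  # drop finished requests
    return steps

lengths = [10, 1, 1, 1]
print(static_batching_steps(lengths, slots=2))      # long request stalls its wave
print(continuous_batching_steps(lengths, slots=2))  # short requests slip through
```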
A good analogy? I wonder... how do buses work at your place? Do they wait to be at least half-full before departing? I used to do that in the Simutrans game!
Where I'm from, buses usually depart on schedule, whether you get on the bus or not...
[Edit:] Otherwise an insightful article I guess.
>The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.
That might be true today. I think the OpenAI-Cerebras partnership ultimately is going to lead to a paradigm shift because it will be possible to scale these chips up to the point where a model like full Codex-5.3 can run on them and then you'll have a super fast model that makes relatively few errors. A Codex-5.3 model running at these speeds is more than sufficient to actually start replacing customer facing jobs.
The world will be much more interesting when real bespoke hardware built for actual LLM usage comes to market. This means silicon of the SIMD flavour or other variants, but using DRAM so you can pack more tightly.
This is why the article's framing resonates with me. Speed helps when the bottleneck is actually inference latency, but in practice most agentic tasks spend the majority of their wall-clock time on tool execution — API calls, file I/O, waiting on external services. Making the model 6x faster doesn't help much when 70% of the time is spent in tool calls anyway.
The more interesting unlock from Cerebras-class hardware isn't raw speed for a single call, it's the ability to do speculative execution — run multiple candidate paths in parallel and pick the best one. That's where speed translates into accuracy rather than just latency.
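That candidate-selection idea can be sketched as a best-of-N loop; `generate` and `score` here are hypothetical stand-ins for model calls and a verifier:

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(generate, score, prompt, n=4):
    # Launch n candidate generations in parallel, keep the highest-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda i: generate(prompt, i), range(n)))
    return max(candidates, key=score)

# Dummy stand-ins: candidate i repeats the prompt i+1 times; score by length.
print(best_of_n(lambda p, i: p * (i + 1), len, "ab", n=3))
```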
If not, then updates to the current models will become harder and harder.
I'm happy to be wrong but I don't think it's batching improvements.
- the split pricing model makes it hard to tune the model architecture for faster inference, as you need to support both fast and cheap versions.
- the faster the model is, the bigger a problem it becomes that models don't 'understand' time – they sit idle waiting for big compilations, or they issue tool calls sequentially when they ought to have issued them in parallel.
In practice, humans perceive conversational pauses >800ms as awkward. So for a voice pipeline (STT → LLM inference → TTS), you have maybe 400-500ms budget for the LLM portion. At typical Sonnet speeds (~80 tok/s), you get ~35 tokens in that window — barely enough for a sentence. At Cerebras/Groq speeds (1000+ tok/s), you get 400+ tokens, which changes what's architecturally possible.
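The arithmetic behind those numbers, as a sketch (the ~450 ms remaining budget is an assumption):

```python
def tokens_in_budget(tok_per_s, budget_ms=450):
    # Tokens a model can emit inside the remaining latency budget.
    return int(tok_per_s * budget_ms / 1000)

print(tokens_in_budget(80))    # Sonnet-class speed: barely a sentence
print(tokens_in_budget(1000))  # Cerebras/Groq-class speed: a full answer
```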
This is why the small-model vs. big-model tradeoff matters so much for real-time applications. We've found that a well-tuned smaller model with domain-specific context can outperform a larger model for constrained tasks (like navigating a user through a website or answering product questions), while staying within the latency budget. The "council" approach — multiple specialized small agents instead of one large general agent — lets you get both speed and quality.
The speculative decoding point is underrated here. For voice AI specifically, you can predict likely response patterns (greetings, confirmations, common Q&A) and pre-generate TTS for those, then only hit the full inference pipeline for novel queries. Gets you sub-200ms for ~60% of interactions.
(I think they might also be filling the message onto a GPU while you're typing over a websocket or something, but I'm not sure.)
This doesn't make sense.
1. Nvidia already sells e.g. the H100 with 80GB memory, so having 44GB isn't an advance, let alone a differentiator.
2. As I suspect anyone who's played with open-weights models will attest, there's no way that 5.3-Codex-Spark is getting close to top-level performance and being sold in this way while being <44GB. Yes it's weaker, and for sure it's probably a smaller distilled model, but not by ~two orders of magnitude as suggested.
NVIDIA chips use HBM (High Bandwidth Memory) which is a form of DRAM - each bit is stored using a capacitor that has to be read and refreshed.
Most chips have caches on them built out of SRAM - a feedback loop of transistors that store each bit.
The big differences are in access time, power and density: SRAM is ~100 times faster than DRAM but DRAM uses much less power per gigabyte, and DRAM chips are much smaller per gigabyte of stored data.
Most processors have a few MB of SRAM as caches. Cerebras is kind of insane in that they’ve built one massive wafer-scale chip with a comparative ocean of SRAM (44GB).
In theory that gives them a big performance advantage over HBM-based chips.
As with any chip design though, it really isn’t that simple.
To address a large amount of SRAM requires an approximately log(N) amount of logic just to do the addressing (gross approximation). That extra logic takes time for a lookup operation to travel through, hence large = slow.
It’s also not one pool of SRAM. It’s thousands of small SRAM groups spread across the chip, with communication pathways in between.
So to have 44GB of SRAM is a very different architecture to 80GB of (unified) HBM (although even then that’s not true as most chips use multiple external memory interfaces).
HBM is high bandwidth. Whether that’s “fast” or not depends on the trade off between bandwidth and latency.
So, what I’m saying is this is way more complicated than it seems. But overall, yeah, Cerebras’ technical strategy is “big SRAM means more fast”, and they’ve not yet proven whether that’s technically true nor whether it makes economic sense.
I guess you meant to say they are fast because they are small?
The whole reason Cerebras can inference a model thousands of tokens per second is because it hosts the entire model in SRAM.
There are two possible scenarios for Codex Spark:
1. OpenAI designed a model to fit in exactly 44GB.
2. OpenAI designed a model that requires Cerebras to chain multiple wafer-scale chips together, i.e. an 88GB, 132GB, or 176GB model, or more.
Both options require the entire model to fit inside SRAM.
The real reason batching increases latency is multi-factored and more complex to explain.
When an author is confused about something so elementary, I can’t trust anything else they write.
Reality is more complex. As context length grows your KV cache becomes large and will begin to dominate your total FLOPs (and hence bytes loaded). The issue with KV cache is you cannot batch it because only one user can use it, unlike static layer weights where you can reuse them across multiple users.
Emerging sparse attention techniques can greatly relieve this issue though the extent to which frontier labs deploy them is uncertain. Deepseek v3.2 uses sparse attention though I don't know off hand how much this reduces KV cache FLOPs and associated memory bandwidth.
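To make the scale concrete, here is a rough per-request KV-cache size; the model shape is illustrative (a hypothetical GQA model), not any specific frontier model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V tensors per layer, per KV head, per position, in bf16.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

size = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 1e9:.1f} GB of cache for a single 128k-context user")
```

Unlike the weights, none of that can be shared across users in a batch, which is the point above.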
This is not really correct given how input token caching works and the reality of subagent workloads. You could launch many parallel subagents sharing some portion of their input tokens and use batching for that task.
1. Parallel investigation: the payoff from that is relatively small - starting K subagents assumes you have K independent avenues of investigation, and quite often that is not true. Somewhat similar to next-turn prediction using a speculative model: it works well enough for 1 or 2 turns, but fails after that.
2. Input caching pretty much fixes prefill - not decode. And if you look at frontier models - for example, open-weight models that can do reasoning - you are looking at longer and longer reasoning chains for heavy tool-using models. And reasoning chains will diverge very, very quickly even from the same input, assuming a non-zero temperature.
Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold where stuffing more requests into a batch will slow down every request in isolation, even though it may still increase the number of tokens/second across the whole batch for all requests in aggregate.
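A crude roofline estimate of where that threshold sits, using assumed H100-like numbers (~1e15 bf16 FLOP/s, ~3.35e12 B/s of HBM bandwidth):

```python
def crossover_batch(peak_flops, mem_bandwidth, dtype_bytes=2):
    # Per decode step, a batch of B loads all P weight bytes once and does
    # ~2*P*B FLOPs, so arithmetic intensity is 2*B/dtype_bytes FLOPs per byte.
    # Decoding turns compute-bound once that exceeds the hardware's ratio.
    return peak_flops * dtype_bytes / (2 * mem_bandwidth)

print(crossover_batch(peak_flops=1e15, mem_bandwidth=3.35e12))
```

On those assumptions the crossover lands around ~300 concurrent decode streams: below that you're bandwidth-bound, above it compute-bound.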
My personal take is that they will need a big model to plan, break down tasks, and schedule them to specialized smaller models, while a good-enough model handles real-time interactions with the user - but that's the naive take, and many other things might be shaping the decisions.
Seems like nonsense to me.
OpenAI and Cerebras have been working together at some level for nearly a decade.
Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a drafting model).
But my money is on the exact two mechanisms the OP proposes.
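For intuition, here is a toy greedy version of that mechanism; both "models" are stand-in functions, and a real implementation would verify all of the draft's guesses in a single batched forward pass of the large model:

```python
def speculative_decode(target, draft, prefix, n_tokens, k=4):
    # Greedy speculative decoding: the output always matches what the target
    # model would produce alone; the draft only changes how fast we get there.
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        ctx, guesses = list(out), []
        for _ in range(k):          # cheap draft proposes k tokens ahead
            g = draft(ctx)
            guesses.append(g)
            ctx.append(g)
        for g in guesses:           # target verifies guess by guess
            if target(out) == g:
                out.append(g)               # accepted: a nearly free token
            else:
                out.append(target(out))     # rejected: take the target's token
                break
    return out[len(prefix):][:n_tokens]

# Stand-ins: the target counts mod 10; the draft gets it wrong after a 7.
demo_target = lambda ctx: (ctx[-1] + 1) % 10
demo_draft = lambda ctx: 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10
print(speculative_decode(demo_target, demo_draft, [0], 12))
```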
It is worth noting that consumers are completely and totally incapable of detecting quality degradation with any accuracy. That's partly a given, since model outputs are already effectively random, but there is a strong bent toward hallucinating degradations. Having done frontend work for an AI startup, I can say complaints that we had degraded the model were by far the most common, despite the fact that not only did our model not change, users could easily verify it hadn't changed because we expose seeds. A significant portion of complainers continued to complain about model degradation even when shown they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
Yeah they can’t tell, but also there’s lots of incentive for major LLM providers to lie about not doing something that would massively save their inference costs if they did.
It seems OAI was forced by investors to shift quickly to making money. Anthropic seem to have more time? Might be hard for OAI to keep the pace while focusing on cost.