I don't see why that would be true. As I understand it, the verifier is checking whether the tokens are good enough, not whether they're the exact same tokens it would have selected. The predicted tokens could be consistently slightly worse, which could have a cascading effect that makes the overall output a lot worse.
You can do exact verification, and as soon as a token mismatches you reject that token and everything after it in your draft. Relaxed acceptance techniques measure how wrong that mispredicted token is via some metric, and accept it if it's close enough. So you get longer draft lengths with higher acceptance rates.
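A minimal sketch of the exact case (the tensor names and shapes are illustrative assumptions, not any particular library's API): keep the longest prefix where the big model's greedy choice agrees with the draft, and discard the rest.

```python
import torch

def exact_verify(draft_tokens, target_logits):
    # draft_tokens: (n,) token ids proposed by the draft model
    # target_logits: (n, vocab) full-size model logits at those positions
    target_choice = target_logits.argmax(dim=-1)   # what the big model would emit
    matches = (draft_tokens == target_choice)
    # length of the matching prefix; everything after the first mismatch is rejected
    accepted = int(matches.long().cumprod(dim=0).sum())
    return draft_tokens[:accepted]
```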
That's up to you; it depends on how you implement it and how much you want to prioritize speed at the expense of quality. It isn't an intrinsic attribute of speculative decoding. The verifier checks whether the tokens predicted by the draft model are among the top-k tokens predicted by the full-size model at each step. Set k to 1 and you will only accept perfect matches. Set k > 1 and you will indeed start accepting "good enough" tokens, but you will get faster inference.
But no matter what value you choose for k, the technique described in the article applies and will give faster inference at no quality loss compared to a setup without it that uses the same value of k.
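A hedged sketch of that top-k acceptance check, assuming you already have the full-size model's logits at each drafted position (function name and shapes are made up for illustration); k=1 reduces to exact matching.

```python
import torch

def topk_verify(draft_tokens, target_logits, k=1):
    # draft_tokens: (n,) token ids from the draft model
    # target_logits: (n, vocab) logits from the full-size model
    topk_ids = target_logits.topk(k, dim=-1).indices             # (n, k)
    in_topk = (topk_ids == draft_tokens.unsqueeze(-1)).any(-1)   # (n,) bool
    # accept the longest prefix of "good enough" tokens, reject the rest
    accepted = int(in_topk.long().cumprod(dim=0).sum())
    return draft_tokens[:accepted]
```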
The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.
Yes. This is because to generate token n+1 you need token n, and so on, so generating from scratch is a sequential (and thus slow) process. When we verify tokens, we can, for each token, use all preceding tokens as input and check that the output token matches the expectation. But since the full sequence we want to verify already exists, we can do this for every token in parallel rather than sequentially.
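Roughly, in code (assuming a HuggingFace-style causal LM where `model(input_ids).logits` returns logits for every position; treat this as a sketch, not a drop-in implementation):

```python
import torch

@torch.no_grad()
def count_accepted(model, prefix_ids, draft_ids):
    # One forward pass over prefix + draft scores every drafted position at once,
    # because causal masking means position i only attends to positions <= i.
    # Generating the same n tokens from scratch would take n sequential passes.
    input_ids = torch.cat([prefix_ids, draft_ids]).unsqueeze(0)   # (1, L + n)
    logits = model(input_ids).logits[0]                           # (L + n, vocab)
    L, n = prefix_ids.numel(), draft_ids.numel()
    # logits at position j predict token j + 1, so the predictions for the
    # drafted tokens live at positions L-1 .. L+n-2
    preds = logits[L - 1 : L + n - 1].argmax(dim=-1)              # (n,)
    matches = (preds == draft_ids)
    return int(matches.long().cumprod(dim=0).sum())               # accepted prefix length
```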
This is why training transformer models is much faster than training RNNs: we do the same thing during training, it's just that the sequence we compare against is the ground truth rather than the output of another model.
That said, I still think some providers are cheating. Please correct me if the test below is flawed.
I generated texts at temperature = 0 vs temperature = 2. At high temperature, the distributions effectively become flatter, meaning the difference between real and draft effective distributions (the D_LK used in theorem 3.5 of 2211.17192) becomes smaller. When T=2, the model speaks complete gibberish, so the effective distribution must be pretty flat. This should mean fewer rejections --> a lot faster speculative decoding. Yet, I see no increase in throughput at all...
However, if you have a higher temperature but are still operating under top-k sampling where k is small, I'm not sure it's going to translate into any noticeable difference, since the top-k cutoff keeps your actual distributions very much non-uniform.
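A toy way to sanity-check both points, using the fact from 2211.17192 that the expected per-token acceptance rate is 1 - D_LK(p, q) = sum_x min(p(x), q(x)). The logits, temperature scaling, and simplified top-k handling below are illustrative assumptions, not how any provider actually implements sampling:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def truncate_top_k(dist, k):
    # keep only a distribution's own k most likely tokens, renormalised
    keep = np.argsort(dist)[-k:]
    out = np.zeros_like(dist)
    out[keep] = dist[keep]
    return out / out.sum()

def expected_acceptance(target_logits, draft_logits, T=1.0, top_k=None):
    # expected per-token acceptance rate = 1 - D_LK(p, q) = sum_x min(p(x), q(x))
    p = softmax(np.asarray(target_logits, dtype=float) / T)  # target distribution
    q = softmax(np.asarray(draft_logits, dtype=float) / T)   # draft distribution
    if top_k is not None:
        p = truncate_top_k(p, top_k)
        q = truncate_top_k(q, top_k)
    return float(np.minimum(p, q).sum())
```

Raising T pushes the raw distributions toward uniform and the acceptance rate toward 1, but with a small top_k each model's mass stays concentrated on its own k tokens, and if those sets don't overlap much the acceptance rate stays low.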
I didn't set a top-k. So it seems like Together must be doing something weird in their speculative decoding implementation.
IMO this is likely what you get from running the model correctly as-is (i.e. with the same weight and activation dtypes), so Together is not doing anything wrong here.
Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.
So really the only thing this shows is that Nebius, Chutes, and AtlasCloud could be running something else (for example, a further-quantized model), or have bugs.
Anyway, Novita is doing significantly better on the vendor verifier chart than Together, so the low quality must be partially Together's fault at least.
TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate
A light-weight speculative model adapts to usage, keeping the acceptance rate for the static heavy-weight model within acceptable bounds.
Do they adapt with LoRAs?
Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?
and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905
You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty good on it too, but still, the difference between a top provider and an also-ran is giant.
God I love OpenRouter.
I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code
At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude, GPT-5, or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language). At least for programming tasks, it'd eventually be super cool if they could offer more models.
Or have some sort of a partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limits occasionally.
[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.
What I don't understand is Groq reporting 200tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
OpenRouter numbers look fishy.
SambaNova should be similar...they've got a similar specialized hardware approach