Otherwise, if you have a GPU with more than about 4GB of VRAM, there are better models. Gemma4 and Qwen3.6 (or Qwen3.5 if you need the smaller dense models, which haven't yet been released for 3.6) are a good place to start.
What are you using for inference? I have a recent Intel laptop with 32GB of DDR5 and I'm getting at most 25 tps with the llama.cpp Vulkan backend (that's the fastest; I also tried SYCL, but it's a bit slower).
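Not the parent, but if you want to sanity-check tokens/second on your own machine, here's a minimal sketch using llama-cpp-python (my choice, not necessarily what anyone here ran; the model path and settings are placeholders):

```python
# Rough tokens/second measurement with llama-cpp-python.
# pip install llama-cpp-python -- the model path below is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", n_ctx=4096, verbose=False)

start = time.perf_counter()
out = llm("Explain sliding-window attention in one paragraph.", max_tokens=200)
elapsed = time.perf_counter() - start

# Note: this timing includes prompt processing, so it will read slightly
# lower than a pure decode-phase tok/s number.
generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```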
Prediction Stats:
Stop Reason: eosFound
Tokens/Second: 21.10
Time to First Token: 1.827s
Prompt Tokens: 42
Predicted Tokens: 187
Total Tokens: 229

The only thing I'm not sure about is whether this model supports thinking or not.
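As a rough sanity check, the stats above are self-consistent (assuming Tokens/Second covers only the decode phase, not the prompt):

```python
# 187 predicted tokens at 21.10 tok/s implies ~8.86s of generation,
# plus the 1.827s spent before the first token appeared.
predicted_tokens = 187
tokens_per_second = 21.10
time_to_first_token = 1.827

generation_time = predicted_tokens / tokens_per_second   # ~8.86s
total_time = time_to_first_token + generation_time       # ~10.7s
print(f"generation: {generation_time:.2f}s, total: ~{total_time:.1f}s")
```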
Presumably the larger model takes longer to post-train, but it should follow in the near future, after those smaller LFM2.5 models.
I find Gemmas really good for a short conversation with maybe 3 or 4 exchanges of a few paragraphs each, which covers a surprisingly large number of interactions.
For anything longer form though, particularly with larger code contexts, Qwen is far more useful for me personally.
I'm not an expert in this field, but my understanding is that Qwen uses a hybrid gated attention mechanism, whereas Gemma's hybrid includes a sliding-window attention mechanism, which seems to make it favour the most recent tokens a little too much at times (see the sketch below).
This is all in the context of local quantized models; I'm aware both have larger cloud variants that wouldn't suffer as much.
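To illustrate the sliding-window point, here's a toy sketch of the two mask shapes (my own illustration, not Gemma's actual configuration; real models use much larger windows and interleave these layers with full-attention ones):

```python
# Compare a full causal attention mask with a sliding-window mask.
# With window W, token i only attends to tokens i-W+1 .. i, so
# anything older than W tokens is invisible to that layer.
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # True where attention is allowed: token i sees tokens 0..i
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    # Token i sees only the last `window` tokens (itself included)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

print(causal_mask(8).astype(int))
print(sliding_window_mask(8, window=3).astype(int))
```

In the second printout the bottom rows drop their leftmost columns: earlier tokens simply fall out of the window, which is the recency bias I'm describing.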
- GPQA Diamond: 47.4% vs 84.1% for Qwen
- HLE: 4.4% vs 20.2% for Qwen
- AA Omniscience Accuracy: 6.4% vs 18.9% for Qwen
- AA Hallucination Rate: 30.0% vs 50.3% for Qwen