DeepSeek R1 also has an MTP layer (layer 61): https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...
But DeepSeek R1 adds embed_tokens and shared_head.head tensors to it, each [129280, 7168], which together comes to roughly 2GB at FP8.
Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...
So it saves a few GB of active parameters for MTP, which is a Big Deal. This is one of the changes that significantly speeds up inference.
Instead of generating tokens one at a time, you also generate the second one, and then use speculative decoding to check that second token (instead of having it be produced by a draft model like Qwen 0.6B). If the check passes, the 2nd token was effectively generated MUCH faster.
If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
I’m not an expert on LLMs, just a user.
Checking a token costs the same as generating it, though; it's a full forward pass either way.
The benefit, however, is in the next (third) token. After generating tokens 1 and 2 (in one turn), you start generating token 3 (and 4). You also get the "real" prediction for token 2. If the "real" prediction matches the MTP (Multi-Token Prediction) guess from the previous turn, you have just generated 3 correct tokens (plus another speculative one). If not, you've now corrected token 2, but token 3 is wrong (it follows the wrong token 2), so you need to generate it again.
[1] https://en.wikipedia.org/wiki/Speculative_execution
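If it helps, here's a runnable toy (Python) of just that accept/correct bookkeeping. The "model" is a lookup into a fixed target string and the MTP head is a guess that's right ~80% of the time, so only the control flow is meant to be representative of anything:

```
import random

# Toy simulation of MTP-style self-speculation. "model_next" stands in for a
# full forward pass (it just looks up the true next token) and "mtp_guess"
# stands in for the cheap MTP head.

TARGET = list("the quick brown fox jumps over the lazy dog")

def model_next(pos):
    return TARGET[pos]

def mtp_guess(pos):
    return TARGET[pos] if random.random() < 0.8 else "?"

out, full_passes, guess = [], 0, None
while len(out) < len(TARGET):
    full_passes += 1                      # one full forward pass per loop
    if guess is not None and guess == model_next(len(out)):
        out.append(guess)                 # speculation verified: a "free" token
        if len(out) == len(TARGET):
            break
    out.append(model_next(len(out)))      # the pass's own next token (or the correction)
    guess = mtp_guess(len(out)) if len(out) < len(TARGET) else None

print("".join(out) == "".join(TARGET), f"{len(TARGET)} tokens in {full_passes} passes")
```

When the guess verifies, a single pass yields two tokens; when it doesn't, you only get the correction, which is the "generate it again the normal way" case above.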
Does it work to predict tokens 3 and 4 (or 5, 6) in the same way? I wonder how extreme the hit rate drop-off is.
Let's say you have a model that generates the string "The 44th president of the United States is ___ ___". Your model will generate "Barack" as the n+1 token, and the MTP layer probably does a good enough job to generate "Obama" as the n+2 token (even though that MTP layer is a mere <2B parameters in size). Then you just check whether "Obama" is correct via the same speculative-decoding process, which is a lot faster than if you had to run through layers 1-48 again and generate "Obama" the regular way.
That doesn't match my understanding of what speculative decoding does: AFAIK with regular speculative decoding you ask a smaller LLM to infer the next few tokens (let's say 5 tokens), and then you can have the big model infer tokens 1, 2, 3, 4, 5 and 6 in parallel (each time starting from the sentence partially completed by the smaller model). Because LLMs are bandwidth-bound, doing the same work six times in parallel isn't slower than doing it only once (what's costly is moving the massive model weights between VRAM and the GPU cores).
If tokens 1, 2 and 3 match what the small model inferred, then you keep them. As soon as you have a mismatched token (say token 4), it means you have to discard the following inferred tokens (here tokens 5 and 6), because they were calculated under a wrong assumption for token 4.
So if the MTP layer merely replaces the smaller LLM in the previous scheme, with everything else working the same way, you wouldn't save anything when inferring "Obama" (you'd still need to "generate it the regular way", as there isn't really another way), but you could also start working on the word immediately after "Obama" by assuming "Obama" was already chosen. And if the model actually outputted "Hussein" instead of "Obama", then the token calculated to come after "Obama" would have to be discarded.
Or maybe my understanding of speculative decoding is completely off…
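A runnable toy of the scheme as described above, for reference: the big model scores the K drafted tokens in one batched pass, keeps the longest correct prefix, and adds one token of its own. The fake "draft model" here is right ~70% of the time, and only the bookkeeping is meant to be meaningful:

```
import random

TARGET = list("speculative decoding keeps the longest correct prefix")
K = 4

def big_model(pos):                       # ground-truth next token at a position
    return TARGET[pos]

def draft_model(pos, k):                  # cheap draft, right ~70% of the time
    return [TARGET[min(pos + i, len(TARGET) - 1)] if random.random() < 0.7 else "?"
            for i in range(k)]

out, big_passes = [], 0
while len(out) < len(TARGET):
    drafts = draft_model(len(out), K)
    big_passes += 1                       # ONE batched big-model pass checks all K drafts
    accepted = 0
    for i, d in enumerate(drafts):
        if len(out) + i < len(TARGET) and d == big_model(len(out) + i):
            accepted += 1
        else:
            break
    out.extend(drafts[:accepted])
    if len(out) < len(TARGET):
        out.append(big_model(len(out)))   # correction (or bonus token if all accepted)

print("".join(out) == "".join(TARGET), f"{big_passes} big-model passes for {len(TARGET)} tokens")
```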
If n+1 = "Barack" then n+2 = "Obama" (confidence: 0.90) If n+1 = "The" then n+2 = "quick" (confidence: 0.45) If n+1 = "President" then n+2 = "Biden" (confidence: 0.75)
A threshold is set (say, 90%) so that if the n+2 prediction is above that (as in the first example), it uses it without having to determine it with the main model. It's confident "enough".
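A toy sketch of that thresholded scheme (names and numbers made up; note that unlike verification-based speculative decoding, skipping the main-model check trades exactness for speed):

```
# Hypothetical confidence-gated acceptance: trust the cheap n+2 guess outright
# when its confidence clears a threshold, otherwise compute it with the main model.

THRESHOLD = 0.90

def pick_n_plus_2(cheap_guess, confidence, compute_with_main_model):
    if confidence >= THRESHOLD:
        return cheap_guess                 # confident enough: skip the main model
    return compute_with_main_model()       # not confident: do it the normal way

print(pick_n_plus_2("Obama", 0.90, lambda: "Obama"))   # cheap path
print(pick_n_plus_2("quick", 0.45, lambda: "brown"))   # falls back to the main model
```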
You compute the next token and guess the one after. Then you tentatively take the guess as real: while running inference to verify the guessed token, you also compute the token after it, which is speculated on the guess being correct.
It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...
and other helpers like Artem Kirsanov:
Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available do not accurately describe what's going on enough to reason about and understand them. So, you just have to start reading!
I don't think it moves this fast.
I mean there are very few fundamental differences between GPT-2 and gpt-oss-120b; it's just incremental improvements that don't change much of the full picture (a variation on the attention architecture and masking, a different activation function, different positional encoding, and swapping the MLP layers for a sparse "mixture of experts"). At the end of the day, from Mistral to DeepSeek by way of Llama and Qwen3, it's always the same stack of transformer layers with slight variations between any two architectures.
This Qwen3-Next is special though, as it's the first time a major player is releasing something that different (lesser players have made hybrid architecture LLMs for the past two years, but when it comes to language models, IBM really isn't comparable to Alibaba). This is what I expected Llama4 to be.
LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e f(x) = downscale(upscale(x)). Each of downscale() and upscale() are parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).
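A crude numerical cartoon of that upscale/downscale picture (arbitrary toy sizes and random weights; real models stack many layers rather than using one matrix each way):

```
import numpy as np

vocab, hidden = 8, 4                          # tiny toy sizes
upscale   = np.random.randn(vocab, hidden)    # token id -> high-dimensional vector
downscale = np.random.randn(hidden, vocab)    # vector -> one score per vocabulary word

def next_word_probs(token_id):
    h = upscale[token_id]                     # "upscale" into the big space
    scores = h @ downscale                    # "downscale" back to vocab size
    e = np.exp(scores - scores.max())
    return e / e.sum()                        # softmax: one probability per word

print(next_word_probs(3).round(3))            # sums to 1.0 across the 8-word vocab
```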
---
When training, you make an LLM learn that
I use arch = downscale(upscale(I use))
If you want to predict the next word after that, you do next in sequence the following:
I use arch btw = downscale(upscale(I use arch))
Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.
i.e in parallel:
I use arch = downscale1(upscale(I use))
I use ____ btw = downscale2(upscale(I use))
However, this way you'll need twice the number of parameters downscale needs. And if you want to predict more tokens ahead you'll need even more parameters.
What Qwen has done, is instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common and the difference between predicting the next and next-to-next token can be captured in one lightweight function each. Lightweight here, means less parameters. The bet paid off.
So overall, you save params.
Concretely,
Before: downscale1.params + downscale2.params
After: downscale_common.params + lightweight1.params + lightweight2.params
Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. Doesn't change the crux of the answer, but just including this for clarity.
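To put rough numbers on that bet (illustrative sizes only, not Qwen's actual head dimensions):

```
hidden, vocab, light = 7168, 129280, 1024     # illustrative sizes only

# Naive MTP: a full-size downscale head per predicted position.
naive = 2 * hidden * vocab

# Shared-trunk MTP: one full-size head plus a small "lightweight" adapter per position.
shared = hidden * vocab + 2 * hidden * light

print(f"naive:  {naive  / 1e9:.2f}B head parameters")
print(f"shared: {shared / 1e9:.2f}B head parameters")
```

With these made-up sizes the shared version is roughly half the head parameters, which is the flavor of the saving described above.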
I use ____ ___ = downscale_common(lightweight1(.)) + downscale_common(lightweight2(.)) ?
And does it generate 2 at a time and keep going that way, or is there some overlap? If you try to predict the whole thing at once you might end up with
I like cats because they are all the rats and they garden
> Overlap
Check out an inference method called self-speculative decoding, which solves (somewhat) the above problem of k-token prediction by overlapping the same ___ across multiple computations.
Qwen3-Next — A family of large language models from Qwen (Alibaba).
DeepSeek R1 — Another large open-source language model from DeepSeek AI.
Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.
MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.
Embedding — Converts words/tokens into vectors (numbers) the model can work with.
Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.
embed_tokens — The big lookup table of embeddings (token → vector).
shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.
[129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).
FP8 — Floating-point format using 8 bits (compact, faster, less precise).
Active parameters — The subset of weights actually used to process each token (in a sparse MoE, only some experts fire), which is what bounds per-token compute and memory traffic.
Inference — Running the model to generate text (as opposed to training it).
GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.
I just tried Qwen3-Next-80B-A3B on Qwen Chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.
According to Qwen Chat, Qwen3-Next has the following limits:
Maximum context length: 262,144 tokens
Max summary generation length: 32,768 tokens
This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn
> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks
Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens of context length. My use case is uploading large PDFs and asking questions across chapters.
I still see the model becoming more intoxicated as turn count gets high.
please don't add any comments in the code unless explicitly asked to, including the ones that state what you changed. do not modify/remove any existing comments as long as they are valid. also output the full files that are changed (not the untouched ones), and no placeholders like "no change here" etc. do not output the xml parts in the output.xml file. focus on the individual files. before and after outputting code, write which file it would be and the path (not as a comment in the code but instead, before and after outputting code).
Attached is a 400k token xml file, being the output of:
https://pastebin.com/raw/SH6JHteg
Main prompt is a general description of the feature needed and PDF exports from figma.
All done for free in aistudio and I consistently get better results than the people using claude code.
> Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.
Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...
I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.
I am not in a position where I can self-host yet
I do sometimes chop up the PDF into smaller pdfs with their own individual chapters
Here's a classic ASCII art representation of SpongeBob SquarePants:
.------.
/ o o \
| |
| \___/ |
\_______/
llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
"An ASCII of spongebob"
Here's an ASCII art of SpongeBob SquarePants:
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..--.' /
|(_.' | / .-\-. \---.
\ 0| | ( O| O) | |
| _ | .--.____.'._.-. |
\ (_) | o -` .-` |
| \ |`-._ _ _ _ _\ /
\ | | `. |_||_| |
| o | \_ \ | -. .-.
|.-. \ `--..-' O | `.`-' .'
_.' .' | `-.-' /-.__ ' .-'
.' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
`-._ `. |________/\_____| `-.'
.' ).| '=' '='\/ '=' |
`._.` '---------------'
//___\ //___\
|| ||
||_.-. ||_.-.
(_.--__) (_.--__)
Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end - missing the legs off. Then this morning the very same prompt to the same model API produced a perfect bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.
A scuffed but fully original ASCII SpongeBob is usually more valuable than a perfect recall of an existing one.
One major issue with highly sparse MoE is that it appears to advance memorization more than it advances generalization. Which might be what we're seeing here.
The larger model already has it in the training corpus, so it's not a particularly good measure though. I'd much rather see the capabilities of a model in trying to represent in ASCII something that it's unlikely to have in its training.
Maybe a pelican riding a bike as ascii for both?
And that is also exactly how we want them not to work: we want them to be able to solve new problems. (Because Pandora's box is open, and they are not sold as a flexible query machine.)
"Where was Napoleon born": easy. "How to resolve the conflict effectively": hard. Solved problems are interesting to students. Professionals have to deal with non trivial ones.
speak for yourself, I like solving problems and I'd like to retire before physical labor becomes the only way to support yourself
> they are not sold as a flexible query machine
yeah, SamA is a big fucking liar
llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
"An ASCII of spongebob"
Here's an ASCII art of SpongeBob SquarePants:
```
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..--. /
|(_.' | / .-\-. \
\ 0| | ( O| O) |
| _ | .--.____.'._.-.
/.' ) | (_.' .-'"`-. _.-._.-.--.-.
/ .''. | .' `-. .-'-. .-'"`-.`-._)
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
.'.' | | | | | | | | | |
```
Humans do it too. I have given up on my country's non-local information sources, because I could recognize original sources that are being deliberately omitted. There's a satiric webpage that is basically a reddit scrape. Most of users don't notice and those who do, don't seem to care.
.--..--..--..--..--..--.
.' \ (`._ (_) _ \
.' | '._) (_) |
\ _.')\ .----..---. /
|(_.' | / .-\-. \ |
\ 0| | ( O| O) | o|
| _ | .--.____.'._.-. |
\ (_) | o -` .-` |
| \ |`-._ _ _ _ _\ /
\ | | `. |_||_| |
| o | \_ \ | -. .-.
|.-. \ `--..-' O | `.`-' .'
_.' .' | `-.-' /-.__ ' .-'
.' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
`-._ `. |________/\_____| `-.'
.' ).| '=' '='\/ '=' |
`._.` '---------------'
//___\ //___\
|| ||
||_.-. ||_.-.
(_.--__) (_.--__)
Enjoy your SpongeBob ASCII!

And unless their terminal details are included in the context, they'll just have to guess.
In fact, every run of spaces from ' ' (one space) up to 79 spaces has its own single token in the OpenAI GPT-4 tokenizer. Sometimes a run of spaces plus '\n' is also assigned a single token.
You might ask why they do this, but it's to make programming work better by reducing token counts. All the whitespace before the code gets jammed into a single token, and entire empty lines also get turned into a single token.
There are actually lots of interesting hand crafted token features added which don't get discussed much.
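You can poke at this yourself with the tiktoken library, if you have it installed; the exact cutoffs depend on the encoding, so treat the run lengths below as things to check rather than as facts:

```
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # GPT-4-era tokenizer

# How many tokens does a run of N spaces take?
for n in (1, 4, 16, 64, 80):
    print(f"{n:>3} spaces -> {len(enc.encode(' ' * n))} token(s)")

# Indentation plus a newline often collapses into very few tokens too.
print(len(enc.encode("    \n")), "token(s) for four spaces + newline")
```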
With Qwen3's sparse-MoE, though, the path to that memory is noisier: two extra stochastic draws (a) which expert(s) fire, (b) which token gets sampled from them. Add the new gated-attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.
Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:
Model: chutes/Qwen/Qwen3-235B-A22B
/~\
( * * )
( o o o )
\ - /
\ /\ /
\ /
\/
/|||\
/|||||\
/||||||||\
( o o o )
\ W /
\___/
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507

 /\_/\
( o.o )
> ^ <
/ \
| |
| |
\ /
'-'-'
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507

.-----------.
| [] [] |
| |
| __ __ |
| | || | |
| |__||__| |
| |
'-----------'
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct

__
/ \
/ \
/ \
/ \
/__________\
| o o |
| __ |
| \____/ |
| |
| ________ |
| \ / |
| \____/ |
| |
|____________|
___________
/ \
/ \
| _______ |
| | | |
| | ___ | |
| | | | | |
| | |___| | |
| |_______| |
| |
|_______________|
Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking

.-.
/ \
| o o|
| > |
| ---|
\___/
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507

_________________________
/ \
| _ _ _ _ |
| / \ / \ / \ / \ |
| | | | | | | | | |
| \_/ \_/ \_/ \_/ |
| |
| _ _ _ _ |
| / \ / \ / \ / \ |
| | | | | | | | | |
| \_/ \_/ \_/ \_/ |
| |
| SpongeBob SquarePants |
|_________________________|
Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.
Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405b does not deserve that slander.
https://artificialanalysis.ai/#frontier-language-model-intel...
Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405b was the very first open source (open weights) model to come close to the closed source cutting edge. It was very very close in performance to GPT-4o and Claude 3.5 Sonnet.
In essence, it was Deepseek R1 before Deepseek R1.
> dense
> 405B model
Llama4 does not match any of these details. Maybe the commenter thinks their comment is about Llama4 (I don't see a reason to believe so) but readers familiar with these details know they are referring to Llama3.1.
Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind...
Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.
There are so many things you can do with long running, continuous inference.
Case in point: I'd like something that assesses, in real time, all the sensors and API endpoints of stuff in my home and, as needed, bubbles up summaries, diaries, and emergency alerts. Right now that's probably a single H200, and well out of my "value range". The number of people in the world that do this now at scale is almost certainly less than 50k.
If that inference cost went to 1%, then a) I'd be willing to pay it, and b) there'd be enough of a market that a company could make money integrating a bunch of tech into a simple deployable stack, and therefore c) a lot more people would want it, likely enough to drive more than 50k H200s worth of inference demand.
Why can't you build this today?
[0]: https://arxiv.org/pdf/2506.02153 Small Language Models are the Future of Agentic AI (Nvidia)
It doesn’t sound like you need to run a H200 to bridge the gap between what currently exists and the outcome you want.
edit: this reminds me of a state agency I once worked for who fired their only IT guy after they moved offices, because the servers were running just fine without him. It was a Kafkaesque trauma for him for a moment, but a massive raise a week later when they were renegotiating for him to come back.
Is that true? BLS estimates of customer service reps in the US is 2.8M (https://www.bls.gov/oes/2023/may/oes434051.htm), and while I'll grant that's from 2023, I would wager a lot that the number is still above 2M. Similarly, the overwhelming majority of software developers haven't lost their jobs to AI.
A sufficiently advanced LLM will be able to replace most, if not all of those people. Penetration into those areas is very low right now relative to where it could be.
In any case, AI is not capable of fully replacing customer care. It will make it more efficient but the non-deterministic nature of LLMs mean that they need to be supervised for complex cases.
Besides, I still think even the inference demand for customer care or programming will be small in the grand scheme of things. EVERY Google search (and probably every gmail email) is already passed through an LLM - the demand for that alone is immense.
I'm not saying demand won't increase, I just don't see how demand increases so much that it offsets the efficiency gains to such an extent that Oracle etc are planning tens or hundreds of times the need for compute in the next couple of years. Or at least I am skeptical of it to say the least.
There are plenty of usecases where the models are not smart enough to solve the problem yet, but there is very obviously a lot of value available to be harvested from maturing and scaling out just the models we already have.
Concretely, the $200/mo and $2k/mo offerings will be adopted by more prosumer and professional users as the product experience becomes more mature.
Besides, this would only apply for very few use cases. For a lot of basic customer care work, programming, quick research, I would say LLMs are already quite good without running it 100X.
The compute/intelligence curve is not a straight line. It's probably more a curve that saturates, at like 70% of human intelligence. More compute still means more intelligence. But you'll never reach 100% human intelligence. It saturates way below that.
Thinking it will go beyond human limits is just wishful thinking at this point. There is no reason to believe it.
Whatever is good enough now, can be much better for the same cost (time, computation, actual cost). People will always choose better over worse.
If you come up with a way to make the current generation of models 10x more efficient, then everyone just moves to train a 10x bigger model. There isn’t a size of model where the players are going to be satisfied at and not go 10x bigger. Not as long as scaling still pays off (and it does today).
Every time a new model is released, people abandon the old, lower quality model (even when it’s priced less), and instead prefer to pay the same for a better model.
The same will happen with this.
If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff – it will be other computers. I think looking back a few years from now, every single piece of silicon being planned right now will look like a laudable but laughable drop in the ocean.
(A striking example I read today of real demand for quality: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)
Full set of open weight model results: https://brokk.ai/power-ranking?version=openround&models=ds-r...
My real-world usage does not line up with these results, but I'm not working with Java.
That being said, Qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution (OOD), the models utterly fail to deliver, where the closed models still provide value.
- in math, if they can solve a problem, or a class of problems, they'll solve it. If you use a "thinking" model + maj@x, you'll get strong results. But if you try for example to have the model consider a particular way or method of exploring a problem, it'll default to "solving" mode. It's near impossible to have it do something else with a math problem, other than solving it. Say "explore this part, in this way, using this method". Can't do it. It'll maybe play a bit, but then enter "solving" mode and continue to solve it as it was trained.
In practice, this means that "massive parallel" test time compute becomes harder to do with these models, because you can't "guide" them towards certain aspects of a problem. They are extremely "stubborn".
- in coding it's even more obvious. Ask them to produce any 0shot often tested and often shown things (spa, game, visualisation, etc) - and they do it. Convincingly.
But ask them to look at a piece of code and extract meaning, and they fail. Or ask them to reverse an implementation. Figure out what a function does and reverse its use, or make it do something else, and they fail.
It does sound like an artifact of the dialog/thinking tuning though.
I can't even begin to understand what that would mean.
I use ollama every day for spam filtering: gemma3:27b works great, but I use gpt-oss:20b on a daily basis because it's so much faster and comparable in performance.
This is pretty impressive and a bit like how the GPT-OSS-120B came out and scored pretty well on the benchmarks despite its somewhat limited size.
That said, using LLMs for software dev use cases, I wouldn't call 256K tokens "ultra-long" context, I regularly go over 100K when working on tasks with bigger scope, e.g.:
Look at the existing code related to this functionality and the existing design patterns in the code as well as the guidelines.
Then plan out the implementation in detail and ask me a few questions along the way to figure the details out better.
Finally, based on everything so far, do the actual implementation.
Then look it over and tell me if anything has been missed from the plan, then refactor the code in any number of ways.
It could be split up into multiple separate tasks, but I find that the context being more complete (unless the model starts looping garbage, which poisons the context) leads to better results.

My current setup of running Qwen3 Coder 480B on Cerebras bumps into the 131K token limit. If not for the inference speed there (seriously great) and good enough model quality, I'd probably look more in the direction of Gemini or Claude again.
This stuff can run on a local machine without internet access, correct?
And it can pretty much match Nano Banana? https://github.com/PicoTrex/Awesome-Nano-Banana-images/blob/...
Also -- what are the specs for a machine to run it (even if slowly!)
This has nothing to do with nano banana, or image generation. For that you want the qwen image edit[1] models.
The model discussed here is a text model, similar to ChatGPT. You'll also be able to run it on your local machine, but not yet, as apps need to be updated with Qwen3-Next support (llama.cpp, Ollama, etc.).
Yes.
> And it can pretty much match Nano Banana?
No, Qwen3-Next is not a multimodal model, it has no image generation function.
Make sure to lurk on r/LocalLlama.
Please do take everything you read there with a bit of salt though, as the "hive-mind" effect is huge there, even when compared to other subreddits.
I'm guessing the huge influx of money + reputations on the line + a high traffic community is ripe for both hive-minding + influence campaigns.
What will the actual next advanced release be called:
* next-next
* next (2)
* actual-next-final
I'm skeptical about these claims. How can this be? Wouldn't there be massive loss of world knowledge? I'm particularly skeptical because a recent trend in Q2 2025 has been benchmaxxing.
More efficient architecture.
> Wouldn't there be massive loss of world knowledge?
If you assume equally efficient architecture and no other salient differences, yes, that’s what you’d expect from a smaller model.
I recommend playing with the free hosted models to draw your own conclusions: https://chat.qwen.ai/
It's amazing how far and how short we've come with software architectures.
But in practice you need a bit more than that. You also need some space for context, and then for kv cache, potentially a model graph, etc.
So you'll see in practice that you need 20-50% more RAM than this rule of thumb suggests.
For this model, you'll need anywhere from 50GB (tight) to 200GB (full) RAM. But it also depends how you run it. With MoE models, you can selectively load some experts (parts of the model) in VRAM, while offloading some in RAM. Or you could run it fully on CPU+RAM, since the active parameters are low - 3B. This should work pretty well even on older systems (DDR4).
That being said, there are libraries that can load a model layer by layer (say from an ssd) and technically perform inference with ~8gb of RAM, but it'd be really really slow.
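Back-of-the-envelope, with rough assumptions for quantization and overhead (not measurements):

```
# Rough RAM estimate for an ~80B-parameter model at different quantizations.
params_billion = 80

for name, bytes_per_param in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    weights_gb = params_billion * bytes_per_param
    # context / KV cache / graph / buffers: assume roughly 20-50% on top of the weights
    print(f"{name}: ~{weights_gb:.0f} GB weights, "
          f"~{weights_gb * 1.2:.0f}-{weights_gb * 1.5:.0f} GB total")
```

That reproduces the ~50GB (tight, Q4) to ~200GB (full precision) range mentioned above.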
It's really not that much code, though, and all the actual capabilities are there as of about mid this year. I think someone will make this work and it will be a huge efficiency for the right model/workflow combinations (effectively, being able to run 1T parameter MoE models on GB200 NVL4 at "full speed" if your workload has the right characteristics).
LM Studio defaults to 12/36 layers on the GPU for that model on my machine, but you can crank it to all 36 on the GPU. That does slow it down but I'm not finding it unusable and it seems like it has some advantages - but I doubt I'm going to run it this way.
What actually happens is you run some or all of the MoE layers on the CPU from system RAM. This can be tolerable for smaller MoE models, but keeping it all on the GPU will still be 5-10x faster.
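The 5-10x gap falls out of memory bandwidth, roughly; here's a ballpark sketch with made-up but plausible bandwidth numbers:

```
# Token generation is roughly memory-bandwidth-bound: each token streams the
# active weights through the chip once. All numbers below are ballpark guesses.
active_weights_gb = 2.0                       # ~3B active params at ~Q4-ish precision

for name, bandwidth_gb_per_s in [("dual-channel DDR4", 50),
                                 ("dual-channel DDR5", 90),
                                 ("consumer GPU VRAM", 500)]:
    print(f"{name}: ~{bandwidth_gb_per_s / active_weights_gb:.0f} tokens/s upper bound")
```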
I'm guessing LM Studio gracefully falls back to running _something_ on the CPU. Hopefully you are running only the MoE parts on the CPU. I've only ever used llama.cpp.
KV Cache in GPU and 36/36 layers in GPU: CPU usage under 3%.
KV Cache in GPU and 35/36 layers in GPU: CPU usage at 35%.
KV Cache moved to CPU and 36/36 layers in GPU: CPU usage at 34%.
I believe you that it doesn't make sense to do it this way, it is slower, but it doesn't appear to be doing much of anything on the CPU.
You say gigabytes of weights PER TOKEN, is that true? I think an expert is about 2 GB, so a new expert is 2 GB, sure - but I might have all the experts for the token already in memory, no?
I don't know how LM Studio works. I only know the fundamentals. There is no way it's sending experts to the GPU per token. Also, the CPU doesn't have much work to do. It's mostly waiting on memory.
- Prompt processing 65k tokens: 4818 tokens/s
- Token generation 8k tokens: 221 tokens/s
If I offload just the experts to run on the CPU I get:
- Prompt processing 65k tokens: 3039 tokens/s
- Token generation 8k tokens: 42.85 tokens/s
As you can see, token generation is over 5x slower. This is only using ~5.5GB VRAM, so the token generation could be sped up a small amount by moving a few of the experts onto the GPU.
And it appears like it's thinking about it! /s
The APIs are not subsidized, they probably have quite the large margin actually: https://lmsys.org/blog/2025-05-05-large-scale-ep/
> Why would you pay OpenAI when you can host your own hyper efficient Chinese model
The 48GB of VRAM or unified memory required to run this model at 4bits is not free either.