That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic
I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
Another model that deserves mention is DeepSeek V2.5 (which has far fewer params than V3/R1) - but it still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB of VRAM), and a kind soul recently did exactly that: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...
DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.
Type IQ2_XXS / 183GB, 16k context:
CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.
CPU + NVIDIA RTX (70GB VRAM): 4.74 t/s for PP and 1.87 t/s for response.
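For context on why the response numbers sit in the low single digits: decode is memory-bandwidth bound, so a crude upper bound is bandwidth divided by the weight bytes touched per token. A quick sketch - the 183GB figure is from the post above, everything else is an assumption for illustration, not a measurement from this setup:

    # Crude upper bound on decode speed for a bandwidth-bound MoE model.
    # Numbers below are assumptions for illustration.
    active_params   = 37e9    # assume a DeepSeek R1/V3-style MoE with ~37B params active per token
    bits_per_weight = 2.2     # roughly what a ~183GB quant of a ~671B model works out to
    bytes_per_token = active_params * bits_per_weight / 8   # ~10GB of weights read per token

    for label, bw in [("dual-channel DDR4, ~40 GB/s", 40e9),
                      ("8-channel DDR5 Xeon, ~300 GB/s", 300e9)]:
        print(f"{label}: <= {bw / bytes_per_token:.1f} tokens/s")

Real throughput lands well below that bound (expert churn, framework overhead, and prompt processing all eat into it), but it explains the general ballpark.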
I wish Unsloth would produce a similar quantization for DeepSeek V3 - it would be more useful, since it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.
> I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
I run the Q2_K_XL and it's perfectly good for me. Where it lacks vs FP8 is in creative writing. If you prompt it for a story a few times, then compare with FP8, you'll see what I mean.
For coding, the 1.58-bit quant clearly makes more errors than the Q2_XXS and Q2_K_XL.
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.
I’m guessing it’s under 10k.
I also didn’t see tokens per second numbers.
It’s a gimmick and not a real solution.
ChatGPT o3-mini-high thinks at about 140 tokens/s by my estimation, and I sometimes wish it could return answers quicker.
Getting a simple prompt answer would take 2-3 minutes using the AMD system, and forget about longer context.
It's the same thing here. CPUs can run it but only as a gimmick.
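To put rough numbers on that, here's the arithmetic; the token counts are my own assumptions about a typical reasoning trace, not measurements:

    # Rough latency for a reasoning-model answer at different decode speeds.
    # Token counts are assumptions for illustration.
    thinking_tokens = 1200   # hypothetical chain-of-thought length
    answer_tokens   = 300    # hypothetical visible answer length

    for label, tps in [("hosted o3-mini-class, ~140 t/s", 140),
                       ("local CPU/APU setup, ~8 t/s", 8)]:
        minutes = (thinking_tokens + answer_tokens) / tps / 60
        print(f"{label}: ~{minutes:.1f} minutes")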
No, that's not true.
I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM / memory bandwidth than compute.
A crappy mid-range 2022 Pixel Fold Android CPU gets you roughly the same speed as the 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on.
Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.
The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"
Additionally, the HN headline includes "1 or 2 Arc A770"
The A770 has 16GB of RAM. You're shuffling data to the GPU at a rate of 64GB/s, which is roughly an order of magnitude slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory-bandwidth constrained.
However, once you want to use it to do anything useful like a longer context size, the CPU compute will be a huge bottleneck for time-to-first-token as well as tokens/s.
Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
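To make that concrete, here's a sketch of the per-token cost when most of the touched weights have to come over the 64GB/s link instead of sitting in VRAM; the resident fraction and byte counts are assumptions for illustration:

    # Per-token cost of streaming non-resident weights over the host link vs.
    # reading weights already resident in VRAM. Illustrative numbers only.
    bytes_per_token = 10e9    # assume ~37B active params at ~2.2 bits/weight
    vram_resident   = 0.15    # assume ~15% of the touched weights fit in 16GB of VRAM
    link_bandwidth  = 64e9    # the 64GB/s figure from the comment above
    vram_bandwidth  = 560e9   # A770 VRAM bandwidth, roughly

    t_link = bytes_per_token * (1 - vram_resident) / link_bandwidth
    t_vram = bytes_per_token * vram_resident / vram_bandwidth
    print(f"streamed over the link: ~{t_link*1000:.0f} ms/token")
    print(f"read from VRAM:         ~{t_vram*1000:.0f} ms/token")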
#1, should highlight it up front this time: We are talking about _G_PUs :)
#2 You can't get a single consumer GPU that has enough memory to load a 670B parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.
TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to be loading a 670B parameter model on only one or two cards
2) The M3 Ultra can load DeepSeek R1 671B at Q4.
Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.
Can you share which LLMs you run on such small devices and what use case they address?
(Not a rhetorical question; it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)
#1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.
#2) The "this" in "I put a lot of effort into this", means, I left Google 1.5 years ago and have been quietly building an app that is LLM-agnostic in service of coalescing a lot of nextgen thinking re: computing I saw that's A) now possible due to LLMs B) was shitcanned in 2020, because Android won politically, because all that next-gen thinking seemed impossible given it required a step change in AI capabilities.
This app is Telosnex (telosnex.com).
I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.
I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and I believe that on a long enough timeline, all the rushed hacking people have done to llama.cpp to get model after model supported will give way to UX improvements.
You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.
However, things did pay off recently. How?
#1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implementing what each model needs individually for formatting.
#2) Phi-4 mini came out. Long story, but tl;dr: till now there have been various gaping flaws with each Phi release. This one looked absent of any issues. So I hacked support for its tool vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model smaller than Mixtral 8x7B that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions I give it) and tool calls (i.e. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)
https://github.com/intel/ipex-llm/tree/main?tab=readme-ov-fi...
Any idea how many Arcs it takes to match an H100?
The only thing which doesn't work well is running on iGPUs. It might work but it's very unstable.
Huh? The largest-VRAM card that Intel has is the A770, which is around $350. What exactly are you trying to compare against? Are you doing inference only, or training?
DDR4 UDIMM is up to 32GB/module
DDR5 UDIMM is up to 64GB/module[0]
non-Xeon M/B has up to 4 UDIMM slots
-> non-Xeon is up to 128GB/256GB per node
Server motherboards have as many as 16 DIMM slots per socket with RDIMM/LRDIMM support, which allows more modules as well as higher-capacity modules to be installed.
[0]: there was a 128GB UDIMM launch at peak COVID
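The arithmetic behind those ceilings, as a quick sanity check (module sizes as above; slot counts are typical, not tied to any one board):

    # Max RAM per node = DIMM slots x largest supported module.
    configs = {
        "desktop, DDR4 UDIMM (4 x 32GB)":       4 * 32,
        "desktop, DDR5 UDIMM (4 x 64GB)":       4 * 64,
        "server, 16 RDIMM slots x 64GB":       16 * 64,
        "server, dual socket, 16 x 64GB each":  2 * 16 * 64,
    }
    for name, gb in configs.items():
        print(f"{name}: {gb} GB")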
Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.
You also want the capability for more than one card running at full x16 PCI-Express 3.0 speed at a minimum, which means you need enough PCI-E lanes, which you aren't going to find on a single-socket Intel workstation motherboard.
Here are a couple of somewhat randomly chosen, affordably priced examples with 512GB of RAM. They'll be power hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common in quantity, so I'm using them as a baseline for specification vs. price. Configurations will be something like 16 x 32GB DDR4 DIMMs.
https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...
https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...
https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...
Commonly you will also find configurations with two or three 'low profile' PCI-Express slots, which have a different card height than the 'standard' height that most GPUs are built at.
That works for crypto because all the CPU sends to the card is the target ledger sha256sum (simplified), and the GPU generates nonces until `sha256sum(sha256sum(nonce + ledger_sum))` has however many zeros in front. So until a card finds the correct nonce, or the server sends "new work" (a new ledger shasum), there's no real traffic between the GPU and the CPU. Housekeeping, whatever, but not like 1GB/s!
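A toy version of that loop, just to show how little crosses the host-device boundary; real miners hash the block header on the GPU, and the names here are made up:

    import hashlib, itertools

    def double_sha256(data: bytes) -> bytes:
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def mine(ledger_sum: bytes, difficulty_zero_bytes: int = 2) -> int:
        # The host ships `ledger_sum` (a few dozen bytes) to the device once;
        # everything after that is local hashing until a winning nonce turns up.
        for nonce in itertools.count():
            digest = double_sha256(nonce.to_bytes(8, "little") + ledger_sum)
            if digest.startswith(b"\x00" * difficulty_zero_bytes):
                return nonce

    print(mine(b"example ledger checksum"))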
But it's likely never going to work: too many driver, compatibility, kernel-development, and power issues, to name a few. Probably cheaper in the end to just go buy a 5090 and rant about CUDA.
That's Nvidia's current MO. There's more demand for GPUs for AI than there are GPUs available, and most of that demand still has stupid amounts of money behind it (being able to get grants, loans or investment based on potential/hype) - money that can be captured by GPU vendors. Unfortunately, VRAM is the perfect discriminator between "casual" and "monied" use.
(This is not unlike the "SSO tax" - single sign-on is pretty much the perfect discriminator between "enterprise use" and "not enterprise use".)
More than twice as fast as Nvidia 4090 for AI.
Launched last week.
Not in memory bandwidth, which is all that matters for LLM inference.
Anyone have a link to this one?
This article is basically Intel saying "remember us, we made a GPU!" And they make great budget cards, but the ecosystem is just so far behind.
Honestly this is not something you can really do on a budget.
Why buy an overpriced Nvidia 4090 when you can get an AMD Strix Halo APU with 128GB or an Apple M3 Studio with 512GB of RAM?
Nvidia has kept prices high and performance low for as long as it can and finally competition is here.
Even Intel can make APUs with tons of RAM.
Nvidia hopefully is squirming.
Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...
My code is always perfect in my own eyes until someone else sees it.
But a portable (no install) way to run llama.cpp on intel GPUs is really cool.
Requirements:
380GB CPU Memory
1-8 ARC A770
500GB Disk
I think changing the end of headline to "Xeon w/380GB RAM" would stop it from being incorrect and misleading.
More GPUs let you keep more of the experts resident on GPU at a time.
Edit: but what you added in your edit is right; it would be more accurate to append the system RAM requirement.
You might still be right, since I have not confirmed that the selected experts change infrequently during prompt processing / token generation, and someone could have botched the headline. However, treating DeepSeek like Llama 3 when reasoning about VRAM requirements is not necessarily correct.
The reason they used a Xeon is memory channels. Non-server CPUs have only 2, but modern Xeons have 8 to 12 depending on generation/type. And the Xeons with the most channels are the most $$$$, to the point that it ends up cheaper to just get a GPU or dedicated accelerator.
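Rough numbers on why the channel count matters; the DDR speeds below are typical figures, not tied to the specific Xeon in the post:

    # Theoretical memory bandwidth = channels x transfer rate x 8 bytes per transfer.
    def bandwidth_gbs(channels: int, mt_per_s: int) -> float:
        return channels * mt_per_s * 8 / 1000

    print(bandwidth_gbs(2, 3200),  "GB/s - desktop, 2-channel DDR4-3200")
    print(bandwidth_gbs(8, 3200),  "GB/s - older Xeon, 8-channel DDR4-3200")
    print(bandwidth_gbs(12, 4800), "GB/s - newer Xeon, 12-channel DDR5-4800")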
It's a bit less exciting when you see they're just talking about offloading parts from the large amount of DRAM.
Also, LM Studio lets you run smaller draft models in front of larger ones (speculative decoding), so I could see having a few GPUs in front really speeding up R1 inference.
DeepSeek employs multi-token prediction which enables self-speculative decoding without needing to employ a separate draft model. Or at least that's what I understood the value of multi-token prediction to be.
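If that's right, the decode loop looks roughly like the usual draft-then-verify pattern, just with the extra prediction head playing the draft role. A toy sketch; the function names are placeholders, not DeepSeek's actual implementation, and real systems verify all drafted tokens in one batched pass:

    import random
    random.seed(0)

    VOCAB = list("abcdef")

    def main_model_next(ctx):        # stand-in for a full forward pass of the big model
        return random.choice(VOCAB)

    def mtp_draft(ctx, k=2):         # stand-in for the cheap multi-token-prediction head
        return [random.choice(VOCAB) for _ in range(k)]

    def generate(ctx, steps=8):
        out = list(ctx)
        for _ in range(steps):
            for tok in mtp_draft(out):        # cheap guesses from the extra head
                verified = main_model_next(out)
                if verified == tok:
                    out.append(tok)           # accepted: the drafted token came "for free"
                else:
                    out.append(verified)      # rejected: keep the main model's token, stop
                    break
        return "".join(out)

    print(generate("ab"))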
Two GPUs or more mean you can start to "keep" one or more of the experts hot on a GPU as well.
Ktransformers has a document about using CPU + a single 4090D to reach decent tokens/s but I'm not sure how much of the perf is due to the 4090D vs other optimizations/changes for the CPU side https://github.com/kvcache-ai/ktransformers/blob/main/doc/en... The final step of going to 6 experts instead of 8 feels like cheating (not a lossless optimization).
K experts (K=8 for these models, though you can customize that if you want) out of 256 per layer are activated at a time. The 256 comes from the model file; it's just how many they chose to build it with. In these models there is also 1 shared expert per layer which is always active. The router picks which K routed experts to use on each forward pass, and then a gating mechanism combines the outputs. If you sum, across all layers, the 1 shared expert + K routed experts + the router + the attention/output networks, you end up with ~37B parameters active per token, not per layer. The individual per-layer experts are therefore much smaller than that, on the order of tens of millions of parameters each.
Or, for the short answer: "37B is the active parameters of 9 experts per layer, summed across layers, plus 'overhead', not the parameters of a single expert".
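Rough arithmetic behind that, using DeepSeek-V3 config values as I remember them; treat the exact figures as approximate:

    # Approximate active parameters per token for a DeepSeek-V3/R1-style MoE.
    # Config numbers are from memory and rounded; the structure is the point, not precision.
    hidden        = 7168
    moe_inter     = 2048        # per-expert FFN width
    moe_layers    = 58          # of 61 layers, the first 3 use a dense FFN instead
    routed_active = 8
    shared        = 1

    params_per_expert = 3 * hidden * moe_inter           # gate/up/down projections, ~44M
    active_ffn   = moe_layers * (routed_active + shared) * params_per_expert
    total_routed = moe_layers * 256 * params_per_expert  # all routed experts, for scale

    print(f"one expert:           ~{params_per_expert / 1e6:.0f}M params")
    print(f"active FFN per token: ~{active_ffn / 1e9:.1f}B")
    print(f"all routed experts:   ~{total_routed / 1e9:.0f}B of the 671B total")
    # Attention, the dense layers, embeddings, and routers make up the rest of the
    # quoted ~37B active parameters.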