296 points by colorant 4 days ago | 22 comments
  • hnfong3 days ago
    As other commenters have mentioned, the performance of this setup is probably not great, since there isn't enough VRAM and a lot of data has to be moved between CPU and GPU RAM.

    That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic

    I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.

    Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1), but it still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this was recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...

    DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.
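
    For anyone who wants to poke at one of these aggressive quants, here is a minimal sketch using the llama-cpp-python bindings with partial GPU offload; the model path, layer count, and context size are illustrative placeholders, not settings from this thread:

      # Sketch: load an aggressively quantized GGUF and offload what fits into VRAM.
      # Assumes llama-cpp-python built with a GPU backend; all numbers are placeholders.
      from llama_cpp import Llama

      llm = Llama(
          model_path="DeepSeek-V2.5-IQ2_XXS.gguf",  # hypothetical local file
          n_ctx=4096,        # context window; larger contexts need more memory
          n_gpu_layers=20,   # layers offloaded to the GPU; the rest stay on the CPU
      )

      out = llm("Summarize mixture-of-experts routing in two sentences.", max_tokens=128)
      print(out["choices"][0]["text"])

    The practical knob is n_gpu_layers: whatever doesn't fit in VRAM runs on the CPU, and that CPU portion is what drags the tokens/s down.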

    • SlavikCA3 days ago
      I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (only about half the memory channels populated, so not optimal).

      Type IQ2_XXS / 183GB, 16k context:

      CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.

      CPU + NVIDIA RTX 70GB VRAM: 4.74 t/s for PP and 1.87 t/s for response.

      I wish Unsloth would produce a similar quantization for DeepSeek V3; it would be more useful, as it doesn't need reasoning tokens, so even at the same t/s it would be faster overall.

    • idonotknowwhy3 days ago
      Thanks a lot for the v2.5! I'll give that a whirl. Hopefully it's as coherent as v3.5 when quantized so small.

      > I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.

      I run the Q2_K_XL and it's perfectly good for me. Where it lacks vs FP8 is in creative writing. If you prompt it for a story a few times, then compare with FP8, you'll see what I mean.

      For coding, the 1.58-bit clearly makes more errors than the Q2_XXS and Q2_K_XL.

    • colorant3 days ago
      Currently >8 token/s; there is a demo in this post: https://www.linkedin.com/posts/jasondai_run-671b-deepseek-r1...
    • pinoy4203 days ago
      [dead]
  • colorant4 days ago
    https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

    Requirements (>8 token/s):

    380GB CPU Memory

    1-8 ARC A770

    500GB Disk

    • colorant4 days ago
      • aurareturn3 days ago
        CPU inference is both bandwidth and compute constrained.

        If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.

        • colorant3 days ago
          Prompt length mainly impacts prefill latency (TTFT, time to first token), not the decoding speed (TPOT, time per output token)
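
          A toy roofline estimate of why the two numbers behave differently (all hardware figures below are made-up placeholders, not measurements of this setup):

            # Prefill is roughly compute-bound; decode is roughly bandwidth-bound.
            active_params = 37e9      # active parameters per token for DeepSeek R1
            bytes_per_param = 0.55    # ~4.4 bits/param for a Q4-ish quant
            mem_bw = 300e9            # bytes/s the active weights stream at (placeholder)
            compute = 30e12           # sustained FLOP/s during prefill (placeholder)

            prompt_tokens = 2000
            ttft = 2 * active_params * prompt_tokens / compute   # prefill time
            tpot = active_params * bytes_per_param / mem_bw      # per-token decode time

            print(f"TTFT ~ {ttft:.1f} s for {prompt_tokens} prompt tokens")
            print(f"TPOT ~ {tpot:.2f} s (~{1 / tpot:.1f} tok/s decode)")

          Prompt length only enters the first term, which is the point being made here; the complaint in the reply below is that this first term can still be painfully large.
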
          • moffkalast3 days ago
            Decoding speed won't matter one bit if you have to sit there for 5 minutes waiting for the model to ingest a prompt that's two sentences long.
    • GTP3 days ago
      > 1-8 ARC A770

      To get more than 8 t/s, is one Intel Arc A770 enough?

      • colorant3 days ago
        Yes, but the context length will be limited due to the VRAM constraint
    • faizshah3 days ago
      Anyone got a rough estimate of the cost of this setup?

      I’m guessing it’s under 10k.

      I also didn’t see tokens per second numbers.

      • ynniv3 days ago
        • aurareturn3 days ago
          This article keeps getting posted but it runs a thinking model at 3-4 tokens/s. You might as well take a vacation if you ask it a question.

          It’s a gimmick and not a real solution.

          • hnuser1234563 days ago
            If you value local compute and don't need massive speed, that's still twice as fast as most people can type.
            • aurareturn3 days ago
              Human typing speed is orders of magnitude slower than our eyes scanning for the correct answer.

              ChatGPT o3-mini-high thinks at about 140 tokens/s by my estimation, and I sometimes wish it could return answers quicker.

              Getting an answer to a simple prompt would take 2-3 minutes on the AMD system, and forget about longer contexts.

            • evilduck3 days ago
              Reasoning models spend a whole bunch of time reasoning before returning an answer. I was toying with QwQ 32B last night and ran into one question where it spent 18 minutes at 13 tok/s in the <think> phase before returning a final answer. I value local compute, but reasoning models aren't terribly feasible at this speed, since you don't really need to see the first 90% of their thinking output.
          • miklosz3 days ago
            Exactly! I run it on my old Dell T7910 workstation (2x 2697A v4, 640GB RAM) that I built for way less than $1k. But so what, it's about ~2 tokens/s. Just like you said, it's cool that it runs at all, but that's it.
          • walrus013 days ago
            It's meant to be a test/development setup for people to prepare the software environment and tooling for running the same on more expensive hardware. Not to be fast.
            • aurareturn3 days ago
              I remember people trying to run the game Crysis using CPU rendering. They got it to run and move around. People did it for fun and the "cool" factor. But no one actually played the game that way.

              It's the same thing here. CPUs can run it but only as a gimmick.

              • refulgentis3 days ago
                > It's the same thing here. CPUs can run it but only as a gimmick.

                No, that's not true.

                I work on local inference code via llama.cpp, on both GPU and CPU on every platform, and the bottleneck is much more RAM/bandwidth than compute.

                A crappy mid-range 2022 Android CPU in a Pixel Fold gets you roughly the same speed as a 2024 Apple iPhone GPU, with Metal acceleration that dozens of very smart people hack on.

                Additionally, and perhaps more importantly, Arc is a GPU, not a CPU.

                The headline of the thing you're commenting on, the very first thing you see when you open it, is "Run llama.cpp Portable Zip on Intel GPU"

                Additionally, the HN headline includes "1 or 2 Arc 7700"

                • aurareturn3 days ago
                  It's both compute and bandwidth constrained - just like trying to run Crysis on CPU rendering.

                  The A770 has 16GB of RAM. You're shuffling data to the GPU at a rate of 64GB/s, which is far slower than the GPU's internal VRAM bandwidth. Hence, this setup is memory-bandwidth constrained.

                  However, once you want to use it to do anything useful like a longer context size, the CPU compute will be a huge bottleneck for time-to-first-token as well as tokens/s.

                  Trying to run a model this large, and a thinking one at that, on CPU RAM is a gimmick.
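
                  To put a rough number on that shuffling (the 64 GB/s figure is from above; the rest are round-number assumptions, not measurements):

                    # Toy estimate of the transfer cost when expert weights miss in VRAM.
                    pcie_bw = 64e9          # bytes/s host -> GPU, figure quoted above
                    active_params = 37e9    # parameters actually used per token
                    bytes_per_param = 0.55  # ~Q4-ish quantization
                    miss_fraction = 0.9     # assumed share of weights not already on the GPU

                    seconds_per_token = active_params * bytes_per_param * miss_fraction / pcie_bw
                    print(f"{seconds_per_token:.2f} s/token just moving weights")  # ~0.29 s, <4 tok/s
                    # ...before any compute happens, which is why expert reuse and doing
                    # most of the work out of DRAM matters so much here.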

                  • refulgentis3 days ago
                    Okay, let's stipulate LLMs are compute and bandwidth sensitive (of course!)...

                    #1, should highlight it up front this time: We are talking about _G_PUs :)

                    #2 You can't get a single consumer GPU with enough memory to load a 670B-parameter model, so there's some magic going on here. It's notable and distinct. This is probably due to FlashMoE, given its prominence in the link.

                    TL;DR: 1) these are Intel _G_PUs, and 2) it is a remarkable, distinct achievement to be loading a 670B-parameter model on only one to two cards

                    • aurareturn3 days ago
                      1) This system mostly uses normal DDR RAM, not GPU VRAM.

                      2) M3 Ultra can load Deepseek R1 671B Q4.

                      Using a very large LLM across the CPU and GPU is not new. It's been done since the beginning of local LLMs.

                • xoranth3 days ago
                  > Crappy Pixel Fold 2022 mid-range Android CPU

                  Can you share what LLMs you run on such small devices and what use cases they address?

                  (Not a rhetorical question; it's just that I see a lot of work on local inference for edge devices with small models, but I could never get a small model to work for me. So I'm curious about other people's use cases.)

                  • refulgentis3 days ago
                    Excellent and accurate question. You sound like the first person I've talked to who might appreciate the full exposition here; apologies if this is too much info. TL;DR: you're definitely not missing anything, and we're just beginning to turn a corner and see some rays of hope where local models are a genuine substitute for remote models in consumer applications.

                    #1) I put a lot of effort into this and, quite frankly, it paid off absolutely 0 until recently.

                    #2) The "this" in "I put a lot of effort into this" means: I left Google 1.5 years ago and have been quietly building an LLM-agnostic app, in service of coalescing a lot of next-gen thinking about computing that I saw, thinking that is A) now possible due to LLMs and B) was shitcanned in 2020, because Android won politically and all that next-gen thinking seemed impossible given it required a step change in AI capabilities.

                    This app is Telosnex (telosnex.com).

                    I have a couple of stringent requirements I enforce on myself: it has to run on every platform, and it has to support local LLMs just as well as paid ones.

                    I see that as essential for avoiding continued algorithmic capture of the means of info distribution, and I believe that, on a long enough timeline, all the rushed hacking people have done on llama.cpp to get model after model supported will give way to UX improvements.

                    You are completely, utterly, correct to note that the local models on device are, in my words, useless toys, at best. In practice, they kill your battery and barely work.

                    However, things did pay off recently. How?

                    #1) llama.cpp landed a significant opus of a PR by @ochafik that normalized tool handling across models, as well as implemented what the models need individually for formatting

                    #2) Phi-4 mini came out. Long story, but tl;dr: until now there have been various gaping flaws with each Phi release. This one looked free of any issues. So I hacked support for its tool vagaries on top of what @ochafik landed, and all of a sudden I'm seeing the first local model below Mixtral 8x7B that reliably handles RAG flows (i.e. generate a search query, then accept 2K tokens of parsed web pages and answer a question following the directions given) and tool calls (i.e. generate a search query, or file operations like here: https://x.com/jpohhhh/status/1897717300330926109)

        • utopcell3 days ago
          What a teaser article! All this info for setting up the system, but no performance numbers.
          • yvdriess3 days ago
            That's because the OP is linking to the quickstart guide. There are benchmark numbers on the GitHub repo's root page, but they don't appear to include the new DeepSeek yet:

            https://github.com/intel/ipex-llm/tree/main?tab=readme-ov-fi...

            • utopcell3 days ago
              Am I missing something? I see a lot of results for the small-scale models but no results for DeepSeek-R1-671B-Q4_K_M in their GitHub repos.
  • jamesy0ung4 days ago
    What exactly does the Xeon do in this situation, is there a reason you couldn't use any other x86 processor?
    • VladVladikoff4 days ago
      I think it's that most non-Xeon motherboards don't have the memory channels to support this much memory with any sort of commercially viable DIMMs.
      • genewitch4 days ago
        PCIe lanes
        • hedora3 days ago
          I was about to correct you because this doesn't use PCIe for anything, and then I realized Arc was a GPU (and they support up to 8 per machine).

          Any idea how many Arc's it takes to match an H100?

          • npodbielski3 days ago
            I read about multi-GPU setups from time to time, and the last time I found some real-life information about this (two 7900 XTX cards), the result was that performance was the same at best and often slower. So even if you manage to slap like 8 cheap cards onto a motherboard, even if you somehow make it work (people have problems with such setups), and even if it runs continuously without many problems (crashes, power consumption), performance would be just OK. I am not sure spending $10k on such a setup would be better than buying a $10k card with 40GB of RAM.
            • pshirshov3 days ago
              Ollama works fine with multi-GPU setups. Since ROCm 6.3 everything is stable, and you can mix different GPU generations. The performance is good enough for the models to be useful.

              The only thing which doesn't work well is running on iGPUs. It might work but it's very unstable.

              • npodbielski3 days ago
                Good to know. Still, is it a viable option? Buying, say, an AMD Threadripper for $2.5k, a motherboard and RAM for $2k, and 4 GPUs for $4k, to get a total of 96GB of VRAM? The total should be around $10k, which is roughly the price of an Intel GPU specifically for AI, if I am not mistaken? Which option would be better performance-wise? I never saw a comparison anywhere, and this is too much money for a fun weekend experiment.
                • zargon3 days ago
                  > 10k$ which is roughly price of Intel GPU specifically for AI

                  Huh? The largest-VRAM card that Intel has is the A770, which is around $350. What exactly are you trying to compare against? Are you doing inference only, or training?

                  • genewitch3 days ago
                    It can be read as a typo for "Nvidia", as their cards are about that price (or more, I haven't looked).
    • numpad03 days ago

        DDR4 UDIMM is up to 32GB/module  
        DDR5 UDIMM is up to 64GB/module[0]  
        non-Xeon M/B has up to 4 UDIMM slots 
        -> non-Xeon is up to 128GB/256GB per node  
      
      Server motherboards have as many as 16 DIMM slots per socket with RDIMM/LRDIMM support, which allows more modules as well as higher capacity modules to be installed.

      [0]: there was a 128GB UDIMM launch at peak COVID

    • walrus014 days ago
      There's not much else (other than Epyc) in the way of affordably priced motherboards that have enough cumulative RAM. You can buy a used dual-socket Dell server with older Xeon CPUs and 512GB of RAM for test/development purposes for not very much money.

      Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.

      You also want the capability for more than one full-speed card (PCI-Express 3.0 x16 at minimum), which means you need enough PCIe lanes, and you aren't going to find those on a single-socket Intel workstation motherboard.

      Here are a couple of somewhat randomly chosen, affordably priced examples with 512GB of RAM. They'll be power hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common in quantity, so I'm using them as a baseline for specification vs price. Configurations will be something like 16 x 32GB DDR4 DIMMs.

      https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...

      https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...

      https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...

      • numpad03 days ago
        The PowerEdge R series is significantly cheaper if you already have ear protection
        • walrus013 days ago
          Yes, an R730 or R740 for instance. There are lots of used R630s and R640s with 512GB of RAM as well, but a 1U server is not the best thing to try putting gaming-style PCI-Express video cards into.
          • genewitch3 days ago
            I imagine you can't use the PCIe x1 risers for inference workloads like you can for crypto mining, since in crypto there's essentially no data going between the card and the CPU, and I guess inference is heavily bandwidth restricted. Unfortunate!
            • walrus012 days ago
              The PCIe slots in something like a 1U R630 or R640 are x8 or x16; the problem is more with the cooling and the size/shape of the cards and how they cool themselves. The slots are meant to be used with things like 10 or 100 Gbps network cards or SAS/SATA host adapters, which are considerably lower wattage than even a video card that's much weaker than an Intel A770.

              Commonly you will also find configurations with two or three 'low profile' PCI-Express slots, which take a different card height than the 'standard' height that most GPUs are built at.

              • genewitch2 days ago
                No, I mean like this: https://www.youtube.com/shorts/rTCInAXSzKA

                That works for crypto because all the CPU sends to the card is the target ledger sha256sum (simplified), and the GPU generates nonces until `sha256sum(sha256sum(nonce + ledger sum))` has however many zeros in front. So until a card finds the correct nonce, or the server sends "new work" (a new ledger shasum), there's no real traffic between the GPU and the CPU. Housekeeping, whatever, but nothing like 1GB/s!
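
                A stripped-down sketch of that loop, just to show how little has to cross the bus per attempt (simplified in the same way as the description above, not a real mining protocol):

                  import hashlib

                  def grind(work: bytes, zero_bytes: int) -> int:
                      """Try nonces until sha256(sha256(work + nonce)) starts with enough zero bytes."""
                      nonce = 0
                      while True:
                          digest = hashlib.sha256(
                              hashlib.sha256(work + nonce.to_bytes(8, "little")).digest()
                          ).digest()
                          if digest.startswith(b"\x00" * zero_bytes):
                              return nonce       # report the winning nonce back to the server
                          nonce += 1             # otherwise only the local counter changes

                  # The work unit only changes when the server pushes new work, so
                  # host<->GPU traffic stays tiny, unlike weight streaming for LLMs.
                  print(grind(b"example-work-unit", 2))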

            • numpad02 days ago
              There are Xeon-based crypto motherboards on AliExpress with ~five x8 slots; I do sometimes wonder if those would work. NVIDIA Tesla K80/P40 24GB cards are on eBay for a minimum of $15 apiece. 120GB of VRAM under $500 or so. Or maybe you could theoretically do a 96GB-per-node cluster with one x8 link for a bottleneck-free interconnect.

              But it's likely never going to work: too many driver, compatibility, kernel-development, and power issues, to name a few. Probably cheaper in the end to just buy a 5090 and rant about CUDA.

              • genewitch2 days ago
                I had a couple of those boards. The full-size slots aren't actually x8, because the CPUs those boards support only have 24 PCIe lanes, and over half of them are just running USB, SATA, etc. The full-size slots are there so you can secure cards to the board, instead of running them zip-tied to metal dish racks (like in the video I linked in my reply to your sibling comment: https://www.youtube.com/shorts/rTCInAXSzKA)
                • numpad02 days ago
                  No, there are versions of those with recycled Xeon E5s and a bunch of x8 (or so advertised) slots, unlike most LGA115x mining boards.
                  • genewitch2 days ago
                    Sorry, I completely missed "Xeon" in your comment.
  • Gravityloss3 days ago
    I'm sure this question has been asked before, but why not launch a GPU with more but slower RAM? That would fit bigger models while staying affordable...
    • TeMPOraL3 days ago
      What would you need it for? Not gaming, for sure. AI, you say? Then fork over the cash.

      That's Nvidia's current MO. There's more demand for GPUs for AI than there are GPUs available, and most of that demand still has stupid amounts of money behind it (being able to get grants, loans or investment based on potential/hype) - money that can be captured by GPU vendors. Unfortunately, VRAM is the perfect discriminator between "casual" and "monied" use.

      (This is not unlike the "SSO tax" - single sign-on is pretty much the perfect discriminator between "enterprise use" and "not enterprise use".)

    • ChocolateGod3 days ago
      Because then you would have less motivation to buy the more expensive GPUs.
      • antupis3 days ago
        Yeah, Nvidia doesn't have any incentive to do that, and AMD needs to get their shit together on the software side.
        • cheschire3 days ago
          This topic is about Intel Arc GPUs though
    • fleischhauf3 days ago
      They absolutely can build GPUs with more VRAM; they just don't have the competition to force them to. It's much more profitable this way.
    • andrewstuart3 days ago
      Did you miss the news about AMD Strix Halo?

      More than twice as fast as Nvidia 4090 for AI.

      Launched last week.

      • coolspot3 days ago
        > More than twice as fast as Nvidia 4090 for AI.

        Not in memory bandwidth, which is all that matters for LLM inference.

      • Gravityloss3 days ago
        I indeed was not aware, thanks
    • varelse3 days ago
      [dead]
  • yongjik3 days ago
    Did DeepSeek learn how to name their models from OpenAI?
    • vlovich1233 days ago
      The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
  • notum3 days ago
    Censoring of token/s values in the sample output surely means this runs great!
  • CamperBob24 days ago
    The article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual-Epyc workstation recipe that was popularized recently)?
  • mrbonner3 days ago
    I see there are a few options to run inference for LLMs and Stable Diffusion outside Nvidia: Intel Arc, Apple M-series, and now AMD Ryzen AI Max. Obviously, running on Nvidia would be the most optimal way. But given the scarcity of high-VRAM Nvidia cards at a reasonable price, I can't stop thinking about getting one that is not Nvidia. So, if I'm not interested in training or fine-tuning, would any of those solutions actually work? On a Linux machine?
    • 9999000009993 days ago
      If you actually want to seriously do this, go with Nvidia.

      This article is basically Intel saying "remember us, we made a GPU!" And they make great budget cards, but the ecosystem is just so far behind.

      Honestly this is not something you can really do on a budget.

  • andrewstuart3 days ago
    With the arrival of APUs for AI everyone is going to lose interest in GPUs real fast.

    Why buy an overpriced Nvidia 4090 when you can get an AMD Strix Halo or an Apple M3 Studio APU with 128GB or 512GB of RAM?

    Nvidia has kept prices high and performance low for as long as it can and finally competition is here.

    Even Intel can make APUs with tons of RAM.

    Nvidia hopefully is squirming.

  • 7speter4 days ago
    I've been following the progress Intel Arc support in PyTorch is making, at least on Linux, and it seems like, if things stay on track, we may see the first version of PyTorch with full Xe/Arc support by around June. I think I'm just going to wait until then instead of dealing with anything IPEX or OpenVINO.
    • colorant3 days ago
      This is based on llama.cpp
  • ryao4 days ago
    Where is the benchmark data?
  • DeathArrow3 days ago
    Any chance of using a couple of 3090s with DeepSeek and fitting the whole thing in video RAM? I'm thinking of something like a software or "fake" NVLink.
  • zamadatix4 days ago
    Since the Xeon alone could run the model in this setup, it'd be more interesting if they compared the performance uplift from using 0/1/2/.../8 Arc A770 GPUs.

    Also, it's probably better to link straight to the relevant section https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quic...

    • hmottestad4 days ago
      If you’re running just one GPU your context is limited to 1024 tokens, as far as I could tell. I couldn’t see what the context size is for more cards though.
    • colorant4 days ago
      Yes, you are right. Unfortunately HN somehow truncated my original URL link.
      • zamadatix4 days ago
        Sounds like submission "helper" tools are working about as well as normal :).

        Did you have the chance to try this out yourself or did you just run across it recently?

  • chriscappuccio3 days ago
    Better to run the Q8 model on an Epyc pair with 768GB; you'll get the same performance.
    • ltbarcly33 days ago
      The Q8 model is totally different?
      • manmal3 days ago
        My experience with quantization is that anything below Q6 is noticeably worse. Coherence suffers. I've rarely gotten anything really useful out of a Q4 model, code-wise. For transformations they are great, though, e.g. converting JSON to Markdown and vice versa.
        • ltbarcly32 days ago
          No, I mean the quantized versions of this model in particular have fewer parameters as well. They are almost different models.
        • yieldcrv3 days ago
          I like Q5

          The sweet spot for me

  • anacrolix3 days ago
    Now we just need a model that can actually code
    • ohgr3 days ago
      I'll settle for a much lower bar: an engineer who can tell that the code the model generates is shit.
      • brokegrammer3 days ago
        Most engineers can do that, because it's way easier to find flaws in code you didn't write than in code you wrote.

        My code is always perfect in my own eyes until someone else sees it.

        • ohgr3 days ago
          From experience, most engineers can do neither.
  • superkuh4 days ago
    No... this headline is incorrect. You can't do that. I think they've confused it with the performance of running one of the small distills into existing smaller models. Two Arc cards cannot fit a 4-bit k-quant of a 671B model.

    But a portable (no-install) way to run llama.cpp on Intel GPUs is really cool.

    • Cheer21714 days ago
      You don't have to go that far down the page to see it is paging to system RAM:

      Requirements:

          380GB CPU Memory
          1-8 ARC A770
          500GB Disk
      • superkuh4 days ago
        Yep. That's why the headline is incorrect. 380GB of the model sits in CPU system RAM and 32GB on some Arc GPUs. The ratio, 380/32, is obvious. Most of the processing is being done on the CPU. The GPUs are a little bit of icing in this context. Fast, sure, but they have to wait for the CPU layers (that's how layer splits work with llama.cpp).

        I think changing the end of the headline to "Xeon w/ 380GB RAM" would stop it from being incorrect and misleading.

        • ryao4 days ago
          What if it does not need to read from system RAM for every token by reusing experts whenever they just happen to be in VRAM from being used for the previous token? If the selected experts do not change often, this is doable on paper.
          • hmottestad4 days ago
            That’s probably the main performance benefit of using the GPU. If you’re changing the active expert for every single token then it wouldn’t be any faster than just running it on the CPU. Once you can reuse the active expert for two tokens you’re already going to be a lot faster than just the CPU.

            More GPUs let you keep more experts active at a time.

          • hexaga4 days ago
            Expert distribution should be approximately random token-by-token, so not likely.
          • superkuh2 days ago
            That's not how llama.cpp works. It's a layer split. The GPUs handle a few layers and the CPU handles the rest. The GPU layers, no matter how fast they complete, still have to wait on the CPU layers.
            • coloranta day ago
              The ipex-llm implementation extends llama.cpp and includes additional CPU-GPU hybrid optimizations for sparse MoE
        • Cheer21714 days ago
          "with" does not mean "entirely on"

          Edit: but what you added in your edit is right, it would be more accurate to append the system RAM requirement

    • ryao4 days ago
      It is theoretically possible. Each token only needs 37B parameters and if the same experts are chosen often, it would behave closer to a 37B model than a 671B model, since reusing experts can skip loads from system RAM.

      You might still be right, since I have not confirmed that the selected experts change infrequently during prompt processing / token generation, and someone could have botched the headline. However, treating DeepSeek like Llama 3 when reasoning about VRAM requirements is not necessarily correct.

      • hmottestad4 days ago
        If the same expert is chosen for two consecutive tokens then it’ll act like a 37B model running on the GPU for the second token since it doesn’t need to load that expert from the main RAM again.
      • superkuh4 days ago
        MoE is pretty enabling once you've spent all the extra $$$$ to stuff your server CPU memory channels with RAM so it's possible to run at all. But that's still a lot of money, which makes this a lot less novel or interesting than "just on 1~2 Arc A770" implies. Especially for the marginal performance that even 8-12 channels of CPU memory bandwidth gets you.
        • utopcell3 days ago
          Actually, 384GiB is already <$400 [1].

          [1] https://www.amazon.com/NEMIX-RAM-DDR4-2666MHz-PC4-21300-Redu...

          • superkuh3 days ago
            A slower, older-generation DDR4 Xeon system is unlikely to have been used by Intel for this benchmark. It's far more likely they used an expensive modern DDR5 Xeon with as many memory channels as they could get. Single-user LLM inference is memory-bandwidth bottlenecked. I just can't see Intel using old/deprecated hardware. And if someone other than Intel were to build a DDR4 Xeon system, it wouldn't reach the DDR5 tokens/s speeds reported here.

            The reason they used a Xeon is memory channels. Non-server CPUs only have 2, but modern Xeons have 8 to 12 depending on generation/type. And the Xeons with the most channels are the most $$$$, so it ends up cheaper to just get a GPU or dedicated accelerator.
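
            The channel arithmetic, using nominal JEDEC transfer rates (sustained bandwidth in practice is lower):

              # Peak DRAM bandwidth = channels x transfer rate (MT/s) x 8 bytes per transfer.
              def peak_bw_gb_s(channels: int, mt_per_s: int) -> float:
                  return channels * mt_per_s * 8 / 1e3

              print(peak_bw_gb_s(2, 4800))    # desktop dual-channel DDR5-4800: ~77 GB/s
              print(peak_bw_gb_s(8, 4800))    # 8-channel Xeon at DDR5-4800:   ~307 GB/s
              print(peak_bw_gb_s(12, 4800))   # 12-channel server platform:    ~461 GB/s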

        • utopcell3 days ago
          Is this amount of RAM really that expensive? 6x 64GiB DDR4 DIMMs are < $1,000.
    • rgbrgb4 days ago
      Yep, the title is inaccurate. It's a distill into Qwen 7B: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
      • zamadatix4 days ago
        The document contains multiple sections. The initial section does reference DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf as the example model, but if you continue reading you'll see a section on running DeepSeek-R1-Q4_K_M.gguf, plus claims that several other variations have been tested.

        It's a bit less exciting when you see they're just talking about offloading parts from the large amount of DRAM.

        • genewitch4 days ago
          So you thought there was some magical way to fit >600B parameters on a couple of GPUs?

          Also, LM Studio lets you run smaller models in front of larger ones, so I could see having a few GPUs in front really speeding up R1 inference.

          • zamadatix3 days ago
            I had also initially assumed the title was supposed to reference something new about running a distilled variant as well. When I finished reading through and found out the news was just that you can also do this sort of "split" setup with Intel gear too, it removed any further hope of excitement.

            DeepSeek employs multi-token prediction which enables self-speculative decoding without needing to employ a separate draft model. Or at least that's what I understood the value of multi-token prediction to be.

          • hmottestad4 days ago
            The MoE architecture allows you to keep the entire set of active parameters on a single GPU. If two consecutive tokens use the same expert, then the second token is going to be much faster.
            • genewitch3 days ago
              I understand all that; I am talking about a separate feature, possibly backported from or part of llama.cpp, where you have a small model that runs first and is checked by the large model. I've seen 30%+ speedups using, say, a 1.5B in front of a 15B.

              Two GPUs or more mean you can start to "keep" one or more of the experts hot on a GPU as well.

            • utopcell3 days ago
              What is the probability of that happening?
              • zamadatix3 days ago
                DeepSeek V3/R1 uses 8 routed experts out of 256, so not as often as one would like. That said, having even just a single GPU will greatly speed up prompt processing, which is worth it even if the decode speed were the same.

                KTransformers has a document about using a CPU + a single 4090D to reach decent tokens/s, but I'm not sure how much of the perf is due to the 4090D vs other optimizations/changes on the CPU side: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en... The final step of going to 6 experts instead of 8 feels like cheating (not a lossless optimization).
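
                A quick way to see "not as often as one would like": assuming the router behaves roughly like a uniform random draw (a simplification; real routing is input-dependent), the expected reuse between two consecutive tokens per MoE layer works out to:

                  # Overlap of two independent draws of k routed experts out of n.
                  k, n = 8, 256
                  expected_shared = k * k / n            # ~0.25 experts reused per layer

                  p_no_overlap = 1.0                     # P(no shared expert) = C(n-k, k) / C(n, k)
                  for i in range(k):
                      p_no_overlap *= (n - k - i) / (n - i)

                  print(f"expected shared routed experts per layer: {expected_shared:.2f}")
                  print(f"chance a layer reuses at least one expert: {1 - p_no_overlap:.2f}")

                This ignores the always-active shared expert, which by construction is reused on every token.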

                • genewitch3 days ago
                  Where does 256 come from? It's repeated here and elsewhere that a single expert is 37B in size, so you'd have to have way more than "several hundred billion parameters" to hold 256 of those? Maybe I don't understand the architecture, but if that's the case, then everyone repeating 37B doesn't, either.
                  • zamadatix3 days ago
                    I think this diagram from the DeepSeekMoE paper explains it most clearly: https://i.imgur.com/CRKttob.png The one on the right is how the feed-forward layers of DeepSeek V3/R1 work; blue and green are experts, and everything in that right section is what counts as "active parameters".

                    K experts (K=8 for these models, but you can customize that if you want) out of 256 per layer are activated at a time. The 256 comes from the model file; it's just how many they chose to build it with. In these models there is also 1 shared expert per layer which is always active. The router picks which K routed experts to use on each forward pass, and then a gating mechanism combines the outputs. If you sum the 1 shared expert + K routed experts + the router + the other networks across the layers, you end up with 37B parameters active per token. The individual experts are therefore much smaller than the total (probably something like 4B parameters each? I've never really checked that directly).

                    Or, for the short answer: "37B is the active parameters of 9 experts plus 'overhead', not the parameters of a single expert".