593 points by impact_sy 4 hours ago | 63 comments
  • jari_mustonen23 minutes ago
    As open source as it gets in this space, top-notch developer documentation, and insanely low prices, while delivering frontier-model capabilities. So basically, this is from hackers to hackers. Loving it!

    Also, note that there's zero CUDA dependency. It runs entirely on Huawei chips. In other words, the Chinese ecosystem has delivered a complete AI stack. Like it or not, that's big news. But what's there not to like when monopolies break down?

    • ifwinterco3 minutes ago
      As a Brit I'm here for it to be honest, I'm tired of America with everything that's going on.

      China is not perfect but a bit of competition is healthy and needed

    • slekker19 minutes ago
      But remember not to ask about Taiwan!
  • throwa35626240 minutes ago
    Seriously, why can't huge companies like OpenAI and Google produce documentation that is half this good??

    https://api-docs.deepseek.com/guides/thinking_mode

    No BS, just a concise description of exactly what I need to write my own agent.
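
    For the curious, here's roughly what that guide boils down to, as a sketch against their OpenAI-compatible endpoint (the model id and the reasoning_content field follow earlier DeepSeek releases; the V4 specifics may differ):

        from openai import OpenAI

        client = OpenAI(
            api_key="sk-...",  # your DeepSeek API key
            base_url="https://api.deepseek.com",
        )
        resp = client.chat.completions.create(
            model="deepseek-reasoner",  # thinking mode; assumed model id
            messages=[{"role": "user", "content": "Plan the next agent step."}],
        )
        msg = resp.choices[0].message
        print(msg.reasoning_content)  # the chain of thought (thinking mode only)
        print(msg.content)            # the final answer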

    • lykr0n22 minutes ago
      It's because they're optimizing for a different problem.

      Western models are optimized to be used as an interchangeable product. Chinese models are optimized to be built upon.

      • raincole19 minutes ago
        > Western Models are optimizing to be used as an interchangeable product

        Why? It sounds like the stupidest idea ever. Interchangeability = no lock-in = no moat.

        • peepee19828 minutes ago
          If you want other people to know whether you're being genuine or sarcastic, you'll have to put a bit more effort into your comments. Your comment just adds noise.
    • vitorgrs23 minutes ago
      Meanwhile, they don't actually say which model you're running on the DeepSeek Chat website.
    • Alifatisk34 minutes ago
      You might enjoy Z.ai's API docs as well
  • orbital-decay22 minutes ago
    >we implement end-to-end, bitwise batch-invariant, and deterministic kernels with minimal performance overhead

    Pretty cool. I think they're the first to guarantee determinism with a fixed seed or at temperature 0. Google came close but never guaranteed it, AFAIK. DeepSeek show their roots: it may not strictly be a SotA model, but there's a ton of low-level optimization nobody else pays attention to.
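
    A quick sanity check of that claim, assuming their OpenAI-compatible endpoint (a sketch, not their test harness): two identical requests at temperature 0 should come back bitwise-identical.

        from openai import OpenAI

        client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

        def sample() -> str:
            r = client.chat.completions.create(
                model="deepseek-chat",  # assumed model id
                temperature=0,
                messages=[{"role": "user", "content": "Name three prime numbers."}],
            )
            return r.choices[0].message.content

        assert sample() == sample(), "not deterministic across calls"

    Note this only tests run-to-run determinism; batch invariance (the same output no matter what else is in the batch) is the harder property they're claiming, and it can't easily be probed from outside.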

  • primaprashant31 minutes ago
    While SWE-bench Verified is not a perfect benchmark for coding, AFAIK this is the first open-weights model to cross the 80% threshold on it, scoring 80.6%.

    Back in Nov 2025, Opus 4.5 (80.9%) was the first proprietary model to do so.

  • revolvingthrowan hour ago
    > pricing "Pro" $3.48 / 1M output tokens vs $4.40

    I'd like somebody to explain to me how the endless comments that "bleeding-edge labs are subsidizing inference at an insane rate" make sense in light of a humongous model like V4 Pro being $4 per 1M. I'd bet even the subscriptions are profitable, never mind the API prices.

    edit: $1.74/M input $3.48/M output on OpenRouter
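
    Napkin math at those rates, with a made-up but typical agent turn (50k tokens in, 2k out):

        IN, OUT = 1.74 / 1e6, 3.48 / 1e6       # $/token on OpenRouter
        tokens_in, tokens_out = 50_000, 2_000  # assumed per-call usage
        print(f"${tokens_in * IN + tokens_out * OUT:.3f} per call")  # ~$0.094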

    • schneehertzan hour ago
      Even this price is elevated because of DeepSeek's current shortage of inference cards; they claimed in their press release that once the Ascend 950 compute cards launch in the second half of the year, the price of the Pro version will drop significantly
    • amunozo10 minutes ago
      I was thinking the same. How can other providers offer third-party open-source models of roughly similar quality, like Kimi K2.6 or GLM 5.1, for a tenth of the price? And how can GPT 5.5 suddenly cost twice as much as GPT 5.4 while being faster? I don't believe it's a bigger, more expensive model to run; they're starting to raise prices because they can and their product is good (which is honest as long as they're transparent about it). Honestly, the story about subscriptions costing the company 20 times more than we're paying is just a PR move to justify the price hike.
    • vitorgrs17 minutes ago
      And they actually say the prices will be "significantly" lower in the second half of the year, when the Huawei 950 chips come in.
    • jimmydoe14 minutes ago
      They've also announced that the Pro price will drop further in 2H26, once they have more Huawei chips.
    • m00xan hour ago
      They are profitable on opex, but not on capex with the current depreciation schedules, though those are now edging higher than expected.
    • dminik16 minutes ago
      I mean, not one "bleeding edge" lab has stated they are profitable. They don't publish financials aside from revenue. And in Anthropic's case, they fuck with pricing every week. Clearly something is wrong here.
    • mirzapan hour ago
      My thoughts exactly. I also believe that subscription services are profitable, and the talk about subsidies is just a way to extract higher profit margins from the API prices businesses pay.
    • raincole38 minutes ago
      Insert the "always has been" meme.

      But seriously, it just stems from the fact that some people want AI to go away. If you set your conclusion first, you can easily derive any premise: AI must go away -> AI must be a bad business -> AI must be losing money.

      • zarzavat33 minutes ago
        Before the AI bubble that will burst any time now, there was the AI winter that would magically arrive before the models got good enough to rival humans.
    • masafej536an hour ago
      Point taken, but there aren't any Western providers there yet. Power is cheaper in China.
      • NitpickLawyeran hour ago
        As this is a new arch with tons of optimizations, it'll take some time for inference engines to support it properly, and then we'll see more third-party providers offer it. Once that settles we'll have a median price for an optimized 1.6T model, and can "guesstimate" from there what the big labs can reasonably serve for the same price. But yeah, it's been said for a while that the big labs are OK on API costs. The only unknown is whether the subscriptions were profitable or not. They've all been reducing their limits lately, it seems.
      • 3uler40 minutes ago
        These models are open and there are tons of Western providers offering them at comparable rates.
    • sekai26 minutes ago
      > I’d like somebody to explain to me how the endless comments of "bleeding edge labs are subsidizing the inference at an insane rate" make sense in light of a humongous model like v4 pro being $4 per 1M. I’d bet even the subscriptions are profitable, much less the API prices.

      One answer: the Chinese Communist Party. They are being subsidized by the state.

  • amunozo5 minutes ago
    For those who rely on open-source models but don't want to stop using frontier models: how do you manage it? Do you pay for one of the Chinese subscription plans? Do you pay for the API directly? After the GPT 5.5 release, however good it is, I am a bit tired of this price hiking and reduced quota every week. I am unemployed right now and cannot afford more expensive plans for the moment.
  • fblp3 hours ago
    There's something heartwarming about the developer docs being released before the flashy press release.
    • onchainintel3 hours ago
      Insert obligatory "this is the way" Mando scene. Indeed!
    • necovek3 hours ago
      Where's the training data and training scripts since you are calling this open source?

      Edit: it seems "open source" was edited out of the parent comment.

      • b65e8bee43c2ed02 hours ago
        doesn't it get tiring after a while? using the same (perceived) gotcha, over and over again, for three years now?

        no one is ever going to release their training data because it contains every copyrighted work in existence. everyone, even the hecking-wholesome safety-first Anthropic, is using copyrighted data without permission to train their models. there you go.

        • necovekan hour ago
          There is an easy fix already in widespread use: "open weights".

          It is very much a valuable thing already; no need to taint it with a false promise.

          Though I disagree that it wouldn't be used if it were indeed open source: I might not do it inside my home lab today, but at least Qwen and DeepSeek would use and build on what e.g. Facebook was doing with Llama, and they might be pushing the open-weights model frontier forward faster.

        • Tepix18 minutes ago
          Nvidia did with Nemo.
        • fragmedean hour ago
          it's not a gotcha; it's people using words in ways others don't like.
      • bl4ckneonan hour ago
        Aww yes, let me push a couple petabytes to my git repo for everyone to download...
        • necovekan hour ago
          An easier thing would be to say "open weights", yes.
      • 0-_-0an hour ago
        Weights are the source, training data is the compiler.
        • injidup44 minutes ago
          You got it the wrong way round. It's more akin to:

          1. Training data is the source.
          2. Training is compilation/compression.
          3. Weights are the compiled output, akin to optimized assembly.

          However it's an imperfect analogy on so many levels. Nitpick away.

  • yanis_t2 hours ago
    Already on OpenRouter. The Pro version is $1.74/M input and $3.48/M output, while Flash is $0.14/M input and $0.28/M output.
    • astrod2 hours ago
      Getting 'Api Error' here :( Every other model is working fine.
      • poglet2 hours ago
        Try interacting with it through the website; it will give an error and some explanation of the issue. I had to relax my guardrail settings.
    • esafak2 hours ago
      • 77ko2 hours ago
        It's on OR, but currently not available on their Anthropic endpoint. OR, if you read this, please enable it there! I am using Kimi 2.6 with Claude Code and it works well, but DeepSeek V4 gives an error:

        Calling https://openrouter.ai/api/messages with model=deepseek/deepseek-v4-pro returns an error because their Anthropic-compat translator doesn't cover V4 yet. The Claude CLI dutifully surfaces that error as "model...does not exist".

  • sidcool3 hours ago
    Truly open source coming from China. This is heartwarming. I know of the potential ulterior motives.
    • b65e8bee43c2ed0an hour ago
      American companies want a scan of your asshole for the privilege of paying to access their models, and unapologetically admit to storing, analyzing, training on, and freely giving your data to any authorities if requested. Chinese ulteriority is hypothetical, American is blatant.
      • elefantenan hour ago
        It's not remotely hypothetical; you'd have to be living under a rock to believe that. And the fusion with a one-party state government that doesn't tolerate huge swathes of thoughtspace being freely discussed is completely streamlined, not mediated by any guardrails or accountability.

        This “no harm to me” meme about a foreign totalitarian government (with plenty of incentive to run influence ops on foreigners) hoovering your data is just so mind-bogglingly naive.

        • ben_w30 minutes ago
          As a non-American, everything you wrote other than "one party" applies to the current US regime.

          Relatively speaking, DeepSeek is less untrustworthy than Grok.

          When I try ChatGPT on current events from the White House it interprets them as strange hypotheticals rather than news, which is probably more a problem with DC than with GPT, but whatever.

        • b65e8bee43c2ed09 minutes ago
          >This “no harm to me” meme about a foreign totalitarian government (with plenty of incentive to run influence ops on foreigners) hoovering your data is just so mind-bogglingly naive.

          yes, this is exactly what I'm saying.

        • danny_codes42 minutes ago
          It’s an open model? So you can run it yourself if you want to
        • oceanplexian23 minutes ago
          > And the fusion with a one-party state government that doesn’t tolerate huge swathes of thoughtspace being freely discussed

          That would be a great argument if the American models weren’t so heavily censored.

          The Chinese model might dodge a question if I ask it about 1-2 specific Chinese cultural issues, but it also doesn't moralize at me at every turn because I asked it to use a piece of security software.

        • theshackleford33 minutes ago
          > This “no harm to me” meme about a foreign totalitarian government (with plenty of incentive to run influence ops on foreigners) hoovering your data is just so mind-bogglingly naive.

          This is why I’ve been urging everyone I know to move away from American based services and providers. It’s slow but honest work.

        • t0lo42 minutes ago
          And you're saying Americans aren't banned from criticising their elites?
          • tommica38 minutes ago
            Pretty sure you guys have strong laws about free speech, and criticizing elites is part of that. Though there are some groups that do not really want the 1st Amendment to be a thing.
            • ben_w27 minutes ago
              > Though there are some groups that do not really want the 1st amendment to be a thing.

              The executive branch?

              • tommica11 minutes ago
                That would be a naïve perspective.
    • Quothling7 minutes ago
      It's a little sad that tech now comes down to geopolitics, but if you're not in the USA then what is the difference? I'm Danish, would I rather give my data to China or to a country which recently threatened the kingdom I live in with military invasion? Ideally I'd give them to Mistral, but in reality we're probably going to continue building multi-model tools to make sure we share our data with everyone equally.
    • spaceman_202024 minutes ago
      I don’t care about whatever “ulterior motives” they might have

      My country’s per capita income is $2500 a year. We can’t pay perpetual rent to OAI/Anthropic

    • try-working2 hours ago
      if you want to understand why labs open source their models: http://try.works/why-chinese-ai-labs-went-open-and-will-rema...
      • wraptile43 minutes ago
        > Internet comments say that open sourcing is a national strategy, a loss maker subsidized by the government. On the contrary, it is a commercial strategy and the best strategy available in this industry.

        This sounds a whole lot like potato, potahto. I think the former argument is very much the correct one: China can undercut everyone and win, even at a loss. It happened with solar panels, steel, EVs, seafood; it's a well-tested strategy and it works really well despite the many flavors it comes in.

        That being said, a job well done for the wrong reasons is still a job well done, so we should very much welcome these contributions. And maybe it's good to upset Western big tech a bit so it remains competitive.

        • try-working6 minutes ago
          It is not only that Chinese labs can undercut on price. It is that they must. They must give away their models for free by open sourcing them, and they must even give away free inference services for people to try them. That is the point of the post.
    • I_am_tiberius2 hours ago
      Open weight!
  • mchusma2 hours ago
    For comparison, on OpenRouter DeepSeek v4 Flash is slightly cheaper than Gemma 4 31B and more expensive than Gemma 4 26B, but it does support prompt caching, which means for some applications it will be the cheapest. Excited to see how it compares with Gemma 4.
    • MillionOClock14 minutes ago
      I wonder why there aren't more open-weights models with support for prompt caching on OpenRouter.
  • cztomsik3 minutes ago
    So is this the first AI lab using MUON for their frontier model?
  • gbnwl3 hours ago
    I’m deeply interested and invested in the field but I could really use a support group for people burnt out from trying to keep up with everything. I feel like we’ve already long since passed the point where we need AI to help us keep up with advancements in AI.
    • satvikpendem2 hours ago
      Don't keep up. Much like with news, you'll know when you need to know, because someone else will tell you first.
    • wordpad3 hours ago
      The players barely ever change. People don't have problems following sports; you shouldn't struggle so much with this once you accept that the top spot changes.
      • gbnwl2 hours ago
        I didn't express this well, but my interest isn't "who is in the top spot"; it's more the why and how of the results the various labs get. This is also magnified by the fact that I'm interested not only in hosted providers of inference but in local models as well. What's your take on the best model to run for coding on 24GB of VRAM locally after the last few weeks of releases? Which harness do you prefer? What quants do you think are best? To use your sports metaphor, it's more than following the national leagues; it's also following college and even high school leagues. And the real interest isn't even who's doing well but WHY, at each level.
        • renticulousan hour ago
          Follow the AI newsletters. They bundle the news along with their Op-Ed and summarize it better.
      • ehnto2 hours ago
        It is funny seeing people ping pong between Anthropic and ChatGPT, with similar rhetoric in both directions.

        At this point I would just pick the one whose "ethics" and user experience you prefer. The difference in performance between these releases has had no impact on the meaningful work one can do with them, unless perhaps you are on the fringes of some domain.

        Personally I am trying out the open models cloud hosted, since I am not interested in being rug pulled by the big two providers. They have come a long way, and for all the work I actually trust to an LLM they seem to be sufficient.

        • DiscourseFan2 hours ago
          I find ChatGPT annoying mostly
          • awakeasleep2 hours ago
            Open settings > personalization. Set it to efficient base style. Turn off enthusiasm and warmth. You’re welcome
    • vrganjan hour ago
      It honestly has all kinda felt like more of the same ever since maybe GPT4?

      New model comes out, has some nice benchmarks, but the subjective experience of actually using it stays the same. Nothing's really blown my mind since.

      Feels like the field has stagnated to a point where only the enthusiasts care.

    • trueno2 hours ago
      holy shit im right there with you
  • nthypes3 hours ago
    https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...

    Model was released and it's amazing. Frontier level (better than Opus 4.6) at a fraction of the cost.

    • 0xbadcafebee2 hours ago
      I don't think we need to compare models to Opus anymore. Opus users don't care about other models, as they're convinced Opus will be better forever. And non-Opus users don't want the expense, lock-in or limits.

      As a non-Opus user, I'll continue to use the cheapest fastest models that get my job done, which (for me anyway) is still MiniMax M2.5. I occasionally try a newer, more expensive model, and I get the same results. I have a feeling we might all be getting swindled by the whole AI industry with benchmarks that just make it look like everything's improving.

      • versteegen2 hours ago
        Which model's best depends on how you use it. There's a huge difference in behaviour between Claude and GPT and other models which makes some poor substitutes for others in certain use cases. I think the GPT models are a bad substitute for Claude ones for tasks such as pair-programming (where you want to see the CoT and have immediate responses) and writing code that you actually want to read and edit yourself, as opposed to just letting GPT run in the background to produce working code that you won't inspect. Yes, GPT 5.4 is cheap and brilliant but very black-box and often very slow IME. GPT-5.4 still seems to behave the same as 5.1, which includes problems like: doesn't show useful thoughts, can think for half an hour, says "Preparing the patch now" then thinks for another 20 min, gives no impression of what it's doing, reads microscopic parts of source files and misses context, will do anything to pass the tests including patching libraries...
      • ind-igo2 hours ago
        Agree with your assessment. I think that after models reached around the Opus 4.5 level, it's been almost indistinguishable for most tasks. Intelligence has been commoditized; what's important now is the workflows, prompting, and context management. And that is unique to each model.
        • vidarh2 minutes ago
          Same for me. There are tasks when I want the smartest model. But for a whole lot of tasks I now default to Sonnet, or go with cheaper models like GLM, Kimi, Qwen. DeepSeek hasn't been in the mix for a while because their previous model had started lagging, but will definitely test this one again.

          The tricky part is that the "number of tokens to good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention, so figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.

        • wuschel28 minutes ago
          This is not true in some cases; e.g., there are stark differences in the correctness of answers in certain types of case work.
      • spaceman_202023 minutes ago
        I found Opus 4.7 to be actually worse than Opus 4.6 for my use case

        Substantially worse at following instructions and overoptimized for maximizing token usage

      • sandosan hour ago
        Is Opus nerfed somehow in Copilot? I've tried it numerous times and it has never really wowed me. They seem to have awfully small context windows, but still, it's mostly the reasoning that has been off.

        Codex is just so much better, or the general GPT models.

      • kmarc2 hours ago
        This resonates with me a lot.

        I do some stuff with Gemini Flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs, and companies

      • post-it2 hours ago
        What do you run these on? I've gotten comfortable with Claude but if folks are getting Opus performance for cheaper I'll switch.
        • oceanplexianan hour ago
          You can just use Claude Code with a few env vars, most of these providers offer an Anthropic compatible API
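
          For example (a sketch: ANTHROPIC_BASE_URL / ANTHROPIC_AUTH_TOKEN / ANTHROPIC_MODEL are the documented Claude Code overrides, but the endpoint URL and model id here are assumptions; check your provider's docs):

              import os
              import subprocess

              env = {
                  **os.environ,
                  # Anthropic-compatible endpoint; URL is an assumption
                  "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
                  "ANTHROPIC_AUTH_TOKEN": os.environ["DEEPSEEK_API_KEY"],
                  "ANTHROPIC_MODEL": "deepseek-chat",  # assumed model id
              }
              subprocess.run(["claude"], env=env)  # launch Claude Code against it
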
        • slopinthebag2 hours ago
          Try Charm Crush first, it's a native binary. If it's unbearable, try opencode, just with the knowledge your system will probably be pwned soon since it's JS + NPM + vibe coding + some of the most insufferable devs in the industry behind that product.

          If you're feeling frisky, Zed has a decent agent harness and a very good editor.

      • sandGorgonan hour ago
        Actually, this is not the reason; the harness is significantly better. There is no harness comparable to Claude Code with skills, etc.

        Opencode was getting there, but it seems the founders lost interest. Pi could be it, but it's very focused on OpenClaw. Even Codex CLI doesn't have all of it.

        Which harness works well with DeepSeek v4?

        • darkwateran hour ago
          What's the issue with OC? I tried it a bit over 2 months ago, when I was still on the Claude API, and I actually liked it more than CC (i.e. the right sidebar with the plan, and a tendency to ask fewer "security" questions than CC). Why is it so bad nowadays?
      • avereveardan hour ago
        Eh, idk. Until yesterday Opus was the one that got spatial reasoning right (I had to do some head-pose stuff; neither GLM 5.1 nor Codex 5.3 could "get" it), and Codex 5.3 was my champion at making UX work.

        So while I agree mixed-model is the way to go, Opus is still my workhorse.

    • onchainintel3 hours ago
      How does it compare to Opus 4.7? I've been immersed in 4.7 all week participating in the Anthropic Opus 4.7 hackathon, and it's pretty impressive, even if it's ravenous from a token perspective compared to 4.6.
      • greenknight3 hours ago
        The thing is, it doesn't need to beat 4.7; it just needs to do somewhat well against it.

        This is free... as in you can download it, run it on your own systems, and fine-tune it to be the way you want it to be.

        • libraryofbabelan hour ago
          > you can download it, run it on your systems

          In theory, sure, but as others have pointed out, you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you'd better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on; otherwise your actual cost per token will be much higher than you think.

          In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.

          • oceanplexian41 minutes ago
            There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.

            And that GPU wouldn't run just one instance; the models are highly parallelizable. It would likely support 10-15 users at once, and if a company oversubscribed 10:1, that GPU supports ~100 seats. Amortized over a couple of years, the costs are competitive.

            • libraryofbabel23 minutes ago
              > There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.

              Obviously, and certainly companies do run their own models because they place some value on data sovereignty for regulatory or compliance or other reasons. (Although the framing that Anthropic or OpenAI might "steal their data" is a bit alarmist - plenty of companies, including some with _highly_ sensitive data, have contracts with Anthropic or OpenAI that say they can't train future models on the data they send them and are perfectly happy to send data to Claude. You may think they're stupid to do that, but that's just your opinion.)

              > the models are highly parallelizable. It would likely support 10-15 users at once.

              Yes, I know that; I understand LLM internals pretty well. One instance of the model in the sense of one set of weights loaded across X number of GPUs; of course you can then run batch inference on those weights, up to the limits of GPU bandwidth and compute.

              But are those 100 users on your own GPUs using them evenly across the 24 hours of the day, or only during 9-5 in some timezone? If the latter, you're leaving your expensive hardware idle for 2/3 of the day, and the third-party providers hosting open-weight models will still beat you on costs, even before getting into other factors, like that they bought their GPUs cheaper than you did. Do the math if you don't believe me.
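
              Napkin version, all numbers assumed for illustration:

                  capex = 500_000                  # GPU spend, assumed
                  seconds = 3 * 365 * 24 * 3600    # 3-year amortization window
                  tps = 5_000                      # aggregate batched tokens/s, assumed
                  for util in (1.0, 1 / 3):
                      million_tokens = seconds * tps * util / 1e6
                      print(f"util {util:.0%}: ${capex / million_tokens:.2f} per 1M tokens")

              At full utilization that's roughly $1 per 1M tokens; at one-third utilization it's over $3 per 1M, i.e. already in the same range as the hosted API price, before power, networking, or staff.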

          • hsbauauvhabzban hour ago
            Sure, but that’s an incredibly short term viewpoint.
        • p1esk3 hours ago
          Do you think a lot of people have “systems” to run a 1.6T model?
          • CJefferson2 hours ago
            To me, the important thing isn't that I can run it; it's that I can pay someone else to run it. I'm finding Opus 4.7 seems weirdly broken compared to 4.6; it just doesn't understand my code and breaks it whenever I ask it to do anything.

            Now, at the moment I can still use 4.6, but eventually Anthropic is going to remove it, and when it's gone it will be gone forever. I'm planning on trying DeepSeek v4, because even if it's not quite as good, I know it will be available forever; I'll always be able to find someone to run it.

          • applfanboysbgon2 hours ago
            No, but businesses do. Being able to run quality LLMs without your business, or business's private information, being held at the mercy of another corp has a lot of value.
            • forrestthewoods2 hours ago
              What type of system is needed to self host this? How much would it cost?
              • disiplus2 hours ago
                Depends on how many users you have and what "production grade" means for you, but ~500k gets you an 8x B200 machine.
              • p1esk2 hours ago
                Depends on how fast you want it to be. I'm guessing a couple of $10k Mac Studio boxes could run it, but probably not fast enough to enjoy using it.
              • fragmede2 hours ago
                One GB200 NVL72 from Nvidia would do it. $2-3 million, or so. If you're a corporation, say Walmart or PayPal, that's not out of the question.

                If you want to go budget corporate, 7 x H200 is just barely going to run it, but all in, $300k ought to do it.

                • gloflo2 hours ago
                  How many users can you serve with that?
                  • fragmedean hour ago
                    For the H200, between 150-700. The GB200 gets you something like 2-10k users.
              • CamperBob235 minutes ago
                $20K worth of RTX 6000 Blackwell cards should let you run the Flash version of the model.
            • choldstare2 hours ago
              Not really; on-prem LLM hosting is extremely labor- and capital-intensive
              • applfanboysbgon2 hours ago
                But can be, and is, done. I work for a bootstrapped startup that hosts a DeepSeek v3 retrain on our own GPUs. We are highly profitable. We're certainly not the only ones in the space, as I'm personally aware of several other startups hosting their own GLM or DeepSeek models.
                • wuschel15 minutes ago
                  Why a retrain? What are you using the model for?
        • onchainintel3 hours ago
          Completely agree; not suggesting it needs to, just genuinely curious. Love that it can be run locally, though. Open-source LLMs have been punching back pretty hard against proprietary cloud ones lately in terms of performance.
        • kelseyfrog3 hours ago
          What's the hardware cost of running it?
          • redox993 hours ago
            Probably like 100 USD/hour
          • bbor2 hours ago
            I was curious, and some [intrepid soul](https://wavespeed.ai/blog/posts/deepseek-v4-gpu-vram-require...) did an analysis. Assuming you do everything perfectly and take full advantage of the model's MoE sparsity, it would take:

            - To run at full precision: "16–24 H100s", giving us ~$400-600k upfront, or $8-12/h from [us-east-1](https://intuitionlabs.ai/articles/h100-rental-prices-cloud-c...).

            - To run with "heavy quantization" (16 bits -> 8): "8xH100", giving us $200K upfront and $4/h.

            - To run truly "locally"--i.e. in a house instead of a data center--you'd need four 4090s, one of the most powerful consumer GPUs available. Even that would clock in around $15k for the cards alone and ~$0.22/h for the electricity (in the US).

            Truly an insane industry. This is a good reminder of why datacenter capex since 2023 has eclipsed the Manhattan Project, the Apollo program, and the US interstate system combined...

            • oceanplexian34 minutes ago
              All these numbers are peanuts to a mid-sized company. A place I worked at used to spend a couple million just on a support contract for a NetApp.

              10 years from now that hardware will be on eBay for any geek with a couple thousand dollars and enough power to run it.

            • zargon2 hours ago
              That article is a total hallucination.

              "671B total / 37B active"

              "Full precision (BF16)"

              And they claim they ran this non-existent model on vLLM and SGLang over a month and a half ago.

              It's clickbait keyword slop filled in with V3 specs. Most of the web is slop like this now. Sigh.

          • slashdave3 hours ago
            "if you have to ask..."
        • johnmaguire3 hours ago
          ... if you have 800 GB of VRAM free.
          • inventor77773 hours ago
            I remember reading that some new frameworks have been coming out that allow Macs to stream the weights of huge models live from fast SSDs and produce quality output, albeit slowly. Apart from that... good luck finding that much available VRAM, haha
      • spaceman_202022 minutes ago
        Tbh I was more productive with 4.6 than ever before and if AI progress locks in permanently at 4.6 tier, I’d be pretty happy
      • rvz3 hours ago
        It is more than good enough and has effectively caught up with Opus 4.6 and GPT 5.4 according to the benchmarks.

        It's about 2 months behind GPT 5.5 and Opus 4.7.

        As long as it is cheap for the hosting providers to run and it is frontier-level, it is a very competitive model and impressive against the others. I give it 2 years maximum before consumer hardware can run 500B-800B quantized models.

        It should be obvious now why Anthropic really doesn't want you to run local models on your machine.

        • deaux2 hours ago
          Vibes > benchmarks. And it's all so task-specific. Gemini 3 has scored very well on benchmarks for a long time but is poor at agentic use cases. A lot of people prefer Opus 4.6 to 4.7 for coding despite the benchmarks, much more than I've seen before (4.5->4.6, 4->4.5).

          That doesn't mean DeepSeek v4 isn't great; it's just that benchmarks alone aren't enough to tell.

        • snovv_crash2 hours ago
          With the ability of the Qwen3.6 27B, I think in 2 years consumers will be running models of this capability on current hardware.
        • colordrops2 hours ago
          What's going to change in 2 years that would allow users to run 500B-800B parameter models on consumer hardware?
    • doctoboggan3 hours ago
      Is it honestly better than Opus 4.6, or just benchmaxxed? Have you done any coding with an agent harness using it?

      If its coding abilities are better than Claude Code with Opus 4.6, then I will definitely be switching to this model.

      • bokkiesan hour ago
        Apparently GLM 5.1 and the latest Qwen Coder are as good as Opus 4.6 on benchmarks. So I tried both seriously for a week (GLM Pro using CC, and Qwen using Qwen Companion), thinking I could save $80 a month. Unfortunately, after 2 days I had switched back to Max. The speed (slower on both, although Qwen is much faster) and the errors: stupid layout mistakes, inserting 2 footers then refusing to remove one, not seeing obvious problems in screenshots, major f-ups of functionality, not being able to view URLs properly, etc. I'll give DeepSeek a go, but I suspect it will be similar; the model is only half the story. I've also been testing GPT 5.4 with Codex and it is very nearly as good as CC, better on long-running tasks in the background. Not keen on the ChatGPT Codex 'personality', so I'll stick with CC for the most part.
      • madagang3 hours ago
        Their Chinese announcement says that, based on internal employee testing, it is not as good as Opus 4.6 Thinking, but is slightly better than Opus 4.6 without Thinking enabled.
        • mchusma3 hours ago
          I appreciate this, makes me trust it more than benchmarks.
        • ibican hour ago
          In case people wonder where the announcement is (you can easily translate it via browser if you don't read Chinese): https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg

          It's still a "preview" version atm.

        • deaux2 hours ago
          That's super interesting. Isn't DeepSeek, in China, banned from using Anthropic models? Yet here they are comparing against them in internal employee testing.
          • renticulousan hour ago
            They use a VPN for access. Even Google DeepMind uses Anthropic; there was a fight within Google over why only DeepMind was allowed to use Claude while the rest of Google couldn't.
    • bbor2 hours ago
      For the curious, I did some napkin math on their posted benchmarks, and it racks up a 20.1-percentage-point total difference across the 20 metrics where both were scored, for an average improvement of about 2% (non-pp). I really can't decide if that's mind-blowing or boring?

      Claude 4.6 was almost 10pp better at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).

    • NitpickLawyer2 hours ago
      > (better than Opus 4.6)

      There we go again :) It seems we get a release each day claiming that. What's weird is that even DeepSeek doesn't claim it's better than Opus w/ thinking. No idea why you'd say that, but anyway.

      DSv3 was a good model, not benchmaxxed at all; it was pretty stable where it was. It did well on tasks that were OOD for benchmarks, even if it was behind SotA.

      This seems to be similar: behind SotA, but not by much, and at a much lower price. The big one is being served (by DS themselves for now; more providers will come and then we'll see the median price) at $1.74 in / $3.48 out / $0.14 cache. Really cheap for what it offers.

      The small one is at $0.14 in / $0.28 out / $0.028 cache, which is pretty much "too cheap to matter". This will be what people can realistically run "at home", and it should be a contender against things like Haiku/Gemini Flash, if it can deliver at those levels.

      • slopinthebag2 hours ago
        Anthropic fans would claim God itself is behind Opus by 3-6 months and then willingly be abused by Boris and one of his gaslighting tweets.

        LMAO

        • NitpickLawyeran hour ago
          > Anthropic fans ...

          I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):

          > According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.

          This is the model creators saying it, not me.

    • sergiotapia3 hours ago
      The dragon awakes yet again!
      • kindkang20242 hours ago
        There appears a flight of dragons without heads. Good fortune.

        That's literally what the I Ching calls "good fortune."

        Competition, when no single dragon monopolizes the sky, brings fortune for all.

    • rapind3 hours ago
      Pop?
  • aquir17 minutes ago
    It is great! I asked the question I always ask of new models ("what would Iain M. Banks think about the current state of AI") and it gave me a brilliant answer! Funnily enough, the answer contained multiple criticisms of its own creators ("Chinese state entities", "Social Credit System").
  • coderssh19 minutes ago
    Feels like the real story here is the cost/performance tradeoff rather than raw capability. Benchmarks keep moving incrementally, but efficiency gains like this actually change who can afford to build on top.
  • Imanarian hour ago
    Just tested it via OpenRouter in the Pi coding agent, and it regularly fails to use the read and write tools correctly; very disappointing. Does anyone know a fix besides prompting "always use the provided tools instead of writing your own call"?
    • abstracthinkingan hour ago
      They have just released it; give it some time. They probably haven't pretested it with Pi.
      • Imanari38 minutes ago
        How can they fix it after the release? They would have to retrain/finetune it further, no?
        • zargon31 minutes ago
          It's only in preview right now. And anyway, yes, models regularly get updated training.

          But in this case, it's more likely just to be a tooling issue.

  • zkmonan hour ago
    They released the 1.6T Pro base model on Hugging Face. First time I'm seeing a "T" model here.
  • zargon3 hours ago
    The Flash version is 284B A13B in mixed FP8/FP4, and the full native-precision weights total approximately 154 GB. The KV cache is said to take 10% as much space as V3's. This looks very accessible for people running "large" local models. It's a nice follow-up to the Gemma 4 and Qwen3.5 small local models.
    • sbinnee2 hours ago
      The price is appealing to me. I have been using Gemini 3 Flash mainly for chat; I may give this a try.

      Pricing: $0.14 in / $0.28 out (whereas Gemini is $0.5/$3)

      Does anyone know why output prices have such a big gap?

      • girvoan hour ago
        Output is what the compute is used for above all else; it costs more hardware time than prompt processing (input), which is a lot faster
      • tokenmaxxinej34 minutes ago
        Input tokens are processed at 10-50 times the speed of output tokens, since you can process them in batches rather than one at a time like output tokens.
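
        Back-of-envelope with illustrative (assumed) throughputs, showing why output dominates the bill:

            prefill_tps, decode_tps = 5_000, 50  # assumed speeds
            n_in, n_out = 20_000, 1_000
            t_in, t_out = n_in / prefill_tps, n_out / decode_tps
            print(f"prefill {t_in:.0f}s, decode {t_out:.0f}s")  # 4s vs 20s
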
  • CJefferson2 hours ago
    What's the current best framework to have a 'claude code' like experience with Deepseek (or in general, an open-source model), if I wanted to play?
    • Alifatisk27 minutes ago
      You can use CC with other models; you aren't forced to use a Claude model.
    • whoopdeepoo2 hours ago
      You can use deepseek with Claude code
      • esperent27 minutes ago
        You can, but does it work well? I assume CC has all kinds of Claude-specific prompts in it; wouldn't you be better off with a harness designed to be model-agnostic, like pi.dev or OpenCode?
    • 0x1428572 hours ago
      claude-code-cli/opencode/codex
  • bandramian hour ago
    I don't mind that High Flyer completely ripped off Anthropic to do this so much as I mind that they very obviously waited long enough for the GAB to add several dozen xz-level easter eggs to it.
  • simonw2 hours ago
    I like the pelican I got out of deepseek-v4-flash more than the one I got from deepseek-v4-pro.

    https://simonwillison.net/2026/Apr/24/deepseek-v4/

    Both generated using OpenRouter.

    For comparison, here's what I got from DeepSeek 3.2 back in December: https://simonwillison.net/2025/Dec/1/deepseek-v32/

    And DeepSeek 3.1 in August: https://simonwillison.net/2025/Aug/22/deepseek-31/

    And DeepSeek v3-0324 in March last year: https://simonwillison.net/2025/Mar/24/deepseek/

    • torginus31 minutes ago
      This is just a random thought, but have you tried doing an 'agentic' pelican?

      As in, have the model consider its generated SVG and gradually refine it, using its knowledge of the relative positions and proportions of the shapes generated; let it spin for a while, and hopefully the end result will be better than just one-shotting it.

      Or maybe going even one step further: most modern models have tool use and image recognition capabilities. What if you had it generate an SVG (or parts/layers of it, at the model's discretion), feed it back to itself via image recognition, and then improve on the result?

      I think it'd be interesting to see, as for a lot of models their one-shot capability in coding is not necessarily correlated with their in-harness ability, and the latter is what really matters.
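
      A rough sketch of that loop (the model id is a placeholder; assumes an OpenAI-compatible multimodal API and cairosvg for rasterizing):

          import base64

          import cairosvg
          from openai import OpenAI

          client = OpenAI()
          ask = "Generate an SVG of a pelican riding a bicycle. Reply with SVG only."
          svg = client.chat.completions.create(
              model="some-multimodal-model",  # placeholder
              messages=[{"role": "user", "content": ask}],
          ).choices[0].message.content

          for _ in range(3):  # a few refinement rounds
              png = cairosvg.svg2png(bytestring=svg.encode())  # render the SVG
              data_url = "data:image/png;base64," + base64.b64encode(png).decode()
              svg = client.chat.completions.create(
                  model="some-multimodal-model",
                  messages=[{"role": "user", "content": [
                      {"type": "text", "text": "Here is your SVG rendered. Critique it, then reply with an improved SVG only."},
                      {"type": "image_url", "image_url": {"url": data_url}},
                  ]}],
              ).choices[0].message.content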

      • simonw9 minutes ago
        I tried that for the GPT-5 launch - a self-improving loop that renders the SVG, looks at it and tries again - and the results were surprisingly disappointing.

        I should try it again with the more recent models.

    • JSR_FDED2 hours ago
      No way. The Pro pelican is fatter, has a customized front fork, and the sun is shining! He’s definitely living the best life.
      • chronograman hour ago
        The Pro pelican is a work of art! It goes to dimensions no other LLM has gone before.
      • w4yai2 hours ago
        yeah. look at these 4 feathers (?) on his bum too.
      • oliver2362 hours ago
        a lot of dumplings
    • nickvec2 hours ago
      The Flash one is pretty impressive. Might be my favorite so far in the pelican-riding-a-bicycle series
    • murktan hour ago
      DeepSeek pelicans are the angriest pelicans I’ve seen so far.
    • mikae12 hours ago
      Being a bicycle geometry nerd I always look at the bicycle first.

      Let me tell you how much the Pro one sucks... It looks like a failed Pedersen[1]. The rear wheel intersects with the bottom bracket, so it wouldn't even roll. Or rather, this bike couldn't exist.

      The Flash one looks surprisingly correct, with some wild fork offset and the slackest of seat tubes. It's got some lowrider[2] aspirations with the small wheels, but with longer, Rivendellish[3] chainstays. The seat post has a different angle than the seat tube, so good luck lowering that.

      [1] https://en.wikipedia.org/wiki/Pedersen_bicycle

      [2] https://en.wikipedia.org/wiki/Lowrider_bicycle

      [3] https://www.rivbike.com/

      • simonw2 hours ago
        This is an excellent comment. Thanks for this - I've only ever thought about whether the frame is the right shape, I never thought about how different illustrations might map to different bicycle categories.
        • mikae1an hour ago
          Some other reactions:

          I wonder which model will try some more common spoke-lacing patterns. Right now there seems to be a preference for radial lacing, which is not super common (but simple to draw). The Flash and Pro ones use 16-spoke rims, which actually exist[1] but are not super common.

          The Pro model fails badly at the spokes. Heck, the spokes sit on the outside of the drive side of the rim and tire. Have a nice ride riding on the spokes (instead of the tire), welded to the side of your rim.

          Both bikes have the drive side on the left, which is very, very uncommon. That can't be from the training data.

          [1] https://cicli-berlinetta.com/product/campagnolo-shamal-16-sp...

      • jojobasan hour ago
        The Pedersen looks like someone failed the "draw a bicycle" test and decided to adjust the universe.
    • catelman hour ago
      I think the pelican on a bike is known widely enough that it ceases to be useful as a benchmark. There is even a pelican briefly appearing in the promo video for GPT-5, if I'm not mistaken: https://openai.com/gpt-5/. So the companies are apparently aware of it.
    • nsoonhuian hour ago
      To me this is the perfect proof that:

      1) LLMs are not AGI, because surely if they were AGI, Pro would do better than Flash?

      2) and, because of the above, the pelican example is most likely already being benchmaxxed.

    • chvidan hour ago
      Is this, then, DeepSeek hosted by DeepSeek?

      How much does the drawing change if you ask it again?

    • brutal_chaos_an hour ago
      What was your prompt for the image? Apologies if this should be obvious.
      • shawn_wan hour ago
        >Generate an SVG of a pelican riding a bicycle

        at the top of the linked pages.

    • ycui19862 hours ago
      I really like the pro version. The pelican is so cute.
    • theanonymousonean hour ago
      Where is the GPT 5.5 Pelican?
    • lobochromean hour ago
      Why are they so angry?
    • whateveracct2 hours ago
      [flagged]
      • fastball2 hours ago
        It's just Simon Willison (the person you are replying to), who always makes a pelican as his personal flippant benchmark. It's not that deep.
      • dewey2 hours ago
        No benchmark will be perfect, especially a public one, but it's a fun experiment to visually see how these models get better and better.
      • post-it2 hours ago
        Why is it so wrong?
      • simonw2 hours ago
        Thanks for the "scientific air" remark, that gave me a genuine LOL.
    • EnPissantan hour ago
      This should not be the top comment on every model release post. It's getting tiring.
      • blitzaran hour ago
        This should be the bottom comment on the pelican comment on every model release post.
        • EnPissantan hour ago
          Clearly the top comment should be "Imagine a beowulf cluster of Deepseek v4!"
          • ButlerianJihadan hour ago
            My mother was murdered by Beowulf, you insensitive Claude!
  • rohanm93an hour ago
    This is shockingly cheap for a near-frontier model. This is insane.

    For context, for an agent we're working on, we're using 5-mini, which is $2/1M tokens. This is $0.30/1M tokens. And it's Opus 4.6 level; this can't be real.

    I am uncomfortable sending user data that may contain PII to their servers in China, so I won't be using this, as appealing as it sounds. I need this to come to a US-hosted environment at an equivalent price.

    Hosting it myself and renting GPUs is much more expensive than DeepSeek's quoted price, so that's not an option.

    • esperent23 minutes ago
      > I am uncomfortable about sending user data which may contain PII to their servers in China

      As a European I feel deeply uncomfortable about sending data to US companies where I know for sure that the government has access to it.

      I also feel uncomfortable sending it to China.

      If you'd asked me ten years ago which one made me more uncomfortable, I'd have said China.

      But now I'm not so sure, in fact I'm starting to lean towards the US as being the major risk.

    • fractalfan hour ago
      Right now I'm much more worried about sending data to the US and A. At least there's less chance it will be misused against -me-.
  • jessepcc3 hours ago
    At this point 'frontier model release' is a monthly cadence (Kimi 2.6, Claude 4.6, GPT 5.5); the interesting question is which evals will still be meaningful in 6 months.
  • xnxan hour ago
    Such a different time now than early 2025, when people thought DeepSeek was going to kill the market for Nvidia.
    • Ifkaluva19 minutes ago
      They might still kill the market for NVIDIA, if future releases prioritize Huawei chips
  • storus2 hours ago
    Oh well, I should have bought 2x 512GB RAM Mac Studios, not just one :(
  • Aliabid943 hours ago
    MMLU-Pro:

    Gemini-3.1-Pro at 91.0

    Opus-4.6 at 89.1

    GPT-5.4, Kimi2.6, and DS-V4-Pro tied at 87.5

    Pretty impressive

    • ant6n2 hours ago
      Funny how Gemini is theoretically the best, but in practice all the bugs in the interface mean I don't want to use it anymore. The worst is that it forgets context (and lies about it), and it's very unreliable at reading PDFs (and lies about it). There's also no branching, so once the context is lost/polluted, you have to start projects over and build up the context from scratch again.
      • spaceman_202018 minutes ago
        The sheer number of bugs and lack of meaningful improvements in Google products is a clear counterargument to the AI bull thesis

        If AI was so good at coding, why can’t it actually make a usable Gemini/AI Studio app?

      • lazycatjumping16 minutes ago
        I gave up on Gemini 3.1 Pro in VSCode after 2 hours. They fully refunded me.
      • esperent22 minutes ago
        Yeah if I could use Gemini with pi.dev that would be my choice. But Gemini CLI is just so, so bad.
  • gardnran hour ago
    865 GB: I am going to need a bigger GPU.
  • apexalpha40 minutes ago
    This Flash model might be affordable for OpenClaw. I run it on my Mac with 48GB of RAM now, but it's slowish.
  • jdeng3 hours ago
    Excited that the long-awaited V4 is finally out, but sad that it's not natively multimodal.
  • clark10132 hours ago
    Looking forward to DeepSeek Coding Plan
  • sibellaviaan hour ago
    A few hours after GPT5.5 is wild. Can’t wait to try it.
  • luyu_wu3 hours ago
    For those who didn't check the page yet: it just links to the API docs updated with the upcoming models, not the actual model release.
  • gigatexal15 minutes ago
    Has anyone used it? How does it compare to gpt 5.5 or opus 4.7?
  • jfxia21 minutes ago
    Is V4 still not a multi-modal model?
    • vitorgrs11 minutes ago
      Not yet... Which is a shame.
  • tcbrahan hour ago
    Giving Meta a run for its money, especially since it was supposed to be the poster child for OSS models. DeepSeek is really overshadowing them rn.
  • WhereIsTheTruth36 minutes ago
    Interesting note:

    "Due to constraints in high-end compute capacity, the current service capacity for Pro is very limited. After the 950 supernodes are launched at scale in the second half of this year, the price of Pro is expected to be reduced significantly."

    So it's going to be even cheaper

  • tarikyan hour ago
    Has anyone tried making a web UI with it? How good is it? For me, Opus is only worth it because of that.
  • augment_me28 minutes ago
    Amaze amaze amaze
  • KaoruAoiShiho3 hours ago
    SOTA on MRCR (or it would've been, a few hours earlier... beaten by 5.5). I've long thought of this as the most important non-agentic benchmark, so this is especially impressive. It beats Opus 4.7 here.
  • coolThingsFirst22 minutes ago
    I got an API key without credit card details; I didn't know they had a free plan.
  • reenorap3 hours ago
    Which version fits in a Mac Studio M3 Ultra 512 GB?
  • aliljet2 hours ago
    How can you reasonably try to get near-frontier (at any tps at all) on hardware you own? Maybe under $5k in cost?
    • revolvingthrow2 hours ago
      For Flash? A 4-bit quant on 2x 96GB GPUs (fast and expensive), or 1x 96GB GPU + 128GB RAM (still expensive, but probably usable if you're patient).

      A Mac with 256GB of memory would run it but be very slow, and so would a 256GB RAM + cheapo-GPU desktop, unless you leave it running overnight.

      The big model? Forget it, not this decade. You can theoretically load from SSD but waiting for the reply will be a religious experience.

      Realistically the biggest models you can run on local-as-in-worth-buying-as-a-person hardware are between 120B and 200B, depending on how far you’re willing to go on quantization. Even this is fairly expensive, and that’s before RAM went to the moon.
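
      The weight-size arithmetic behind those numbers (GB ≈ params_in_billions × bits / 8, ignoring KV cache and runtime overhead):

          def weight_gb(params_b: float, bits: int) -> float:
              return params_b * bits / 8

          for name, p in [("Flash (284B)", 284), ("Pro (1.6T)", 1600)]:
              for bits in (8, 4):
                  print(f"{name} @ {bits}-bit: ~{weight_gb(p, bits):,.0f} GB")

      Flash at 4-bit lands around 142 GB, hence the 2x 96GB figure; Pro at 4-bit is ~800 GB, which is why it's out of reach locally.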

      • zargonan hour ago
        Flash is less than 160 GB. No need to quantize to fit in 2x 96 GB. Not sure how much context fits in 30 GB, but it should be a good amount.
        • redrovean hour ago
          It seems to be 160GB at mixed FP4+FP8 precision, FYI. Full FP8 is 250GB+. (B)F16 at around double I would assume.
          • zargonan hour ago
            There is no BF16. The instruct model at full precision is 160 GB (mixed FP4 and FP8). The base model at full precision is 284 GB (FP8). Almost everyone is going to use instruct. But I do love to see base models released.
    • zozbot234an hour ago
      Run on an old HEDT platform with a lot of parallel attached storage (probably PCIe 4) and fetch weights from SSD. You'd ultimately be limited by the latency of these per-layer fetches, since MoE weights are small. You could reduce the latencies further by buying cheap Optane memory on the second-hand market.
    • awakeasleep2 hours ago
      The same way you fit a bucket wheel excavator in your garage
      • floam2 hours ago
        Very carefully
    • datadrivenangel2 hours ago
      A loaded macbook pro can get you to the frontier from 24 months ago at ~10-40tok/s, which is plenty fast enough for regular chatting.
    • 5424582 hours ago
      The low end could be something like an eBay-sourced server with a truckload of DDR3 RAM doing all-CPU inference; secondhand server models with a terabyte of RAM can be had for about $1.5K. The TPS will be absolute garbage and it will sound like a jet engine, but it will nominally run.

      The Flash version here is 284B A13B, so it might perform OK with a fairly small amount of VRAM for the active params and regular RAM for the rest, but I'd have to see benchmarks. If that works out alright, an eBay server plus a 3090 might be the bang-for-buck champ at about $2.5K (assuming you're starting from zero).

    • jdoe1337halo2 hours ago
      More like 500k
  • mariopt2 hours ago
    Does DeepSeek have any coding plan?
  • swrrt3 hours ago
    Is there any visualized benchmark/scoreboard for comparison between the latest models? DeepSeek v4 and GPT-5.5 seem to be groundbreaking.
  • rvz3 hours ago
    The paper is here: [0]

    Was expecting that the release would be this month [1], since everyone forgot about it and not reading the papers they were releasing and 7 days later here we have it.

    One of the key points of this model to look at is the optimization that DeepSeek made with the residual design of the neural network architecture of the LLM, which is manifold-constrained hyper-connections (mHC) which is from this paper [2], which makes this possible to efficiently train it, especially with its hybrid attention mechanism designed for this.

    There wasn't much discussion about it here some months ago [3], but again, the paper is a recommended read.

    I wouldn't trust the benchmarks directly, but would wait for others to try it for themselves to see if it matches the performance of frontier models.

    Either way, this is why Anthropic wants to ban open-weight models, and I cannot wait for the quantized versions to land shortly.

    [0] https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...

    [1] https://news.ycombinator.com/item?id=47793880

    [2] https://arxiv.org/abs/2512.24880

    [3] https://news.ycombinator.com/item?id=46452172

    • jeswin3 hours ago
      > this is why Anthropic wants to ban open weight models

      Do you have a source?

      • louiereederson2 hours ago
        More like he wants to ban accelerator chip sales to China, which may be about "national security" or self-preservation against a different model of AI development, one that also happens to be an existential threat to Anthropic. Maybe those alternatives are actually one and the same to him.
  • luewan hour ago
    We will be hosting it soon at getlilac.com!
  • namegulf3 hours ago
    Is there a quantized version of this?
  • sergiotapiaan hour ago
    Using it with opencode, it sometimes generates commands like:

        bash({"command":"gh pr create --title "Improve Calendar module docs and clean up idiomatic Elixir" --body "$(cat <<'EOF'
        Problem
        The Calendar modu...
    
    It generates the output but doesn't actually run the bash command, so it never creates the PR. I wonder if it's a model thing or an opencode thing.
  • punkpeyean hour ago
    Incredible model quality to price ratio
  • ls6123 hours ago
    How long does it usually take for folks to make smaller distills of these models? I really want to see how this will do when brought down to a size that will run on a Macbook.
    • simonw2 hours ago
      Unsloth often turns them around within a few hours, though they might have gone to bed already!

      Keep an eye on https://huggingface.co/unsloth/models

      Update ten minutes later: https://huggingface.co/unsloth/DeepSeek-V4-Pro just appeared but doesn't have files in yet, so they are clearly awake and pushing updates.

    • inventor77772 hours ago
      Weren't there some frameworks released recently that let Macs stream weights from fast SSDs and thus fit way more parameters than would normally fit in RAM?

      I have never tried one yet but I am considering trying that for a medium sized model.

      • simonw2 hours ago
        I've been calling that the "streaming experts" trick. The key idea is to take advantage of Mixture of Experts models, where only a subset of the weights is used for each round of calculations, by loading just those weights from SSD into RAM for each round.

        As I understand it, if DeepSeek v4 Pro is 1.6T params with 49B active, you'd need just the 49B in memory: ~100GB at 16-bit or ~50GB quantized to 8-bit.

        v4 Flash is 284B with 13B active, so it might even fit in <32GB.
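
        Here's a minimal sketch of the trick, assuming a hypothetical one-file-per-expert layout on disk (names and shapes are made up, just to show the per-token fetch):

            import numpy as np

            TOP_K = 4  # experts the router keeps per layer, out of e.g. 64

            def load_expert(layer, idx):
                # mmap'd load: the OS pages in only the weights we touch
                return np.load(f"experts/L{layer}_E{idx}.npy", mmap_mode="r")

            def moe_layer(x, layer, gate_w):
                top = np.argsort(x @ gate_w)[-TOP_K:]  # router picks top-k experts
                # only TOP_K expert matrices get read per token, which is why
                # resident memory tracks the "active" parameter count
                return sum(x @ load_expert(layer, i) for i in top) / TOP_K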

        • zozbot234an hour ago
          The "active" count is not very meaningful except as a broad measure of sparsity, since the experts in MoE models are chosen per layer. Once you're streaming experts from disk, there's nothing that inherently requires having 49B parameters in memory at once. Of course, the less caching memory does, the higher the performance overhead of fetching from disk.
        • inventor77772 hours ago
          Ahh, that actually makes more sense now. (As you can tell, I just skimmed through the READMEs and starred "for later".)

          My Mac can fit almost 70B (Q3_K_M) in memory at once, so I really need to try this out soon at maybe Q5-ish.

        • zargon2 hours ago
          > ~100GB at 16 bit or ~50GB at 8bit quantized.

          V4 is natively mixed FP4 and FP8, so significantly less than that. 50 GB max unquantized.

        • EnPissantan hour ago
          Streaming weights from RAM to GPU for prefill makes sense: batching amortizes it, and PCIe 5.0 x16 is fast enough to make it worthwhile.

          Streaming weights from RAM to GPU for decode makes no sense at all, because batching requires multiple parallel streams.

          Streaming weights from SSD _never_ makes sense, because the gap between SSD and RAM bandwidth is too large. There is no situation where you couldn't fit the model in RAM yet could still get useful speeds from SSD.
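
          Rough arithmetic behind that (bandwidth figures are illustrative; assumes v4 Flash's ~13B active params at ~1 byte each in FP8, batch size 1):

              active_gb = 13  # ~13B active params * ~1 byte (FP8) read per token
              for src, gbps in [("NVMe SSD", 14), ("PCIe 5.0 x16", 64), ("DDR5 RAM", 90)]:
                  print(f"{src:>12}: ~{gbps / active_gb:.1f} tok/s ceiling")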

      • zozbot234an hour ago
        These are more experiments than a polished release as yet. And the reduction in throughput is high compared to keeping the weights in RAM at all times, since you're bottlenecked by the SSD, which even at its fastest is much slower than RAM.
      • the_sleaze_2 hours ago
        Do you have the links for those? Very interested
  • hongbo_zhang2 hours ago
    congrats
  • dhruv30062 hours ago
    Ah now!
  • creamyhorror3 hours ago
    [dead]
  • hubertzhang2 hours ago
    [dead]
  • maryjeiel3 hours ago
    [dead]
  • minhajulmahib3 hours ago
    [flagged]
    • polski-g2 hours ago
      Why did you bother to submit an AI comment?
      • sidcool2 hours ago
        I suspect you may have replied to a bot. Dead internet theory
  • slopinthebag2 hours ago
    OMG

    OMG ITS HAPPENING

  • shafiemoji3 hours ago
    I hope the update is an improvement. Losing 3.2 would be a real loss; it's excellent.
  • raincole3 hours ago
    History doesn't always repeat itself.

    But if it does, then in the following week we'll see DeepSeek v4 flood every AI-related online space: thousands of posts swearing it's better than the latest models from OpenAI/Anthropic/Google while costing only pennies.

    Then a few weeks later it'll be forgotten by most.

    • sbysb3 hours ago
      It's difficult because, even if the underlying model is very good, not having a pre-built harness like Claude Code makes it very un-sticky for most devs. Even at equal quality, the friction (or at least the perceived friction) is higher than with the mainstream models.
      • raincole3 hours ago
        OpenCode? Pi?

        If one finds it difficult to set up OpenCode to use whatever providers they want, I won't call them 'dev'.

        The only real friction (if the model is actually as good as SOTA) is convincing your employer to pay for it. But again, if it really provides the same value at a fraction of the cost, that will eventually cease to be an issue.

        • throwa356262an hour ago

              "If one finds it difficult to set up OpenCode to use whatever providers they want, I won't call them 'dev'."
          
          
          I feel the same way. But look at the ollama vs llama.cpp post on HN from a few days back and you'll see that most of the enthusiasts in this space are very non-technical people.
          • zargon34 minutes ago
            I think you mean ollama vs llama.cpp.
      • cmrdporcupine2 hours ago
        They have instructions right on their page for using Claude Code with it.
    • slopinthebag2 hours ago
      [flagged]