> The Kardashev scale (Russian: шкала Кардашёва, romanized: shkala Kardashyova) is a method of measuring a civilization's level of technological advancement based on the amount of energy it is capable of harnessing and using.
> Under this scale, the sum of human civilization does not reach Type I status, though it continues to approach it.
R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE is actually what makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason: once a model spills out of the GPU, you take a large performance hit.
If you need to spill into CPU inference, you really want to be multiplying a different ~32B subset of the weights for every token, rather than the same 70B (or more) every time, simply because the computation takes so long.
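A rough sketch of why that matters, assuming single-token decode on CPU is memory-bandwidth bound; the bandwidth and quantization numbers below are made-up assumptions, not measurements:

```python
# Back-of-envelope: tokens/sec ~= RAM bandwidth / bytes of weights touched per token.
ram_bandwidth_gbs = 50          # assumed dual-channel system RAM, GB/s
bytes_per_weight = 1            # assumed 8-bit quantization

dense_70b_active = 70e9         # a dense model touches every weight on every token
moe_32b_active = 32e9           # an MoE only touches the active experts' weights

for name, active in [("dense 70B", dense_70b_active), ("MoE, 32B active", moe_32b_active)]:
    toks_per_sec = ram_bandwidth_gbs * 1e9 / (active * bytes_per_weight)
    print(f"{name}: ~{toks_per_sec:.1f} tokens/sec")
```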
IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.
* the implication being, anything is possible in that scenario
I will also point out that having three API-based providers deploying an impractically-large open-weights model beats the pants off having just one. Back in the day, this was called second-sourcing IIRC. With proprietary models, you're at the mercy of one corporation and their Kafkaesque ToS enforcement.
That seems separate from the post it was replying to, about 1T param models.
If it is intended as a reply, it hand-waves about how having a bad experience with it will teach people to buy more expensive hardware.
Is that "Good."?
The post points out that if people are taught they need an expensive computer just to get 1 token/second, let alone if they try it and find out it's a horrible experience (let's talk about prefill), it will put them off local LLMs unnecessarily.
Is that "Good."?
Now, where's that spare SSD...
For GPU inference at scale, I think token-level batching is used.
The big MLP tensors would be split across GPUs in a cluster. For the MoE parts, you would spread the experts across the GPUs and route to them based on which experts are active (there would likely be more than one active expert if the batch size is > 1).
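A minimal top-k routing sketch of that idea; this is illustrative only (the shapes and function are assumptions, not K2's actual code), but it shows how a batch of tokens gets grouped per expert so each expert's GPU can run one batched matmul:

```python
import torch

def route(tokens: torch.Tensor, router_w: torch.Tensor, k: int = 2):
    # tokens: [batch, hidden], router_w: [hidden, num_experts]
    logits = tokens @ router_w                              # [batch, num_experts]
    weights, expert_ids = torch.topk(logits.softmax(-1), k, dim=-1)
    # Group token indices by expert: with batch size > 1, a batch usually
    # activates many distinct experts spread across the GPUs.
    dispatch = {}
    for tok_idx, experts in enumerate(expert_ids.tolist()):
        for e in experts:
            dispatch.setdefault(e, []).append(tok_idx)
    return weights, expert_ids, dispatch                    # dispatch: expert -> token indices
```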
Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
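Back-of-envelope for that estimate, assuming throughput scales inversely with active parameter count:

```python
# Scale the observed R1 rate by the ratio of active parameters.
r1_tps, r1_active, k2_active = 1.0, 37e9, 32e9
print(round(r1_tps * r1_active / k2_active, 2))  # ~1.16 tokens/sec
```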
From reading articles online, "agentic" seems to mean you have a "virtual" Virtual Assistant with "hands" that can google, open apps, etc., on its own.
Why not use existing "non-agentic" models and "orchestrate" them using LangChain, MCP, etc.? Why create a new breed of model?
I'm sorry if my questions sound silly. Following the AI world is like following the JavaScript world.
Models created for this specific problem domain have a better chance at reliability, which is not a solved problem.
Jules is the Gemini coder that links to GitHub. Half the time it doesn't create a pull request, forgets, and assumes I'll do some testing or something. It's wild.
When an LLM says it's "agentic" it usually means that it's been optimized for tool use. Pretty much all the big models (and most of the small ones) are designed for tool use these days, it's an incredibly valuable feature for a model to offer.
I don't think this new model is any more "agentic" than o3, o4-mini, Gemini 2.5 or Claude 4. All of those models are trained for tools, all of them are very competent at running tool calls in a loop to try to achieve a goal they have been given.
You are more right than you could possibly imagine.
TL;DR: "agentic" just means "can call tools it's been given access to, autonomously, and then access the output" combined with an infinite loop in which the model runs over and over (compared to a one-off interaction like you'd see in ChatGPT). MCP is essentially one of the methods to expose the tools to the model.
Is this something the models could do for a long while with a wrapper? Yup. "Agentic" is the current term for it, that's all. There's some hype around "agentic AI" that's unwarranted, but part of the reason for the hype is that models have become better at tool calling and using data in their context since the early days.
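A minimal sketch of what that loop looks like; the chat() helper, message format, and tools below are assumptions for illustration, not any specific vendor's API:

```python
import json

# Toy tools the model is allowed to call.
TOOLS = {
    "search": lambda query: f"top results for {query!r}...",
    "open_app": lambda name: f"launched {name}",
}

def run_agent(chat, task: str, max_steps: int = 10):
    """Run the model in a loop: call tools it requests, feed results back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = chat(messages)                  # assumed to return a dict
        if reply.get("tool_call") is None:      # no tool requested: final answer
            return reply["content"]
        name = reply["tool_call"]["name"]
        args = json.loads(reply["tool_call"]["arguments"])
        result = TOOLS[name](**args)            # execute the tool locally
        messages += [reply, {"role": "tool", "name": name, "content": result}]
    return "step limit reached"
```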
In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1.
See https://github.com/peteryuqin/Kimi-K2-Mini, a project that keeps a small portion of the experts and layers while preserving the model's capabilities across multiple domains.
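For illustration only, a generic sketch of the "keep the most-used experts" idea, ranking experts by how often the router picks them on a calibration set; this is an assumption about the general approach, not the project's actual implementation:

```python
import torch

def pick_experts_to_keep(router_logits: torch.Tensor, keep: int, k: int = 2):
    # router_logits: [num_tokens, num_experts], collected from calibration data
    chosen = torch.topk(router_logits, k, dim=-1).indices        # experts chosen per token
    counts = torch.bincount(chosen.flatten(), minlength=router_logits.shape[-1])
    return torch.topk(counts, keep).indices                      # ids of the most-used experts
```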
What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to DeepSeek, which is just evenly load-balanced, so it's unlikely to apply to Kimi. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences between experts, even with mostly random training. So maybe there's something there.
It's open-weight. As usual, you don't get the dataset, training scripts, etc.
Is this the largest open-weight model?
At 1T MoE on 15.5T tokens, K2 is one of the largest open-weight models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens: https://huggingface.co/CofeAI/Tele-FLM-1T
You can always check here: https://lifearchitect.ai/models-table/
Grok-1 is 314B, DeepSeek-V3 is 671B, and most recent open-weights models are in the 70B~300B range.
Is there any way that I could do so?
OpenRouter? Or does Kimi have its own website? Just curious to really try it out!
The poaching was probably more aimed at hamstringing Meta's competition.
Because the disruption caused by them leaving in droves is probably more severe than the benefits of having them on board. Unless they are gods, of course.