> The Kardashev scale (Russian: шкала Кардашёва, romanized: shkala Kardashyova) is a method of measuring a civilization's level of technological advancement based on the amount of energy it is capable of harnessing and using.
> Under this scale, the sum of human civilization does not reach Type I status, though it continues to approach it.
R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE is actually what makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason: once a model spills out of the GPU, you take a large performance hit.
If you need to spill into CPU inference, you really want to be multiplying a different ~32B subset of the weights for every token, rather than the same 70B (or more) every time, simply because the computation takes so long.
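A rough sketch of why that matters, assuming single-token decode on CPU is memory-bandwidth bound; the bandwidth and quantization numbers below are made-up assumptions, not measurements:

```python
# Back-of-envelope: tokens/sec ~= RAM bandwidth / bytes of weights touched per token.
ram_bandwidth_gbs = 50          # assumed dual-channel system RAM, GB/s
bytes_per_weight = 1            # assumed 8-bit quantization

dense_70b_active = 70e9         # a dense model touches every weight on every token
moe_32b_active = 32e9           # an MoE only touches the active experts' weights

for name, active in [("dense 70B", dense_70b_active), ("MoE, 32B active", moe_32b_active)]:
    toks_per_sec = ram_bandwidth_gbs * 1e9 / (active * bytes_per_weight)
    print(f"{name}: ~{toks_per_sec:.1f} tokens/sec")
```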
IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.
* the implication being, anything is possible in that scenario
I will also point out that having three API-based providers deploying an impractically-large open-weights model beats the pants off having just one. Back in the day, this was called second-sourcing IIRC. With proprietary models, you're at the mercy of one corporation and their Kafkaesque ToS enforcement.
That seems separate from the post it was replying to, about 1T param models.
If it is intended as a reply, it hand-waves about how having a bad experience with it will teach people to buy more expensive hardware.
Is that "Good."?
The post points out that if people are taught they need an expensive computer just to get 1 token/second, let alone if they try it and find out it's a horrible experience (let's talk about prefill), it will put them off local LLMs unnecessarily.
Is that "Good."?
Now, where's that spare SSD...
For GPU inference at scale, I think token-level batching is used.
The big MLP tensors would be split across GPUs in a cluster. For the MoE parts, you would spread the experts across the GPUs and route to them based on which experts are active (there would likely be more than one active expert if the batch size is > 1).
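A minimal top-k routing sketch of that idea; this is illustrative only (the shapes and function are assumptions, not K2's actual code), but it shows how a batch of tokens gets grouped per expert so each expert's GPU can run one batched matmul:

```python
import torch

def route(tokens: torch.Tensor, router_w: torch.Tensor, k: int = 2):
    # tokens: [batch, hidden], router_w: [hidden, num_experts]
    logits = tokens @ router_w                              # [batch, num_experts]
    weights, expert_ids = torch.topk(logits.softmax(-1), k, dim=-1)
    # Group token indices by expert: with batch size > 1, a batch usually
    # activates many distinct experts spread across the GPUs.
    dispatch = {}
    for tok_idx, experts in enumerate(expert_ids.tolist()):
        for e in experts:
            dispatch.setdefault(e, []).append(tok_idx)
    return weights, expert_ids, dispatch                    # dispatch: expert -> token indices
```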
Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
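Back-of-envelope for that estimate, assuming throughput scales inversely with active parameter count:

```python
# Scale the observed R1 rate by the ratio of active parameters.
r1_tps, r1_active, k2_active = 1.0, 37e9, 32e9
print(round(r1_tps * r1_active / k2_active, 2))  # ~1.16 tokens/sec
```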
From reading articles online, "agentic" seems to mean you have a "virtual" Virtual Assistant with "hands" that can google, open apps, etc., on its own.
Why not use existing "non-agentic" models and "orchestrate" them using LangChain, MCP, etc.? Why create a new breed of model?
I'm sorry if my questions sound silly. Following the AI world is like following the JavaScript world.
Models created for this specific problem domain have a better chance at reliability, which is not a solved problem.
Jules is the Gemini coder that links to GitHub. Half the time it doesn't create a pull request, forgets, and assumes I'll do some testing or something. It's wild.
When an LLM says it's "agentic" it usually means that it's been optimized for tool use. Pretty much all the big models (and most of the small ones) are designed for tool use these days, it's an incredibly valuable feature for a model to offer.
I don't think this new model is any more "agentic" than o3, o4-mini, Gemini 2.5 or Claude 4. All of those models are trained for tools, all of them are very competent at running tool calls in a loop to try to achieve a goal they have been given.
You are more right than you could possibly imagine.
TL;DR: "agentic" just means "can call tools it's been given access to, autonomously, and then access the output" combined with an infinite loop in which the model runs over and over (compared to a one-off interaction like you'd see in ChatGPT). MCP is essentially one of the methods to expose the tools to the model.
Is this something the models could do for a long while with a wrapper? Yup. "Agentic" is the current term for it, that's all. There's some hype around "agentic AI" that's unwarranted, but part of the reason for the hype is that models have become better at tool calling and using data in their context since the early days.
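A minimal sketch of what that loop looks like; the chat() helper, message format, and tools below are assumptions for illustration, not any specific vendor's API:

```python
import json

# Toy tools the model is allowed to call.
TOOLS = {
    "search": lambda query: f"top results for {query!r}...",
    "open_app": lambda name: f"launched {name}",
}

def run_agent(chat, task: str, max_steps: int = 10):
    """Run the model in a loop: call tools it requests, feed results back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = chat(messages)                  # assumed to return a dict
        if reply.get("tool_call") is None:      # no tool requested: final answer
            return reply["content"]
        name = reply["tool_call"]["name"]
        args = json.loads(reply["tool_call"]["arguments"])
        result = TOOLS[name](**args)            # execute the tool locally
        messages += [reply, {"role": "tool", "name": name, "content": result}]
    return "step limit reached"
```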
In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1.
See https://github.com/peteryuqin/Kimi-K2-Mini, a project that keeps a small portion of the experts and layers while preserving the model's capabilities across multiple domains.
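For illustration only, a generic sketch of the "keep the most-used experts" idea, ranking experts by how often the router picks them on a calibration set; this is an assumption about the general approach, not the project's actual implementation:

```python
import torch

def pick_experts_to_keep(router_logits: torch.Tensor, keep: int, k: int = 2):
    # router_logits: [num_tokens, num_experts], collected from calibration data
    chosen = torch.topk(router_logits, k, dim=-1).indices        # experts chosen per token
    counts = torch.bincount(chosen.flatten(), minlength=router_logits.shape[-1])
    return torch.topk(counts, keep).indices                      # ids of the most-used experts
```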
What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to DeepSeek, which is just evenly load-balanced, so it's unlikely to apply to Kimi. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences between experts, even with mostly random training. So maybe there's something there.
It's open-weight. As usual, you don't get the dataset, training scripts, etc.
Is this the largest open-weight model?
At 1T MoE on 15.5T tokens, K2 is one of the largest open-weight models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens: https://huggingface.co/CofeAI/Tele-FLM-1T
You can always check here: https://lifearchitect.ai/models-table/
Grok-1 is 314B, DeepSeek-V3 is 671B, and most recent open-weights models are in the 70B~300B range.
Is there any way that I could do so?
OpenRouter? Or does Kimi have its own website? Just curious to really try it out!
The poaching was probably more aimed at hamstringing Meta's competition.
Because the disruption caused by them leaving in droves is probably more severe than the benefits of having them on board. Unless they are gods, of course.