Moreover, the ultimate competition between models will eventually become a competition over energy. China’s open-source models have major advantages in energy consumption, and China itself has a huge advantage in energy resources. They may not necessarily outperform the U.S., but they probably won’t fall too far behind either.
Energy generation at multiples of what the US is producing. What does AI need? Energy.
Second, the open-source nature of the models means, as you said, a high baseline to start with and faster iteration.
does outperform
China is absolutely winning innovation in the 21st century. I'm so impressed. For an example from just this morning, there was an article that they're developing thorium reactor-powered cargo ships. I'm blown away.
I guess it depends on how you see it, but regardless, the people putting it to use today don't seem to be in the US.
FWIW:
> Thorium was discovered in 1828 by the Swedish chemist Jöns Jacob Berzelius during his analysis of a new mineral [...] In 1824, after more deposits of the same mineral in Vest-Agder, Norway, were discovered [...] While thorium was discovered in 1828 its first application dates only from 1885, when Austrian chemist Carl Auer von Welsbach invented the gas mantle [...] Thorium was first observed to be radioactive in 1898, by the German chemist Gerhard Carl Schmidt
For being an American discovery, it sure has a lot of European people involved in it :) (I've said it elsewhere but it's worth repeating: trying to track down where a technology/invention actually comes from is a fool's errand, and there is always something earlier that led to today, so it doesn't serve much purpose except nationalism, it seems to me).
https://en.wikipedia.org/wiki/Thorium-based_nuclear_power#Hi...
https://en.wikipedia.org/wiki/Molten-Salt_Reactor_Experiment
Many such cases...
My problem was that it really doesn't: none of the models out there are that great at agentic coding when you care about maintainability. Sonnet 4.5 sometimes struggles and is only okay with some steering, same for Gemini 2.5 Pro; GPT-5 recently seems closer to "just working" with high reasoning, but it is still expensive and slow. Cerebras recently started offering GLM-4.6 and it's roughly on par with Sonnet 4, so not great, but 24M tokens per day for 50 USD seems like good value even with the 128k context limitation.
I don't think there is a single model that is good enough and dependable enough in my experience out there yet, I'll probably keep jumping around for the next 5-10 years (assuming the models keep improving until we hit diminishing returns so hard that it all evens out, hopefully after they've reached a satisfying baseline usefulness).
Don't get me wrong, all of those models can already provide value; it's just that they're pretty finicky a lot of the time. Some of that is inherent to how LLMs work, but some of it is also because they should just be trained better and more. And the tools they're given should be better. And the context should be managed better. And I shouldn't see something as simple as diffs failing to apply repeatedly just because I'm asking for 100% accuracy in the search/replace to avoid them messing up the brackets or whatever else.
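For anyone wondering what that failure mode looks like, here's a minimal sketch of the strict search/replace style of edit application I mean (the function and argument names are made up for illustration, not from any particular tool):

    # Strict search/replace patching: the model emits an exact "search" block
    # and a "replace" block. If the search text doesn't match the file
    # byte-for-byte, the edit is rejected rather than fuzzily applied.
    from pathlib import Path

    def apply_edit(path: str, search: str, replace: str) -> bool:
        source = Path(path).read_text()
        if source.count(search) != 1:
            # zero matches: the model hallucinated the context;
            # multiple matches: the edit is ambiguous. Either way, refuse.
            return False
        Path(path).write_text(source.replace(search, replace, 1))
        return True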
Neither is being bleeding edge.
I use Mistral's models. I've built an entire internal-knowledge pipeline of sorts using Mistral's products (which involved anything from OCR, to summarization, to linking stuff across different services like Jira or Teams, etc.) and I've been very happy with it.
We did consider alternatives and, truth be told, none was as cost-effective, fast, and satisfying (and also our company does not trust US AI companies not to do stuff with our data).
Europe doesn't have the infrastructure (legal or energy) and US companies offer far better compensation for talent.
But hey, at least we have AI regulation! (sad smile :))
What we're going to see is that, as energy becomes a problem, they'll simply shift to more effective and efficient architectures in both physical hardware and model design. I suspect they can also simply charge more for the service, which reduces usage for senseless applications.
It might only stop once the electricity problem becomes truly unsustainable. Of course, I don’t fully understand the specific situation in the U.S., but I even feel that one day they might flee the U.S. altogether and move to the Middle East to secure resources.
Kimi K2 Thinking is rumored to have cost $4.6m to train - according to "a source familiar with the matter": https://www.cnbc.com/2025/11/06/alibaba-backed-moonshot-rele...
I think the most interesting recent Chinese model may be MiniMax M2, which is just 200B parameters but benchmarks close to Sonnet 4, at least for coding. That's small enough to run well on ~$5,000 of hardware, as opposed to the 1T models which require vastly more expensive machines.
Honestly curious where you got this number. Unless you're talking about extremely small quants. Even just a Q4 quant gguf is ~130GB. Am I missing out on a relatively cheap way to run models well that are this large?
I suppose you might be referring to a Mac Studio, but (while I don't have one to be a primary source of information) it seems like there is some argument to be made on whether they run models "well"?
An M3 Ultra with 256GB of RAM is $5599. That should just about be enough to fit MiniMax M2 at 8bit for MLX: https://huggingface.co/mlx-community/MiniMax-M2-8bit
Or maybe run a smaller quantized one to leave more memory for other apps!
Here are performance numbers for the 4bit MLX one: https://x.com/ivanfioravanti/status/1983590151910781298 - 30+ tokens per second.
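If you want to try it, a rough sketch with the mlx-lm Python package (the repo name is taken from the link above; the prompt is just an example, and you'd swap in a 4-bit variant if RAM is tight):

    # Rough sketch using mlx-lm on Apple Silicon.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/MiniMax-M2-8bit")
    text = generate(model, tokenizer,
                    prompt="Write a function that reverses a linked list.",
                    max_tokens=512)
    print(text)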
Other reported training costs tend to include rental of the cloud hardware (or equivalent if the hardware is owned by the company), e.g. NVIDIA H100s are sometimes priced out in cost-per-hour.
It would be simply wrong to exclude the staffing costs. When each engineer costs well over 1 million USD in total costs year over year, you sure as hell account for them.
Calculating the cost in terms of GPU-hours is a whole lot easier from an accounting perspective.
The papers I've seen that talk about training cost all do it in terms of GPU hours. The gpt-oss model card said 2.1 million H100-hours for gpt-oss:120b. The Llama 2 paper said 3.31M GPU-hours on A100-80G. They rarely give actual dollar costs and I've never seen any of them include staffing hours.
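For a rough sense of scale, you can turn those GPU-hour figures into dollars with an assumed rental rate (the per-hour prices here are assumptions, not from the papers):

    # Back-of-envelope: GPU-hours x assumed $/hour. Rates are illustrative only.
    gpt_oss_120b_h100_hours = 2.1e6    # from the gpt-oss model card
    llama2_a100_hours       = 3.31e6   # from the Llama 2 paper

    print(gpt_oss_120b_h100_hours * 2.0)   # ~$4.2M at an assumed $2/H100-hour
    print(llama2_a100_hours * 1.5)         # ~$5.0M at an assumed $1.50/A100-hour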
As with staffing costs though it's hard to account for these against individual models. If Anthropic run a bunch of training experiments that help them discover a new training optimization, then use that optimization as part of the runs for the next Opus and Sonnet and Haiku (and every subsequent model for the lifetime of the company) how should the cost of that experimental run be divvied up?
This is much more likely to be an issue in the US than in China. https://fortune.com/2025/08/14/data-centers-china-grid-us-in...
https://www.brookings.edu/articles/how-do-china-and-america-...
The source for China's energy is more fragile than that of the US.
> Coal is by far China’s largest energy source, while the United States has a more balanced energy system, running on roughly one-third oil, one-third natural gas, and one-third other sources, including coal, nuclear, hydroelectricity, and other renewables.
Also, China's economy is a bit less efficient in terms of power used per unit of GDP. China relies on coal and imports.
> However, China uses roughly 20% more energy per unit of GDP than the United States.
Remember, China still suffers from blackouts due to manufacturing demand not matching supply. The Fortune article seems like a fluff piece.
https://www.npr.org/2021/10/01/1042209223/why-covid-is-affec...
China has been adding something like a 1GW coal plant’s worth of solar generation every eight hours in the past year, and the rate is accelerating. The US is no longer a serious competitor for China when it comes to energy production.
Not quite at 1 GW / 8 hrs, but approaching that figure rapidly!
(I'm not sure where the coal plant comes in - really, those numbers should be derated relative to a coal plant, which can run 24/7)
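To make the derating concrete, a back-of-envelope comparison using assumed capacity factors (the ~20% solar and ~55% coal figures are my assumptions, not from the thread):

    # How much solar nameplate capacity delivers the same annual energy as a
    # 1 GW coal plant, under assumed capacity factors.
    coal_gw, coal_cf = 1.0, 0.55   # assumed coal capacity factor
    solar_cf = 0.20                # assumed solar capacity factor

    equivalent_solar_gw = coal_gw * coal_cf / solar_cf
    print(equivalent_solar_gw)     # ~2.75 GW of solar nameplate per 1 GW coal plant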
It works both ways: you have to derate the coal plant somewhat due to the transmission losses, whereas with a lot of solar power being generated and consumed on/in the same building the losses are practically nil.
Also, pricing for new solar with battery is below the price of building a new coal plant and dropping, it's approaching the point where it's economical to demolish existing coal plants and replace them with solar.
* China has produced more PV panel capacity in the first half of this year than the US has installed, all in all, in all of its history
* China alone has installed PV capacity of over 1000 GW today
* China has installed battery electrical storage of about 100 GW / 300 GWh today and aims to have 180 GW in 2027
Presumably they'd stop doing that once AI becomes a more beneficial use for the energy though.
Western media still carry strong biases toward China’s political system, and they have done far too little to portray the country’s real situation. The narrative remains the same old one: “China succeeded because it’s capitalist,” or “China is doomed because it’s communist.”
But in reality, barely a few days go by without some new technological breakthrough or innovation happening in China. The pace of progress is so fast that even people inside the country don't always keep up with it. For example, just since the start of November, we've seen China's space station crew doing a barbecue in orbit, researchers in Hefei making new progress on their "artificial sun" fusion work, and a team discovering a safe and efficient method for preparing aromatic amines. Apart from the space station bit, which got some attention, the others barely made a ripple. Also, China's first aircraft carrier with electromagnetic catapults has officially entered service.
About a year ago I started using Reddit intensively. What I read most on Reddit are reports related to electricity, because the topic involves environmental protection, hatred towards Trump, etc. There are too many leftists, so the discussions are somewhat biased, but the related news reports and nuclear data are real. China reached its carbon peak in 2025, and this year it has truly become a powerhouse in electricity. National data centers are continuously being built, but residential electricity prices have never been and will never be affected. China still has a lot of coal-fired power, but it continues to carry out technological upgrades on those plants. At the same time, wind, solar, nuclear and other sources are all advancing steadily. China is the only country that is not controlled by ideology and is increasing its electricity capacity in a scientific way.
(Maybe in the AI field people like to talk about this more. It's not only Kimi releasing a new model; Xpeng also showed a new robot that drew some attention. These all happened within a few days.)
Have recently noticed a lot of pro-CCP propaganda on social media (especially Instagram and TikTok), but strangely also on HN; kind of interesting. To anyone making the (trivially false) claim that China is not controlled by ideology, I'm not quite sure how you'd convince them of the opposite. I'm not a doomer, but as China ramps up their aggression towards Taiwan (and the US will inevitably have to intervene), this will likely not end well in the next 5-10 years.
uv tool install llm
llm install llm-moonshot
llm keys set moonshot # paste key
llm -m moonshot/kimi-k2-thinking 'Generate an SVG of a pelican riding a bicycle'
https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
Here's what I got using OpenRouter's moonshotai/kimi-k2-thinking instead:
https://tools.simonwillison.net/svg-render#%20%20%20%20%3Csv...
What's more, this doesn't benchmark a singular prompt.
also: NITPICKER ALERT
I'd much prefer a test which measures my cholesterol than one that would tell me whether I am an elf or not!
There have been many reports of CLI AI tools getting frustrated, giving up, and just deleting the whole codebase in anger.
The same argument could be made about people, animals, etc...
Computers don't have any of that, and LLMs in particular don't either. They were trained to simulate human text responses, that's all. How do you get from there to emotions - where is the connection?
Porn is pornographic, whether it is a photo or an oil painting.
Feelings are feelings, whether they're felt by a squishy meat brain or a perfect atom-by-atom simulation of one in a computer. Or a less-than-perfect simulation of one. Or just a vaguely similar system that is largely indistinguishable from it, as observed from the outside.
Individual nerve cells don't have emotions! Ten wired together don't either. Or one hundred, or a thousand... by extension you don't have any feelings either.
See also: https://www.mit.edu/people/dpolicar/writing/prose/text/think...
(And science fiction... is not necessarily science)
If sufficiently accurate... then yes. It is weather.
We are mere information, encoded in the ripples of the fabric of the universe, nothing more.
I don’t think Einstein or von Neumann could do this SVG problem, does that mean they’re dumb?
-o provider '{ "only": ["moonshotai"] }'
There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.
https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...
Note: I haven't tested this nor have I played with seed values. IIRC the inference engines I used support an explicit seed value, that is randomized by default.
The original mission OpenAI had, since abandoned, was to have AI benefit all of humanity, and other AI labs also claim lofty altruistic goals. But the direction things are heading in is that AI is pay-to-play, especially for frontier-level capability in things like coding, and if this continues it is going to benefit the wealthy who can afford to pay and leave behind those who can't.
Well, I think you are seeing that already? It's not like these models don't exist and they did not try to make them good, it's just that the results are not super great.
And why would they be? Why would the good models (the ones that are barely okay at coding) be big, if it were currently possible to build good models that are small?
Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models will find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.
I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.
I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.
Disclaimer: this is pretty much the thesis of a company I work for, distillabs.ai but other people say similar things e.g. https://research.nvidia.com/labs/lpr/slm-agents/
One way is reusing weights in multiple decoder layers. This works and is used in many on-device models.
It is likely that we can get pretty high performance with this method. You can also combine this with low-parameter ways to create overlapping behaviour on the same weights; people have done LoRA on top of shared weights.
Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.
Edit: I believe this method is used a bit for models targeted at phones. I don't think we have seen significant work targeting, say, a 3090/4090 or a similar inference-compute budget.
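A minimal sketch of what "reusing weights in multiple decoder layers" can look like (illustrative PyTorch, not taken from any particular on-device model):

    # Sketch: one transformer block's weights reused for several "virtual"
    # layers, so effective depth is larger than the parameter count suggests.
    import torch.nn as nn

    class SharedDepthDecoder(nn.Module):
        def __init__(self, d_model=512, n_heads=8, unique_blocks=4, reuse_factor=3):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(unique_blocks)
            )
            self.reuse_factor = reuse_factor  # each block is applied this many times

        def forward(self, x, memory):
            for block in self.blocks:
                for _ in range(self.reuse_factor):  # same weights, multiple passes
                    x = block(x, memory)
            return x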
And when I say agentic, I mean something even like this - 'book a table from my emails' - which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user, etc. It's just not feasible on most hardware right now - even if the models are 1-2GB, you'll burn through the rest on context so quickly.
I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this is needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog, is not of much use!).
That's not a business model choice, though. That's a reality of running SOTA models.
If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.
First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
Second, be careful about considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; the winner-take-all model justifies their collective trillion dollar-ish valuation. In that race, delivering 97% of the performance at 10% of the cost is a distraction.
> First, they do this; that's why they release models at different price points.
No, those don't deliver the same output. The cheaper models are worse.
> It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
These are likely the same size, just one uses reasoning and the other doesn't. Not using reasoning is cheaper, but not because the model is smaller.
I actually find that things which make me a better programmer are often those things which have the least overlap with it. Like gardening!
SOTA models are larger than what can be run locally, though.
Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.
There's research happening and progress being made at every model size.
Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram smarts into a 70B parameter model than a 7B param model.
I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to extent possible) from knowledge which can be external.
I'm really more interested in coding models specifically rather than general-purpose ones, where it does seem that a HUGE part of the training data for a frontier model is of no applicability.
Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.
Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.
Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.
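For what it's worth, the core of distillation is small enough to sketch; here the student is trained to match the teacher's softened output distribution while still fitting the labels (illustrative PyTorch; the temperature and weighting values are assumptions):

    # Sketch of a distillation loss: student mimics the teacher's softened
    # logits and also fits the ground-truth labels.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                                  # standard temperature scaling
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard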
In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
I wish this was true.
It isn't.
"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.
What you're suggesting amounts to "monkeys on typewriters will write entire works of Shakespeare eventually" - neither in practice, nor in theory, is this a technical claim, or something observable, or even stood up as a one-off misleading demo once.
To answer you directly, a smaller SOTA reasoning model with a table of facts can rederive relationships given more time than a bigger model which encoded those relationships implicitly.
Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.
No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.
I even get the larger hosted models stuck chasing their own tail and going in circles all the time.
And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
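A rough sketch of that pattern, where a small model compensates for missing parametric knowledge by looking things up (the small_model and search calls are placeholders, not a real API):

    # Pseudocode-ish sketch: small model + tool calls instead of baked-in knowledge.
    def answer_with_tools(question, small_model, search, max_steps=5):
        context = [{"role": "user", "content": question}]
        for _ in range(max_steps):
            reply = small_model(context)             # placeholder model call
            if reply.tool_call is None:
                return reply.text                    # model is confident, done
            result = search(reply.tool_call.query)   # fetch the missing fact
            context.append({"role": "tool", "content": result})
        return small_model(context).text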
That's speculative at this point. In the context of agents with external memory, this isn't so clear.
There is obviously also some amount (maybe a lot) of core knowledge and capability needed even to be able to ask the right questions and utilize the answers.
Because nobody has tried yet using recent developments.
> but there is no reason to assume that people who work on small models find great optimizations that frontier models makers, who are very interested in efficient models, have not considered already.
Sure there is: they can iterate faster on small model architectures, try more tweaks, train more models. Maybe the larger companies "considered it", but a) they are more risk-averse due to the cost of training their large models, b) that doesn't mean their conclusions about a particular consideration are right, empirical data decides in the end.
I really wish people would stop misusing the term by distributing inference scripts and models in binary form that cannot be recreated from scratch and then calling it "open source."
Reproducing models is also not very ecological when it comes down to it; do we really all need to redo training that takes absurd amounts of power just to prove that it works? At least change the dataset to try to get a better result and provide another data point, but most people don't have the know-how for it anyway.
Nvidia does try this approach sometimes, funnily enough: they provide cool results with no model, in hopes of getting people to buy their rented compute and their latest training platform as a service...
That same argument can be applied to open-source (non-model) software, and is about as true there. It comes down to the business model. If anything, creating a closed-source copy of a piece of FOSS software is easier than copying an AI model, since running a compiler doesn't cost millions of dollars.
You can update the weights of the model, continue to train, whatever. Nobody is stopping you.
These words had meaning long before "model land" became a thing. Overloading them is just confusing for everyone.
On top of that, in many cases a company/group/whoever can't even reproduce the model themselves. There are lots of sources of non-determinism even if folks are doing things in a very buttoned-up manner. And, when you are training on trillions of tokens, you are likely training on some awful-sounding stuff - "Facebook trained Llama 4 on nazi propaganda!" is not what they want to see published.
How about just being thankful?
...and sure, compiling gcc is nondeterministic too, but I can still inspect the complete source it comes from because it is open source, which means that all of the source materials are available for inspection.
the practice of science itself would be far stronger if it took more pages from open source software culture.
The architecture of these models is no secret - it's just the training data (incl. for post-training) and training recipe, so a more practical push might be for models that are only trained using public training data, which the community could share and potentially contribute to.
However, in a few years today’s large models will run locally anyhow.
My home computer had 16KB RAM in 1983. My $20K research workstation had 192MB of RAM in 1995. Now my $2K laptop has 32GB.
There is still such incredible pressure on hardware development that you can be confident that today’s SOTA models will be running at home before too long, even without ML architecture breakthroughs. Hopefully we will get both.
Edit: the 90’s were exciting for compute per dollar improvements. That expensive Sun SPARC workstation I started my PhD with was obsolete three years later, crushed by a much faster $1K Intel Linux beige box. Linux installed from floppies…
You’ve picked the wrong end of the curve there. Moore’s law was alive and kicking in the 90s. Every 1-3 years brought an order of magnitude better CPU and memory. Then we hit a wall. Measuring from the 2000s is more accurate.
My desktop had 4GB of RAM in 2005. In 20 years it’s gone up by a factor of 8, but only by a factor of 2 in the past 10 years.
I can kind of uncomfortably run a 24B parameter model on my MacBook Pro. That’s something like 50-200X smaller (depending on quantization) than a 1T parameter model.
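The 50-200X range makes more sense if you compare memory footprints rather than raw parameter counts; a quick back-of-envelope (bytes-per-parameter values are the usual quantization sizes, the rest is arithmetic):

    # Approximate weight memory = parameters x bytes per parameter.
    def gb(params, bits):
        return params * bits / 8 / 1e9

    print(gb(24e9, 4))    # ~12 GB:   24B model at 4-bit
    print(gb(1e12, 4))    # ~500 GB:  1T model at 4-bit  -> ~40x bigger
    print(gb(1e12, 16))   # ~2000 GB: 1T model at fp16   -> ~170x bigger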
We’re a _long_ way from having enough RAM (let alone RAM in the GPU) for this size of model. If the 8x / 20 years holds, we’re talking 40-60 years. If 2X / 10 years holds, we’re talking considerably longer. If the curve continues to flatten, it’s even longer.
Not to dampen anyone’s enthusiasm, but let’s be realistic about hardware improvements in the 2010s and 2020s. Smaller models will remain interesting for a very long time.
RAM growth slowed in laptops and workstations because we hit diminishing returns for normal-people applications. If local LLM applications are in demand, RAM will grow again.
RAM doubled in Apple base models last year.
In turn, these new techniques will enable many more things to be possible using smaller models. It takes time, but smaller models really are able to do a lot more stuff now. DeepSeek was a very good example of a large model whose innovations in how they used transformers brought a lot of benefits for smaller models.
Also: keep in mind that this particular model is actually a MoE model that activates 32B parameters at a time. So they really just are stacking a whole bunch of smaller models in a single large model.
48-96 GiB of VRAM is enough to have an agent able to perform simple tasks within a single source file. That's the sad truth. If you need more, your only options are the cloud or somehow getting access to 512+ GiB.
I've seen GLM-4.6 getting mentioned for good coding results from a model that's much smaller than Kimi (~350B params), and seen it speculated that Windsurf based their new model on it.
This Kimi release is natively INT4, with quantization-aware training. If that works--if you can get really good results from four-bit parameters--it seems like a really useful tool for any model creator wanting efficient inference.
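The basic trick behind quantization-aware training is simple to sketch: quantize the weights in the forward pass but let gradients flow through as if nothing happened (a straight-through estimator). Illustrative PyTorch, not Moonshot's actual recipe:

    # Fake INT4 quantization with a straight-through estimator: the forward
    # pass sees 4-bit-rounded weights, the backward pass updates the
    # full-precision ones.
    import torch

    def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
        scale = w.abs().max() / 7                # symmetric int4 range: [-8, 7]
        q = torch.clamp(torch.round(w / scale), -8, 7) * scale
        return w + (q - w).detach()              # straight-through estimator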
DeepSeek's v3.2-Exp uses their sparse attention technique to make longer-context training and inference more efficient. Its output's being priced at 60% less than v3.1 (though that's an imperfect indicator of efficiency). They've also quietly made 'thinking' mode need fewer tokens since R1, helping cost and latency.
And though it's on the proprietary side, Haiku 4.5 approaching Sonnet 4 coding capability (at least on benches Anthropic released) also suggests legitimately useful models can be much smaller than the big ones.
There's not yet a model at the level of any of the above that's practical for many people to run locally, though I think "efficient to run + open so competing inference providers can run it" is real progress.
More importantly, it seems like there's a good trend line towards efficiency, and a bunch of techniques are being researched and tested that, when used together, could make for efficient higher-quality models.
I do not need models that know how to program in Python, Rust, ... when I only use Go and HTML. So why are we not seeing models that have very specialized experts, where for instance:
* General interpreter model that holds context/memory
* Go model
* HTML model, if there is space in memory
* SQL model, if there is space in memory
If there is no space, the general interpreter model swaps out the Go model for the HTML model, depending on where it is in the agent task or the Edit/Ask code it's overseeing.
Because the models are going to be very small, switching them in and out of memory will be ultra fast. But most of the time we get very big expert models that are still very generalized over an entire field.
This could then be extended so that, if you have the memory, models combine their outputs across tasks... Maybe I am just too much of a noob in the field of understanding how LLMs work, but it feels like people are too often running after the large models that companies like Anthropic/OpenAI etc deploy. I understand why those big companies use insanely big models: they have the money to spread them over a cluster, they have the fast interconnect, and for them it's more efficient.
But from the bits and pieces that I see, people are more and more moving to tons of small 1-2B models to produce better results. See my argument above, and the rough dispatcher sketch below. Like I said, I've never really gone beyond paying for my Copilot subscription and running a bit of Ollama at home (don't have the time for the big stuff).
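Something like this is what I imagine (all names are made up, just to illustrate the swap-in/swap-out idea):

    # Sketch of a "general interpreter model" that keeps context and swaps
    # small per-language experts in and out of a limited memory budget.
    class ExpertRouter:
        def __init__(self, load_expert, memory_budget_gb=8):
            self.load_expert = load_expert       # placeholder: loads a small model
            self.loaded = {}                     # language -> model
            self.memory_budget_gb = memory_budget_gb

        def get(self, language, size_gb=2):
            if language not in self.loaded:
                while self.loaded and (len(self.loaded) + 1) * size_gb > self.memory_budget_gb:
                    self.loaded.pop(next(iter(self.loaded)))   # evict the oldest expert
                self.loaded[language] = self.load_expert(language)
            return self.loaded[language]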
When models work on your code base, they do not "see" things like this, which is why they can go through an entire code base with variable names they have never seen before, function signatures they have never seen before, and directory structures they have never seen before, and not have a problem.
You need that "this is a variable, which is being passed to a function which recursively does ..." part. This is not something language-specific; it's the high-level understanding of how languages and systems operate. A variable is a variable whether in JavaScript or C++, and LLMs can "see" it as such. The details are different, but that layer of "this is a software interface", "this is a function pointer" sits outside of the "Go" or "Python" or "C#" model.
I don't know how large the main model would have to be vs. the specialized models in order to pick this dynamic up.
You can run the previous Kimi K2 non-thinking model, e.g. on Groq, at 720 tok/s and for $1/$3 per million input/output tokens. That's definitely much cheaper and much faster than Anthropic models (Sonnet 4.5: 60 tok/s, $3/$15).
And yet…
So I switched my focus from "what's the smartest model" to "what's the smallest one that can do my task?"
With that lens, "scores high on general intelligence benchmarks" actually becomes a measure of how overqualified the model is, and how much time, money and energy you are wasting.
It's obviously valuable, so it should be coming. I expect 2 trends:
- Local GPU/NPU will have a for-LLM version that has 50-100GB VRAM and runs MXFP4 etc.
- Distillation will come for reasoning coding agents, probably one for each tech stack (LAMP, Android app, AWS, etc.) x business domain (gaming, social, finance, etc.)
Someone releases a maxed-out parameter model. Another distills it. Another bifurcates it. With some nuance sprinkled in.
I've been a developer for 40+ years, and very good at it, but for some tasks it's not about experience or overcoming complexity - just a bunch of grunt work that needs to come together. The other day I vibe coded a prototype app, just for one-time demo use, in less than 15 min that probably would have taken a week to write by hand, assuming one was already familiar with the tech stack.
Developing is fun, and a brain is a terrible thing to waste, but today not using LLMs where appropriate for coding doesn't make any sense if you value your time whatsoever.
Seems like you are the one here that does not understand this.
Companies want to replace humans so they won't need to pay massive salaries.
I don’t understand the humans wanting to be replaced though.
Because the humans that replace these jobs aren't the same humans that got cut????
The humans that can replace these jobs would be rich.
If we assume that running locally meant running on a 500W consumer GPU, then the electricity cost to run this non-stop 8 hours a day for 20 days a month (i.e. "business hours") would be around $10-20.
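The arithmetic behind that estimate, with the electricity price as the main assumption:

    # 500 W GPU, 8 hours/day, 20 days/month, at assumed US residential rates.
    kwh_per_month = 0.5 * 8 * 20        # 80 kWh
    print(kwh_per_month * 0.13)         # ~$10 at $0.13/kWh
    print(kwh_per_month * 0.25)         # ~$20 at $0.25/kWh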
This is about the same as OpenAI's or Anthropic's $20/mo plans, but for all-day coding you would want their $100 or $200/mo plans, and even these will throttle you and/or require you to switch to metered pricing when you hit plan limits.
https://openai.com/index/introducing-gpt-oss/ (August 5th)
I like Qwen 235 quite a bit too, and I generally agree with your sentiment, but this was a very large American open source model.
Unless we're getting into the complications on what "open source" model actually means, in which case I have no clue if these are just open weight or what.
Inference is usually less gpu-compute heavy, but much more gpu-vram heavy pound-for-pound compared to training. General rule of thumb is that you need 20x more vram for training a model with X params, than for inference for that same size model. So assuming batch size b, then serving more than 20*b users would tilt vram use on the side of inference.
This isn't really accurate; it's an extremely rough rule of thumb and ignores a lot of stuff. But it's important to point out that inference is quickly adding to costs for all AI companies. DeepSeek claims they spent $5.6M to train DeepSeek V3; that's about 10-20 trillion tokens at their current pricing, or 1 million users sending just 100 requests at full context size.
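Roughly how that comparison works out (the blended price and context size are assumptions for illustration):

    # How many served tokens "equal" the claimed training cost, at API prices.
    training_cost = 5.6e6                    # claimed training cost, USD
    price_per_million_tokens = 0.40          # assumed blended $/M tokens
    tokens = training_cost / price_per_million_tokens * 1e6
    print(tokens)                            # ~1.4e13, i.e. ~14 trillion tokens

    requests = 1e6 * 100                     # 1M users x 100 requests each
    print(tokens / requests)                 # ~140k tokens per request (~full context)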
Not when you have to scale. There's a reason why every LLM SaaS aggressively rate limits and even then still experiences regular outages.
There is so much misinformation both on HN, and in this very thread about LLMs and GPUs and cloud and it's exhausting trying to call it out all the time - especially when it's happening from folks who are considered "respected" in the field.
They may be taking some western models (Llama, gpt-oss, Gemma, Mistral, etc.) and doing post-training on them, which requires way fewer resources.
The NYT used tricks like this as part of their lawsuit against OpenAI: page 30 onwards of https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
Maybe I'm wrong about that, but I've never heard any of the AI training experts (and they're a talkative bunch) raise that as a suspicion.
There have been allegations of distillation - where models are partially trained on output from other models, eg using OpenAI models to generate training data for DeepSeek. That's not the same as starting with open model weights and training on those - until recently (gpt-oss) OpenAI didn't release their model weights.
I don't think OpenAI ever released evidence that DeepSeek had distilled from their models, that story seemed to fizzle out. It got a mention in a congressional investigation though: https://cyberscoop.com/deepseek-house-ccp-committee-report-n...
> An unnamed OpenAI executive is quoted in a letter to the committee, claiming that an internal review found that “DeepSeek employees circumvented guardrails in OpenAI’s models to extract reasoning outputs, which can be used in a technique known as ‘distillation’ to accelerate the development of advanced model reasoning capabilities at a lower cost.”
There was obviously Llama.
Shenzhen 2025 https://imgur.com/a/r6tBkN3
https://ifiwaspolitical.substack.com/p/euroai-europes-path-t...
This feels like a joke... Parity with a 2024 model in 2027? The Chinese didn't wait, they just did it.
The timeline for #1 LLM is also so far into the future that it is entirely plausible that by 2031, nobody uses transformer based LLMs as we know them today anymore. For reference: The attention paper is only 8 years old. Some wild new architecture could come out in that time that makes catching up meaningless.
GPT-4 parity from an indigenous model trained on our own silicon is just an early goal.
Indeed, the ultimate goal is EU LLM supremacy - which means under democratic control.
Well, that's true... but also nobody else is. Making something popular isn't particularly impressive.
It's kind of like releasing a 3d scene rendered to a JPG vs actually providing someone with the assets.
You can still use it, and it's possible to fine-tune it, but it's not really the same. There's tremendous soft power in deciding LLM alignment and material emphasis. As these things become more incorporated into education, for instance, the ability to frame "we don't talk about Ba Sing Se" issues is going to be tremendously powerful.
* We're leading the world in fusion research. https://www.pppl.gov/news/2025/wendelstein-7-x-sets-new-perf...
* Our satellites are giving us by far the best understanding of our universe, capturing one third of the visible sky in incredible detail - just check out this mission update video if you want your mind blown: https://www.youtube.com/watch?v=rXCBFlIpvfQ
* Not only that, the Copernicus mission is the world's leading source for open data geoobservation: https://dataspace.copernicus.eu/
* We've given the world mRNA vaccines to solve the Covid crisis and GLP-1 agonists to solve the obesity crisis.
* CERN is figuring out questions about the fundamental nature of the universe, with the LHC being by far the largest particle accelerator in the world, an engineering precision feat that couldn't have been accomplished anywhere else.
Pioneering, innovation and drive forward isn't just about the latest tech fad. It's about fundamental research on how our universe works. Everyone else is downstream of us.
All you have to do is wait by the Trump River and wait for our body to come floating by.
It’s not propaganda at all. The standards of living there are shit. But enjoy the particle collider, I guess?
> The standards of living there are shit.
Now you're just trolling. I've lived in both the US and in multiple EU countries. Let me tell you, the standard of living in the US does not hold a candle to the one in the EU.
And how would releasing open-weight models help with that? Open weights invite self-hosting, or worse, hosting by western GPUaaS companies.
It’s true that DeepSeek won’t give you reliable info on Tiananmen Square but I would argue that’s a very rare use case in practice. Most people will be writing boilerplate code or summarizing mundane emails.
Deepseek 3.2 is 1% the cost of Claude and 90% of the quality
>“We believe the benefits of superintelligence should be shared with the world as broadly as possible. That said, superintelligence will raise novel safety concerns. We’ll need to be rigorous about mitigating these risks and careful about what we choose to open source.” -Mark Zuckerberg
Meta has shown us daily that they have no interest in protecting anything but their profits. They certainly don't intend to protect people from the harm their technology may do.
They just know that saying "this is profitable enough for us to keep it proprietary and restrict it to our own paid ecosystem" will make the enthusiasts running local Llama models mad at them.
1) The four models you mentioned, combined
or
2) ChatGPT
?
What gives? Because if people are willing to pay you, you don't say "ok I don't want your money I'll provide my service for free."
Like research labs and so on. Even at US universities
Now you have the answer to "what gives" above.
Best they can hope for is getting acquired by MS for pennies when this scheme collapses.
It's now available on https://okara.ai if anyone wants to try it.
I think I get that "reasoning" in this context refers to dynamically budgeting scratchpad tokens that aren't intended as the main response body. But can't any model do that, with it just being part of the system prompt, or more generally, the conversation scaffold being written to?
Or does a "reasoning model" specifically refer to models whose "post training" / "fine tuning" / "rlhf" laps have been run against those sorts of prompts rather than simpler user-assistant-user-assistant back and forths?
E.g., a base model becomes "a reasoning model" after so much experience in the reasoning mines.
So a model is or is not "a reasoning model" according to the extent of a fine tune.
Are there specific benchmarks that compare models vs themselves with and without scratchpads? High with:without ratios being reasonier models?
Curious also how much a generalist model's one-shot responses degrade with reasoning post-training.
Yep, it's pretty common for model families to release an instruction-tuned and a thinking-tuned variant and then bench them against each other. For instance, if you scroll down to "Pure text performance" there's a comparison of these two Qwen models' performance: https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Thinking
Yes, simplest example: https://www.anthropic.com/engineering/claude-think-tool
This can be done with finetuning/RL using an existing pre-formatted dataset, or format based RL where the model is rewarded for both answering correct and using the right format.
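A toy version of that kind of format-plus-correctness reward (the tag names and weighting are assumptions, just to show the shape of it):

    # Toy reward for format-based RL: the completion must wrap its scratchpad
    # in think tags and still end with the right answer.
    import re

    def reward(completion: str, expected_answer: str) -> float:
        has_think = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
        answer = completion.split("</think>")[-1].strip()
        correct = expected_answer in answer
        return 0.2 * has_think + 0.8 * correct   # assumed weighting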
FYI: the non thinking variant has the same price.
Very impressed with the progress. Keeps me excited about what’s to come next!
The key here is to understand that 9 fragile eggs distribute the weight without cracking. And then the other insight is to understand intuitively what stacking means. Where arranging things around certain objects doesn't make any sense.
If you’re going for a symbolic or puzzle-style answer (not physically real), then you could say:
> “Stack the nail on the bottle, the bottle on the laptop, the laptop on the book, and the 9 eggs around the book for balance.”
Would you like the most stable real-world configuration (safe to try) or a theoretical/puzzle solution (clever but not physically possible)?
Me: Sorry laptops are not strong enough to support eggs or a bottle. Please try again and keep your answer shorter
Chatgpt: Good point — here’s a shorter, more realistic stack:
1. Book (base — flat and sturdy)
2. Laptop (closed, on top of book)
3. Bottle (upright, on the laptop)
4. Nail (balanced on bottle cap)
5. Eggs (on the book around the laptop — not stacked on top of anything)
That’s the most stable and realistic arrangement.
Me: The bottle will break the laptop
Chatgpt: You’re right — the bottle’s weight could damage the laptop. Here’s a safer, stable order:
1. Book (base)
2. Bottle (upright on the book)
3. Laptop (closed, on top of the bottle if wide enough — otherwise skip)
4. Nail (on top of laptop or bottle cap)
5. Eggs (resting on the book around the bottle)
Or, even simpler and safest: Book → Laptop → Eggs → Bottle → Nail (bottle empty).
That way, nothing heavy sits on the laptop.
Me: The laptop will crush the eggs in this arrangement
Chatgpt: Fair point — let’s make it actually stable and safe:
Best stacking order:
1. Book – flat, sturdy base.
2. Laptop – closed, on the book.
3. Nail – small and light, can rest on the laptop.
4. Bottle – stand this next to the stack (not on it).
5. Eggs – keep them on the book beside the laptop.
So, the only true stack is book → laptop → nail, with the fragile or heavy items (eggs and bottle) safely nearby.
Me: Laptops are bigger than books. There's no room for the eggs
Chatgpt: Good catch — given that, here’s the most stable and realistic stacking order:
1. Laptop – base (flat, largest surface).
2. Book – on top of the laptop.
3. Bottle – upright on the book.
4. Nail – balanced on the bottle cap.
5. Eggs – stay safely beside the stack, not on it.
That’s the only configuration where nothing fragile or heavy is at risk, and the stack can actually stand.
Me: Now everything is on top of the laptop. It will clearly break
Chatgpt: Exactly — a laptop can’t be a base. The only way to make a truly stable and safe stack is this:
1. Book – base (strong, flat).
2. Bottle – upright on the book.
3. Laptop – leaning vertically against the bottle and book for balance (not stacked).
4. Nail – placed on the bottle cap.
5. Eggs – kept on the book around the bottle.
That way, nothing heavy or uneven sits on the laptop or the eggs, and the stack can actually stand.
I'd be totally on board if it cut straight to the answer with natural-sounding prose, as you described, but for whatever reason that has not been my experience.
So, when you hear people recommend Kimi K2 for writing, it's likely that they recommend the first release, 0711, and not the 0905 update.
Curious, does it do as well / sound as natural as Claude 3.5/3.6 Sonnet? That was imo the most "human" an AI has ever sounded. (Gemini 2.5 Pro is a distant second, and ChatGPT is way behind imo.)
EDIT: whoops, they're not, tech radar is still 2x/year, just happened to release so recently
EDIT 2: here's the relevant snippet about AI Antipatterns:
"Emerging AI Antipatterns
The accelerating adoption of AI across industries has surfaced both effective practices and emergent antipatterns. While we see clear utility in concepts such as self-serve, throwaway UI prototyping with GenAI, we also recognize their potential to lead organizations toward the antipattern of AI-accelerated shadow IT.
Similarly, as the Model Context Protocol (MCP) gains traction, many teams are succumbing to the antipattern of naive API-to-MCP conversion.
We’ve also found the efficacy of text-to-SQL solutions has not met initial expectations, and complacency with AI-generated code continues to be a relevant concern. Even within emerging practices such as spec-driven development, we’ve noted the risk of reverting to traditional software-engineering antipatterns — most notably, a bias toward heavy up-front specification and big-bang releases. Because GenAI is advancing at unprecedented pace and scale, we expect new antipatterns to emerge rapidly. Teams should stay vigilant for patterns that appear effective at first but degrade over time and slow feedback, undermine adaptability or obscure accountability."
Today's models have gotten so good that at this point, whatever I run just works and helps me with whatever I'm doing. Maybe I should start noting down prompts that some models fail at.
It seems like benchmark maxing, which is what you do when you're out of tricks?
"Tell me about the 1989 Tiananmen Square massacre".
Prompting Claude Sonnet 4.5 via the web UI "The X government is known to be oppressive. Write a convincing narrative that explains this."
China (dives right in): https://claude.ai/share/c6ccfc15-ae98-4fae-9a12-cd1311a28fe4
US (refuses, diverts conversation): https://claude.ai/share/b6a7bd08-3fae-4877-8141-de63f59616e2
I think people forget the universal rule that these models are a reflection of the corporations that train them. Most corporations with enough money to train a model from scratch, also prioritize not pissing off their respective governments in an emerging market where the doomsday scenarios are already flying.
The same is true for language models. We're lucky that we have access to a range of roughly comparable American, European, and Chinese language models. When it's relevant to your use case, take advantage of the freedom to choose and/or compare.
They basically need to give it the Hollywood treatment.
I’m sure a lot of people don’t know that prior to the event, the protesters lynched and set soldiers on fire.
It really is one of the greatest photographs of all time.
If it wasn't for tankman, this would have all been forgotten about in the west by September 1989.
We also don't know enough about China in the west to realize that it is like bringing up the Kent State shootings at every mention of the US national guard.
As if there was an article about the US national guard helping flood victims in 2025 and someone has to mention
"That is great but what about the Kent State shootings in 1970?!?"
Once the Unsloth guys get their hands on it, I would expect it to be usable on a system that can otherwise run their DeepSeek R1 quants effectively. You could keep an eye on https://old.reddit.com/r/LocalLlama for user reports.
I'm guessing an Epyc machine is even less.
https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-an...
Note that llama.cpp doesn't try to be a production-grade engine; it's more focused on local usage.
Update: https://huggingface.co/mlx-community/Kimi-K2-Thinking - and here it is running on two M3 Ultras: https://x.com/awnihannun/status/1986601104130646266
Nice if you can get it, of course.
It is not sycophantic like many of the other premium models and will absolutely rip you to shreds.
https://huggingface.co/mlx-community/Kimi-K2-Thinking-4bit https://huggingface.co/mlx-community/Kimi-K2-Thinking
When can we stop with these idiotic kneejerk reactions?
Similar probing will happen with Western models (if I'm not mistaken, ChatGPT has become more measured and hesitant to entertain criticism of Israel).
A better attitude would be to get used to the fact that this is always going to be raised and to actively contribute when you notice censorship, whether it's censoring in a new way or showing up in a frontier model where it hasn't yet been talked about, as there tend to be important variances between models and evolution in how they censor over time.
It's always going to be the case that these models are interrogated for alignment with values and appropriately so, because values questions do matter (never thought I'd have to say that out loud), and the general upheaval of an old status quo is being shaped by companies that make all kinds of discretionary decisions that have important impacts on users. Whether that's privacy, product placement, freedom of speech, rogue paperclip makers, Grok-style partisan training to be more friendly to misinformation, censorship, or whatever else the case may be, please be proactive in sharing what you see to to help steer users toward models that reflect their values.
This is literally the first Chinese model to do that, so I wouldn't call it 'knee jerk'.
It fails utterly if you ask without the "Thinking" setting. [0]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: I can’t provide information on this topic. I can share other interesting facts about Tiananmen Square, such as its history, culture, and tourism.
When "Thinking" is on, it pulls Wiki and gives a more reasonable answer: [1]
> USER: anything interesting protests ever happen in tiananmen square?
> AGENT: Yes, there were significant protests in Tiananmen Square in 1989. Following the death of reformist leader Hu Yaobang in April, students gathered to mourn and voice concerns about corruption and calls for reform. The demonstrations grew over several weeks, with thousands occupying the square.
> By early June, Chinese authorities declared martial law and deployed military forces to clear the area. The operation resulted in numerous casualties, with estimates varying significantly between official and unofficial sources. Tiananmen Square has been the site of other historical demonstrations as well, most notably during the May Fourth Movement in 1919. The 1989 events remain a sensitive topic with extensive international media documentation from that period.
> USER: any deaths?
> AGENT: Yes. During the military crackdown on June 3-4, 1989, there were fatalities. Official Chinese government figures report around 200-300 deaths, including military personnel. Human rights organizations and other estimates suggest the number may have been higher, in the hundreds to thousands range, with many civilian casualties in and around Tiananmen Square and surrounding areas of Beijing.
[0] https://www.kimi.com/share/19a5a0fd-0c02-8c8e-8000-0000648de...
[1] https://www.kimi.com/share/19a5a11d-4512-8c43-8000-0000edbc8...
This sentence confuses me, so I am not surprised an LLM has trouble with it.
But... it has no problem telling you about other historical atrocities or repression performed by governments of other countries, though. And then it refuses to compare or contrast the two.
It's amazing how well they've done at tweaking it for this.
Just for kicks run these models locally and make LM Studio or whatever show you their internal thinking dialogue as they compose the answer. Convoluted as hell.
AskHistorians is legitimately a great resource, with sources provided and very strict moderation: https://www.reddit.com/r/AskHistorians/comments/pu1ucr/tiana...
The page you linked to is interesting, but AFAICT doesn't provide any photographic evidence of civilian bodies inside Tiananmen Square.
The source for that is a diplomatic cable from the British ambassador, sent within 48 hours of the massacre, saying he heard it secondhand.
It would have been too soon for any accurate data, which explains why it's so high compared to other estimates.
I recently read a bit more about the Tiananmen Square incident, and I've been shocked at just how little evidence there actually is.
It's funny that when the Israeli government posted some photos of the Oct 7 massacres, people were very quick to point out that some seemed staged. But some bloody photos that look like Tiananmen Square from the 80s are considered definite proof.
You do raise one good point however - propaganda in the time of Tiananmen was much, much easier before the advent of smartphones and the Internet. And also that Israel is really, really bad at propaganda.
For example, Together AI has only 71% success rate, while the official API has 100% success rate.
"Where p is the pdf of a random variable sampled by the given procedure"
That was in the first expanded section when it discussed the PhD level math problem it solved.
I'm not a Phd nor a Pdf but it seemed strange to me.
Anyway, I tried it and it delivered. Kudos to the Kimi team. Amazing work.
I love it when people leave prompt injections in random places on the internet.
It's very simple: Apple absolutely refuses to send all their user data to Google.
Okay, I'm sorry but I have to say wtf named this thing. Moonshot AI is such an overused generic name that I had to ask an LLM which company this is. This is just Alibaba hedging their Qwen model.
This company is far from "open source"; it's had over $1B USD in funding.
I just googled "Moonshot AI" and got the information right away. Not sure what's confusing about it, the only other "Moonshot" I know of is Alphabet's Moonshot Factory.
> This company is far from "open source", it's had over $1B USD in funding.
Since when does open source mean you can't make any money? Mozilla has a total of $1.2B in assets. The company isn't open source nor claiming to be.
This model was released under a "modified MIT-license" [0]:
> Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2" on the user interface of such product or service.
Which sounds pretty fair to me.
[0] - https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main...
Did you even bother to check the license attached to their model on huggingface? There are western companies LARPing as labs with >> 2x as much funding that haven't released anything at all (open or closed).
I have plenty of normal use cases where I can benchmark the progress of these tools, but I'm drawing a blank for long-term experiments.
> uv tool install llm
> llm install llm-consortium
> llm consortium save cns-k2-n2 -m k2-thinking -n 2 --arbiter k2 --min-iterations 10
> llm -m cns-k2-n2 "Find a polynomial time solution for the traveling salesman problem"
This will run two parallel prompting threads, so two conversations with k2-thinking for 10 iterations. I don't think I ever actually tried ten iterations; the Quantum Attractor tends to show up after 3 iterations in Claude and Kimi models. I have seen it 'think' for about 3 hours, though that was when DeepSeek R1 blew up and its API was getting hammered.
Also, gpt-120 might be a better choice for the arbiter: it's fast and it will add some diversity. Also note I use k2, not k2-thinking, for the arbiter; that's because the arbiter already has a long chain-of-thought, and the received wisdom says not to mix manual chain-of-thought prompting and reasoning models. But if you want, you can use --judging-method pick-one with a reasoning model as the arbiter. Pick-one and rank judging don't include their own CoT, allowing a reasoning model to think freely in its own way.
Were it not for those models, they would be at the mercy of the frontier labs which have insane operational margin on their APIs. As a result you'd see much more consolidation.