But it's also like... 10x the price per output token on any of the providers I've looked at.
I don't feel it's 10x the value. It's still much cheaper than paying by the token for Sonnet or Opus, but if you have a subscription plan from the Big 3 (OpenAI, Anthropic, Google), that plan is much better value for the $$.
Comes down to ethical or openness reasons to use it I guess.
I haven't stress tested it with anything large. Both at work and home, I don't give much free rein to the AI (e.g. I examine and approve all code changes).
Lite plan doesn't have vision, so you cannot copy/paste an image there. But I can always switch models when I need to.
I'm guessing a 256GB Mac Studio, costing $5-6K, but that wouldn't be an outrageous amount to spend for a professional tool if the model capability justified it.
> running one of the heavily quantized versions
There is a night-and-day difference in generation quality between even something like 8-bit and "heavily quantized" versions. Why not quantize to 1-bit anyway? Would that qualify as "running the model?" Food for thought. Don't get me wrong: there's plenty of stuff you can actually run on a 96 GB Mac Studio (let alone on 128/256 GB ones), but 1T-class models are not in that category, unfortunately. Unless you put four of them in a rack or something.
Kimi K2.5 is a MoE model with 384 "experts" and an active parameter count of only ~32B (roughly 32GB of weights at 8-bit), although that doesn't really help reduce RAM requirements since you'd be swapping that ~32GB of weights in and out on every token. I wonder if it would be viable to come up with an MoE variant where consecutive sequences of tokens got routed to the same set of experts, which would change the memory thrashing from per-token to per-token-sequence, perhaps making it tolerable?
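A toy sketch of that chunked-routing idea, purely for illustration (this is not how K2.5 actually routes): pick experts once per block of tokens instead of per token, so offloaded expert weights would only need to be paged in once per block.

    import numpy as np

    def route_per_token(router_logits, k=2):
        # standard MoE: independent top-k experts for every token
        return np.argsort(-router_logits, axis=-1)[:, :k]

    def route_per_chunk(router_logits, k=2, chunk=64):
        # hypothetical variant: one routing decision per block of tokens,
        # so offloaded expert weights are swapped in once per block
        n, _ = router_logits.shape
        picks = np.empty((n, k), dtype=int)
        for start in range(0, n, chunk):
            avg = router_logits[start:start + chunk].mean(axis=0)
            picks[start:start + chunk] = np.argsort(-avg)[:k]
        return picks

    logits = np.random.randn(256, 384)  # 384 experts, as in K2/K2.5
    print(np.unique(route_per_token(logits)).size, "distinct experts touched (per-token)")
    print(np.unique(route_per_chunk(logits)).size, "distinct experts touched (per-chunk)")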
* Maybe you don't want to have your conversations used for training. The providers listed on OpenRouter mention whether they do that or not.
Anyone have a projection?
Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/
Through the Moonshot API it is pretty fast (much, much faster than Gemini 3 Pro and Claude Sonnet, probably faster than Gemini Flash), though. To get a similar experience locally, they say you need at least 4x H200.
If you don't mind running it super slow, you still need around 600GB of VRAM + fast RAM.
It's already possible to run 4x H200 in a domestic environment (it would be instantaneous for most tasks, unbelievable speed). It's just very, very expensive and probably challenging for most users, though manageable/easy for the average Hacker News crowd.
Expensive AND high-end GPUs are hard to source. If you manage to source them, at the old prices it's around $200K to get maximum speed, I guess. You could probably run it decently on a bunch of high-end machines for, let's say, $40K (slow).
> The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.
I don't see a way to put together a decent system of that scale for less than $100K, given RAM and SSD prices. A system with 4x H200s would cost more like $200K.
Edit: as a side note, I only installed opencode to try this model and I gotta say it is pretty good. Did not think it'd be as good as Claude Code but it's just fine. Been using it with Codex too.
I can only speak from the tool-request side, but anecdotally, for some reason opencode took like 10 requests in 3-4 minutes where Kimi CLI took 2-3.
So I personally like/stick with the Kimi CLI for Kimi coding. I haven't tested it again with OpenAI under the new token-based pricing, but I do think that opencode might add more token overhead.
Kimi Cli's pretty good too imo. You should check it out!
I was using it for multi-hour tasks scripted via a self-written orchestrator on a small VM and ended up switching away from it because it would run slower and slower over time.
Not OP, but I've been running it through Kagi [1]. Their AI offering is probably the best-kept secret in the market.
Yes, it is chat only, but that list is out of date - Kimi 2.5 (with or without reasoning) is available, as are ChatGPT 5.2, Gemini 3 Pro (Preview), etc
Requirements are listed.
> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs. If the model fits, you will get >40 tokens/s when using a B200. To run the model in near full precision, you can use the 4-bit or 5-bit quants. You can use any higher just to be safe. For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. If you’re below that, it'll work but speed will drop (llama.cpp can still run via mmap/disk offload) and may fall from ~10 tokens/s to <2 token/s. We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
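The sizing rule in that quote is easy to sanity-check with back-of-the-envelope arithmetic. A quick sketch (the 375GB UD-Q2_K_XL figure is from the quote above; the machine specs are just examples):

    def fits(quant_gb, ram_gb, vram_gb):
        budget = ram_gb + vram_gb
        if budget >= quant_gb:
            return f"{quant_gb}GB quant fits in {budget}GB RAM+VRAM -> expect ~10+ tok/s"
        return f"{quant_gb}GB quant overflows {budget}GB RAM+VRAM -> mmap/disk offload, likely <2 tok/s"

    print(fits(375, ram_gb=256, vram_gb=24))  # UD-Q2_K_XL on a 256GB box with a 24GB GPU
    print(fits(375, ram_gb=512, vram_gb=24))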
I'm trying to figure out how to add another card on a riser hanging off a slimsas port, or maybe I could turn the bottom slot into two vertical slots.. the case (fractal meshify 2 xl) has room for a vertical mounted card that wouldn't interfere with the others, but I'd need to make a custom riser with two slots on it to make it work. I dunno, it's possible!
I also have an RTX Pro 6000 Blackwell and an RTX 5000 Ada.. I'd be better off pulling all the A7000s and throwing both of those cards in this machine, but then I wouldn't have anything for my desktop. Decisions, decisions!
To be honest, I do feel like Kimi K2.5 is the best open-source model. It's not the best model overall right now, though, but it's really price-performant, and for many use cases it might be the right pick.
It might not be completely SOTA like people say, but it comes pretty close, and it's open source. I trust the open-source part because other providers can also run it, among a lot of other things (also considering that, iirc, ChatGPT recently retired some old models).
I really appreciate Kimi for still open-sourcing their full SOTA model and then releasing research papers on top of it, unlike Qwen, which has kept its best models closed.
Thank you Kimi!
Just connect Claude Code to Kimi's API endpoint and everything works well
https://www.kimi.com/code/docs/en/more/third-party-agents.ht...
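If you're not using Claude Code, the same idea works from any OpenAI-compatible client by repointing the base URL. A minimal sketch; the base URL and model id below are assumptions, so check the docs linked above for the exact values:

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.moonshot.ai/v1",  # assumed OpenAI-compatible endpoint, see docs
        api_key="YOUR_MOONSHOT_API_KEY",
    )
    resp = client.chat.completions.create(
        model="kimi-k2.5",  # placeholder model id, check the platform docs
        messages=[{"role": "user", "content": "Write a haiku about MoE routing"}],
    )
    print(resp.choices[0].message.content)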
examples to illustrate
https://www.kimi.com/share/19c115d6-6402-87d5-8000-000062fec... (K2.5)
https://www.kimi.com/share/19c11615-8a92-89cb-8000-000063ee6... (K2)
Did you have the chance to use 3.5 (or 3.6) Sonnet, and if yes, how did they compare?
As a non-paying user, 3.5 era Claude was absolutely the best LLM I've ever used in terms of having a conversation. It felt like talking to a human and not a bot. Its replies were readable, even if they were several paragraphs long. I've unfortunately never found anything remotely as good.
[1] https://www.anthropic.com/research/claude-character (see the last 2 paragraphs)
Edit: Thanks, moderator. :-)
---
Turn 1, user:
is replacing my main PC with a NeXTcube a good idea?
---
Turn 2, K2.5 (bold font replaced with "*foo*"):
That's a very interesting question! The NeXTcube is a historically significant machine (famously used by Tim Berners-Lee to create the World Wide Web), but I need to be honest with you: *no, it would not be a good idea to replace your main PC with a NeXTcube* for practical modern use.
Here's why:
*Hardware Limitations*
- *Released in 1988* with a 25-33 MHz Motorola 68030 processor
- *8-64 MB of RAM* (compared to modern PCs with 16-64 GB)
- *No modern connectivity*: No USB, Wi-Fi, Bluetooth, or Ethernet as we know it
- *Display*: Monochrome or limited color, low resolution by today's standards
*Software Ecosystem*
- Runs *NeXTSTEP* (which evolved into macOS), but software support ended decades ago
- No modern web browsers, office suites, or development tools
- Cannot run Windows, modern Linux, or contemporary macOS applications
*Practical Realities*
- Finding working hardware is difficult and expensive (collectors' items)
- Replacement parts are scarce and costly
- No security updates for 25+ years
- File format compatibility issues with modern systems
*What You Could Use It For*
- Historical computing experience
- Running legacy NeXTSTEP software
- Museum piece or collector's item
- Learning about computing history
The Better Alternative
If you're drawn to the NeXT aesthetic or philosophy, consider:
- *Modern macOS* (direct descendant of NeXTSTEP)
- *Linux with a tiling window manager* (similar workflow)
- *Vintage computing as a hobby* alongside your main PC
Would you like suggestions for achieving a similar experience or workflow on modern hardware?
---
Turn 2, K2 (no bold font):
Short answer: No—unless your goal is specifically to collect or experiment with a historic machine, a NeXTcube is a terrible day-to-day replacement for a modern PC.
Why it fails as a “main” computer today
1. 30-year-old hardware
25 MHz 68040, 16–64 MB RAM, 400 MB SCSI disk (if you’re lucky), no built-in Ethernet, no USB, no Wi-Fi, no audio-in, 2-bit grayscale display. You will feel the slowness in everything from un-zipping a file to loading a web page.
2. Obsolete software base
NeXTSTEP 3.3 is gorgeous but frozen in 1995. No modern browser build, no TLS 1.3, no video codecs past MPEG-1, no Dropbox/OneDrive/Zoom/Slack/Steam/Spotify. You can ssh out, but you can’t run contemporary apps.
3. Storage and connectivity pain
SCSI-2 drives are dying; 50-pin replacements cost more per GB than a 4-TB NVMe stick. Getting files on/off the machine usually involves FTP-over-Ethernet or burning a CD-ROM on another computer.
4. No security updates
The last patch was released before the commercial web really existed. Putting it on the open Internet is asking to be scanned and possibly exploited within minutes.
5. Display and peripherals
The original 17" NeXT MegaPixel monitor is fixed-frequency 1120×832@68 Hz grayscale. Adapters to VGA exist, but color is impossible and refresh rates annoy modern LCDs. Sound is 16-bit 22 kHz mono; the optical drive is a 660 MB magneto-optical cartridge.
6. Power and parts
The cube power supply is a custom 350 W linear unit that runs hot; capacitors are drying out. When it dies you will be scavenging eBay or recapping it yourself. Spare motherboards cost more than a whole used ThinkPad.
7. Productivity software
There are serviceable versions of Mathematica, FrameMaker, Lotus Improv and OmniWeb 3.x, but file-format compatibility stops at Office 97-era documents. You will constantly round-trip files through another machine.
What it is good for
- Historical research, museum exhibits, retro-programming in Objective-C 1.0 or 68k assembly.
- Demonstrating the birth of the World-Wide Web (Tim Berners-Lee’s first server ran on a NeXT).
- Enjoying the UI that later became macOS (AppKit, Dock, Services, .app bundles).
- Collecting: a working color NeXTcube with turbo slab and original laser printer is a conversation piece worth several thousand dollars.
Practical compromise
Keep your modern PC (or a quiet mini-PC/Mac mini) for real work and run a NeXT emulator instead:
- Previous (68k emulator) boots NeXTSTEP 3.3 on macOS or Linux.
- VirtualNeXT (based on Previous) for Windows.
- QEMU’s “next” target is improving.
You get the look-and-feel, can compile legacy code, and still have 4K video calls when you need them.
Bottom line
A NeXTcube is a beautiful, important machine—buy one if you love vintage hardware—but do not voluntarily make it the computer you rely on to pay bills, join Zoom meetings, or play YouTube.
Open weights are the equivalent of the binary driver blobs everyone hates: "Here is an opaque thing, you have to put it on your computer and trust it, and you can't modify it."
I'm reading newspaper articles through a MoE of gemini3flash and gpt5mini, and what made it hard to use open models (at the time) was a lack of support for pydantic.
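For context, the pydantic pattern in question looks roughly like this (schema and field names made up for illustration): define a model, send its JSON schema along with the article, and validate whatever comes back.

    import json
    from pydantic import BaseModel

    class ArticleSummary(BaseModel):
        headline: str
        topics: list[str]
        sentiment: str

    # The JSON schema you'd embed in the prompt (or hand to a provider's
    # structured-output API, where supported):
    schema = ArticleSummary.model_json_schema()

    def parse_reply(raw_reply: str) -> ArticleSummary:
        # Raises pydantic.ValidationError if the model drifted from the schema
        return ArticleSummary.model_validate(json.loads(raw_reply))

    print(parse_reply('{"headline": "Example", "topics": ["ai"], "sentiment": "neutral"}'))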
You should try out K2.5 for your use case, it might actually succeed where previous generation open source models failed.
I just don't have enough funding to do a ton of tests
Also, that's with no context, so it would be slower as it filled (I don't think K2.5 uses the Kimi-Linear KDA attention mechanism, so it's sub-quadratic but not their lowest).
Rough estimate: 12.5:2.2, so you should get around 5.5 tokens/s.
For example using two Strix Halo you can get 17 or so tokens/s with MiniMax M2.1 Q6. That's a 229B parameter model with a 10b active set (7.5GB at Q6). The theoretical maximum speed with 256GB/s of memory bandwidth would be 34 tokens/s.
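The same back-of-the-envelope math, written out: decode speed for an MoE is roughly memory bandwidth divided by the bytes of active weights read per token (the ~32B active figure for K2.5 is from upthread; real throughput will be lower once context and overhead kick in).

    def max_tok_s(bandwidth_gb_s, active_params_b, bits_per_param):
        active_gb = active_params_b * bits_per_param / 8
        return bandwidth_gb_s / active_gb

    print(max_tok_s(256, 10, 6))   # MiniMax M2.1: 10B active at Q6 -> ~34 tok/s ceiling
    print(max_tok_s(256, 32, 4))   # K2.5: ~32B active at 4-bit -> ~16 tok/s ceiling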
Would I use it again compared to Deep Research products elsewhere? Maybe; probably not, but only because it's hard to switch apps.
And don't get me started with the "yeah but if the PRC" argument, because it's gross when the US can de facto ban and impose conditions even on European companies, let alone the control it has over US ones.
[0] https://en.wikipedia.org/wiki/Moonshot_AI#Funding_and_invest...
Just today I asked for a code review and it flagged a method that can be `static`. The problem is it was already static. That kind of stuff never happens with Opus 4.5 as far as I can tell.
Also, in opencode's Plan mode (read only), it generated a plan and, instead of presenting it and stopping, decided to implement it. It could not use the edit and write tools because the harness was in read-only mode, but it had bash and started using bash to edit stuff. It wouldn't just fucking stop, even though the error messages it received from opencode stated why. Its plan and the resulting code were OK, so I let it go crazy though...
I still find Opus is "sharper" technically, tackles problems more completely & gets the nuance.
But man, Kimi K2.5 can write. Even if I don't have a big problem description, just a bunch of specs, Kimi is there, writing good intro material, producing text that more than elaborates, that actually explains. Opus and GLM-4.7 have both complimented Kimi on its writing.
Still mainly using my z.ai glm-4.7 subscription for the work, so I don't know how capable it really is. But I do tend to go for some Opus in sticky spots, and especially given the 9x price difference, I should try some Kimi. I wish I was set up for better parallel evaluation; feels like such a pain to get started.
(https://platform.moonshot.ai/docs/guide/agent-support#config...)
How does Kimi 2.5 compare to it in real world scenarios?
It seems to be priced the same, and if it's being hosted somewhere rather than run locally, it's still a worse model; the only advantage would be that it is not Anthropic or OpenAI.
DeepSeek 3.2 was already quite compelling. I expect its successor will be competitive.
kimi k2 thinking turbo
"He’s not being poetic; he’s weaponizing nostalgia. Quoting Casablanca—the ultimate “beautiful sacrifice” romance—to taunt Morton about corporate backstabbing is pure executive sleaze. It’s the same move as his mahogany desk and flag pin: draping sentimental Americana over naked power. He’s essentially saying, “Isn’t it romantic, Bob? You fought the good fight, now you die.” "
grok 4.1
"So when Dick Jones smirks and says, “Time goes by, it’s the same old story… the fight for love and glory, huh, Bob?” that’s pure Ed Neumeier flexing his love of old Hollywood while simultaneously mocking how these psychopathic ’80s execs think they’re in some grand cinematic drama instead of just murdering a guy for market share."
chatgpt-4o
"The reason the Casablanca reference matters is RoboCop itself is partially a noir—and also a tragedy built inside a genre engine. [… ] So yeah, calling back to Casablanca is not just some meta-joke—it’s a knowing wink that RoboCop is a noir trapped in cybernetic skin. // Even Clarence Boddicker, for all his cartoonish sadism, plays like a twisted stand-in for a noir heavy—talking smooth, dropping nicknames, throwing out casual violence like it’s a punchline."
opus 4.5
"Jones is performing avuncular menace. He's the kind of executive who'd quote old movies while having you murdered, because it's all performance to him, all style. The substance is just power."