I still haven't experienced a local model that fits on my 64GB MacBook Pro and can run a coding agent like Codex CLI or Claude Code well enough to be useful.
Maybe this will be the one? This Unsloth guide from a sibling comment suggests it might be: https://unsloth.ai/docs/models/qwen3-coder-next
This distinction is important because some "we support local model" tools have things like ollama orchestration or use the llama.cpp libraries to connect to models on the same physical machine.
That's not my definition of local. Mine is "local network", so call it the "LAN model" until we come up with something better. "Self-hosted" exists, but that usually means something closer to "open-weights" rather than saying anything about the hardware the model has to run on.
It should be defined as roughly sub-$10k, using Steve Jobs' megapenny unit (1 megapenny = one million pennies = $10,000). Essentially, classify models by how many megapennies you have to spend on a machine that won't OOM on them.
That's what I mean when I say local: running inference for 'free' somewhere on hardware I control that costs at most single-digit thousands of dollars, and, if I were feeling fancy, could potentially fine-tune on a timescale of days.
A modern 5090 build-out with a Threadripper, NVMe, and 256GB of RAM will run you about $10k +/- $1k. The MLX route is about $6,000 out the door after tax (M3 Ultra, 60-core, with 256GB).
Lastly, it's not just "number of parameters". Not all 32B Q4_K_M models load at the same rate or use the same amount of memory. The internal architecture matters, and active parameter count + quantization is becoming a poorer approximation given recent SOTA innovations.
What might be needed is a standardized eval benchmark run against standardized hardware classes on basic real-world tasks like tool calling, code generation, and document processing. There are plenty of "good enough" models out there for a large category of everyday tasks; now I want to find out which of them runs best.
Take a Gen 6 ThinkPad P14s / MacBook Pro and a 5090 / Mac Studio, run the benchmark, and report something like time-to-first-token / tokens-per-second / memory-used / total-test-time, rated independently of how accurate the model was.
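As a concrete starting point, here's a minimal sketch of that kind of harness (my own assumptions throughout: a llama-server or any other OpenAI-compatible endpoint on localhost:8080, and a hypothetical prompts.txt of test tasks) that measures time-to-first-token and tokens-per-second per prompt:

    import json, time, urllib.request

    ENDPOINT = "http://localhost:8080/v1/chat/completions"  # assumed local OpenAI-compatible server

    def run_prompt(prompt, model="local"):
        body = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        }).encode()
        req = urllib.request.Request(ENDPOINT, data=body,
                                     headers={"Content-Type": "application/json"})
        start = time.time()
        ttft, n_chunks = None, 0
        with urllib.request.urlopen(req) as resp:
            for raw in resp:
                line = raw.decode().strip()
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                chunk = json.loads(line[len("data: "):])
                if chunk["choices"][0]["delta"].get("content"):
                    if ttft is None:
                        ttft = time.time() - start
                    n_chunks += 1  # streamed chunks as a rough proxy for tokens
        total = time.time() - start
        return ttft, n_chunks / total, total

    # prompts.txt is a hypothetical file with one benchmark task per line
    for task in open("prompts.txt"):
        ttft, tps, total = run_prompt(task.strip())
        print(f"ttft={ttft:.2f}s tok/s={tps:.1f} total={total:.1f}s")

Add peak memory from the server side and a simple pass/fail check on the outputs, and you'd have the accuracy-independent numbers above plus a quality column.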
I am fine renting an H100 (or whatever), as long as I theoretically have access to and own everything running.
I do not want my career to become dependent upon Anthropic.
Honestly, the best thing for "open" might be for us to build open pipes, services, and models that we can run on rented cloud compute. Large models will outpace small models: LLMs, video models, "world" models, etc.
I'd even be fine time-sharing a running instance of a large model in a large cloud. As long as all the constituent pieces are open where I could (in theory) distill it, run it myself, spin up my own copy, etc.
I do not deny that big models are superior. But I worry about the power the large hyperscalers are getting while we focus on small "open" models that really can't match the big ones.
We should focus on competing with large models, not artisanal homebrew stuff that is irrelevant.
There's just not a good way to visualize the compute needed, with all the nuance that exists. I think that trying to create these abstractions is what leads to people impulse-buying resource-constrained hardware and getting frustrated. The hyperscalers have a huge advantage in this field that homelabbers will never be able to match.
Would it not help with the DDR4 example though if we had more "real world" tests?
I use opencode and have done a few toy projects and little changes in small repositories, and can get a pretty speedy and stable experience up to a 64k context.
It would probably fall apart if I wanted to use it on larger projects, but I've often set tasks running on it, stepped away for an hour, and had a solution when I return. It's definitely useful for smaller projects, scaffolding, basic bug fixes, UI tweaks, etc.
I don't think "usable" is a binary thing though. I know you write a lot about this, but it'd be interesting to understand what you're asking the local models to do, and what it is about what they do that you consider unusable on a relative monster of a laptop?
What's interesting to me about this model is how good it allegedly is with no thinking mode. That's my main complaint about qwen3:30b, how verbose its reasoning is. For the size it's astonishing otherwise.
I'm hoping for an experience where I can tell my computer to do a thing - write some code, check for logged errors, find something in a bunch of files - and I get an answer a few moments later.
Setting a task and then coming back to see if it worked an hour later is too much friction for me!
I've had mild success with GPT-OSS-120b (MXFP4, ends up taking ~66GB of VRAM for me with llama.cpp) and Codex.
I'm wondering whether one could crowdsource chat logs for GPT-OSS-120b running with Codex, then seed another post-training run to fine-tune the 20b variant on the good runs from the 120b, and whether that would make a big difference. Both models with reasoning_effort set to high are actually quite good compared to other downloadable models, but the 120b is just about out of reach for 64GB, so getting the 20b better at specific use cases seems like it would be useful.
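As a rough sketch of the data prep side (assuming a hypothetical logs.jsonl of crowdsourced Codex sessions, each with a success flag and an OpenAI-style messages list; the field names are mine, not any real schema), filtering the good 120b runs into a fine-tuning file could look like:

    import json

    KEEP_ROLES = {"system", "user", "assistant", "tool"}

    # logs.jsonl (hypothetical): one JSON object per line, {"success": bool, "messages": [...]}
    with open("logs.jsonl") as src, open("sft_dataset.jsonl", "w") as dst:
        for line in src:
            run = json.loads(line)
            if not run.get("success"):
                continue  # keep only sessions where the task actually got done
            messages = [m for m in run["messages"] if m.get("role") in KEEP_ROLES]
            if len(messages) < 2:
                continue
            # one transcript per line in the common {"messages": [...]} SFT format
            dst.write(json.dumps({"messages": messages}) + "\n")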
I wonder if it has to do with the message format, since it should be able to do tool use afaict.
Who pays for a free model? GPU training isn't free!
I remember early on people saying 100B+ models will run on your phone like nowish. They were completely wrong and I don't think it's going to ever really change.
People always will want the fastest, best, easiest setup method.
"Good enough" massively changes when your marketing team is managing k8s clusters with frontier systems in the near future.
People do not care about the fastest and best past a point.
Let's use transportation as an analogy. If all you have is a horse, a car is a massive improvement. And when cars were just invented, a car with a 40mph top speed was a massive improvement over one with a 20mph top speed and everyone swapped.
While cars with 200mph top speeds exist, most people don't buy them. We all collectively decided that for most of us, most of the time, a top speed of 110-120 was plenty, and that envelope stopped being pushed for consumer vehicles.
If what currently takes Claude Opus 10 minutes can be done in 30ms, then making something that can do it in 20ms isn't going to be enough to get everyone to pay a bunch of extra money.
Companies will buy the cheapest thing that meets their needs. SOTA models right now are much better than the previous generation but we have been seeing diminishing returns in the jump sizes with each of the last couple generations. If the gap between current and last gen shrinks enough, then people won't pay extra for current gen if they don't need it. Just like right now you might use Sonnet or Haiku if you don't think you need Opus.
When that happens, the most powerful AI will be whichever has the most virtuous cycles going with as wide a set of active users as possible. Free will be hard to compete with because raising the price will exclude the users that make it work.
Until then though, I think you're right that open will lag.
Phones are never going to run the largest models locally because they just don't have the size, but we're seeing improvements in capability at small sizes over time that mean that you can run a model on your phone now that would have required hundreds of billions of parameters less than 6 years ago.
Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).
I was just talking at a high level. If transformers are HDD technology, maybe there's SSD right around the corner that's a paradigm shift for the whole industry (but for the average user just looks like better/smarter models). It's a very new field, and it's not unrealistic that major discoveries shake things up in the next decade or less.
On an M1 64GB, Q4_K_M on llama.cpp gives only 20 tok/s, while MLX is more than twice as fast. However, MLX has problems with KV cache consistency and especially with branching. So while in theory it is twice as fast as llama.cpp, it often redoes the prompt processing (PP) all over again, which completely trashes performance, especially with agentic coding.
So the agony is deciding whether to endure half the possible speed but get much better KV caching in return, or to have twice the speed but then often sit through prompt processing again.
But who knows, maybe Qwen gives them a hand? (hint,hint)
Now if you are not happy with the last answer, you may want to simply regenerate it or change your last question - this is branching the conversation. llama.cpp is capable of re-using the KV cache up to that point, while MLX is not (I am using the MLX server from the MLX community project). I haven't tried with LM Studio. Maybe worth a try, thanks for the heads-up.
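For anyone wondering what the llama.cpp side of that looks like in code, here's a minimal sketch (assuming llama-server on localhost:8080 and its native /completion endpoint with cache_prompt; big_file.py is just a stand-in for a long shared prefix): two branched requests where only the changed suffix should need prompt processing the second time:

    import json, urllib.request

    URL = "http://localhost:8080/completion"  # llama-server's native completion endpoint

    def complete(prompt):
        body = json.dumps({
            "prompt": prompt,
            "n_predict": 128,
            "cache_prompt": True,  # ask the server to reuse the KV cache for the shared prefix
        }).encode()
        req = urllib.request.Request(URL, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]

    shared_prefix = open("big_file.py").read()  # stand-in for a long conversation/context

    # First branch: full prompt processing happens here.
    print(complete(shared_prefix + "\n\nExplain what this module does."))

    # Second branch off the same prefix: with prefix caching only the new suffix
    # needs processing; without it, the whole prefix gets reprocessed.
    print(complete(shared_prefix + "\n\nList the public functions and their arguments."))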
That would imply they'd have to be actually smarter than humans, not just faster and be able to scale infinitely. IMHO that's still very far away..
Perhaps I'm grossly wrong -- I guess time will tell.
There is also the counter-intuitive phenomenon where training a model on a wider variety of content than apparently necessary for the task makes it better somehow. For example, models trained only on English content exhibit measurably worse performance at writing sensible English than those trained on a handful of languages, even when controlling for the size of the training set. It doesn't make sense to me, but it probably does to credentialed AI researchers who know what's going on under the hood.
i.e. there is a lot of commonality between programming languages just as there is between human languages, so training on one language would be beneficial to competency in other languages.
System info:
$ ./llama-server --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7897 (3dd95914d)
built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line: $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
--ctx-size 32768

Not as good as running the entire thing on the GPU, of course.
Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?
Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?
The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.
I guess it would make sense to have something like the max context size / quants that fit fully on common configs: single GPU, dual GPU, unified RAM on a Mac, etc.
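Even a rough calculator would go a long way. Here's a back-of-the-envelope sketch (all of the architecture numbers below are illustrative assumptions rather than any specific model, assuming plain GQA attention and an fp16 KV cache, and ignoring activation/graph overhead):

    # Back-of-the-envelope estimate: quantized weights + KV cache vs. your memory budget.

    def weights_gb(n_params_billion, bits_per_weight):
        return n_params_billion * bits_per_weight / 8  # billions of params -> GB

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
        # 2x for keys and values, fp16 cache by default
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=32768)  # made-up example values
    total = weights_gb(80, 4.5) + kv_cache_gb(**cfg)
    print(f"~{total:.1f} GB needed for an 80B model at ~4.5 bits/weight with 32k context")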
brew upgrade llama.cpp # or brew install if you don't have it yet
Then: llama-cli \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
That opened a CLI interface. For a web UI on port 8080 along with an OpenAI chat completions compatible endpoint do this: llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--fit on \
--seed 3407 \
--temp 1.0 \
--top-p 0.95 \
--min-p 0.01 \
--top-k 40 \
--jinja
It's using about 28GB of RAM.

Are there any clear winners per domain? Code, voice-to-text, text-to-voice, text editing, image generation, text summarization, business-text-generation, music synthesis, whatever.
Even for my usual toy coding problems it would get simple things wrong and require some poking to get it right.
A few times it got stuck in thinking loops and I had to cancel prompts.
This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.
3.7 was not all that great. 4 was decent for specific things, especially self contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.
If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.
Of course you get degraded performance with this.
Importantly, this isn't just throwing more data at the problem in an unstructured way. AFAIK companies are getting as many git histories as they can and doing something along the lines of: have an LLM checkpoint pull requests, features, etc. and convert those into plausible input prompts, then run deep RL with something that passes the acceptance criteria / tests as the reward signal.
There's no reason for a coding model to contain all of ao3 and wikipedia =)
If we knew how to create a SOTA coding model by just putting coding stuff in there, that is how we would build SOTA coding models.
I had not considered that, seems like a great solution for local models that may be more resource-constrained.
Besides, programming is far from just knowing how to autocomplete syntax, you need a model that's proficient in the fields that the automation is placed in, otherwise they'll be no help in actually automating it.
I'm more bullish about small and medium sized models + efficient tool calling than I'm about LLMs too large to be run at home without $20k of hardware.
The model doesn't need to have the full knowledge of everything built into it when it has the toolset to fetch, cache and read any information available.
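A minimal sketch of what that looks like in practice (assuming the openai client pointed at a local OpenAI-compatible server, and a hypothetical fetch_url tool; whether a given small model calls it reliably is exactly the open question):

    import json
    from openai import OpenAI  # any OpenAI-compatible client pointed at a local server

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

    tools = [{
        "type": "function",
        "function": {
            "name": "fetch_url",  # hypothetical tool that fetches (and could cache) a page
            "description": "Fetch a URL and return its text content",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": "What does PEP 703 change about the GIL?"}],
        tools=tools,
    )

    # The model doesn't need the answer memorized; it just needs to decide to call the tool.
    msg = resp.choices[0].message
    if msg.tool_calls:
        call = msg.tool_calls[0]
        print(call.function.name, json.loads(call.function.arguments))
    else:
        print(msg.content)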
The video is sped up. I ran it through LM Studio and then OpenCode. Wrote a bit about how I set it all up here: https://www.tommyjepsen.com/blog/run-llm-locally-for-coding
Overall, it's allowed me to maintain more consistent workflows as I'm less dependent on Opus. Now that Mastra has introduced the concept of Workspaces, which allow for more agentic development, this approach has become even more powerful.
Related: as an actual magician, although no longer performing professionally, I was telling another magician friend the other day that IMHO, LLMs are the single greatest magic trick ever invented judging by pure deceptive power. Two reasons:
1. Great magic tricks exploit flaws in human perception and reasoning by seeming to be something they aren't. The best leverage more than one. By their nature, LLMs perfectly exploit the ways humans assess intelligence in themselves and others - knowledge recall, verbal agility, pattern recognition, confident articulation, etc. No other magic trick stacks so many parallel exploits at once.
2. But even the greatest magic tricks don't fool their inventors. David Copperfield doesn't suspect the lady may be floating by magic. Yet, some AI researchers believe the largest, most complex LLMs actually demonstrate emergent thinking and even consciousness. It's so deceptive it even fools people who know how it works. To me, that's a great fucking trick.
From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
I wish AMD would get around to adding NPU support in Linux for it though, it has more potential that could be unlocked.
I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.
I suspect there will be some optimizations over the next few weeks that will pick up the performance on these type of machines.
I have it writing some Rust code and it's definitely slower than using a hosted model but it's actually seeming pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.
I suspect the API providers will offer this model for nice and cheap, too.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...
--no-mmap --fa on options seemed to help, but not dramatically.
As with everything Spark, memory bandwidth is the limitation.
I'd like to be impressed with 30tok/sec but it's sort of a "leave it overnight and come back to the results" kind of experience, wouldn't replace my normal agent use.
However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.
To be clear, since this confuses a lot of people in every thread: Anthropic will let you use their API with any coding tools you want. You just have to go through the public API and pay the same rate as everyone else. They have not "blocked" or "banned" any coding tools from using their API, even though a lot of the clickbait headlines have tried to insinuate as much.
Anthropic never sold subscription plans as being usable with anything other than their own tools. They were specifically offered as a way to use their own apps for a flat monthly fee.
They obviously set the limits and pricing according to typical use patterns of these tools, because the typical users aren't maxing out their credits in every usage window.
Some of the open source tools reverse engineered the protocol (which wasn't hard) and people started using the plans with other tools. This situation went on for a while without enforcement until it got too big to ignore, and they began protecting the private endpoints explicitly.
The subscription plans were never sold as a way to use the API with other programs, but I think they let it slide for a while because it was only a small number of people doing it. Once the tools started getting more popular they started closing loopholes to use the private API with other tools, which shouldn't really come as a surprise.
Eventually they added subscription support and that worked better than Cline or Kilo, but I'm still not clear what Anthropic tools the subscription was actually useful for.
Some people think LLMs are the final frontier. If we just give in and let Anthropic dictate the terms to us we're going to experience unprecedented enshittification. The software freedom fight is more important than ever. My machine is sovereign; Anthropic provides the API, everything I do on my machine is my concern.
They simply don't want to compete, they want to force the majority of people that can't spend a lot on tokens to use their inferior product.
Why build a better product if you control the cost?
Problem is, most people don't do this, choosing convenience at any given moment without thinking about longer-term impact. This hurts us collectively by letting governments/companies, etc tighten their grip over time. This comes from my lived experience.
I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.
Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.
I can't comment on Opus in CC because I've never bit the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).
I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.
I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".
I've tried explaining the implementation word for word and it still prefers to create a whole new implementation, reimplementing some parts, instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.
There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.
For actual work that I bill for, I go in with instructions to do minimal changes, and then I carefully review/edit everything.
That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.
At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.
I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.
Edit: It's very true that the big 4 labs silently mess with their models and any action of that nature is extremely user hostile.
I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.
What I find most frustrating is that I am not sure if it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs like writing random text continuously and burning output tokens, Grok seems to have system prompts that result in odd behaviour (no bugs, just doing weird things), Gemini Flash models seem to output massive quantities of text for no reason... it often feels like very stupid things.
Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?
I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.
The day for open models will come but it still feels so close and so far.
If people start using the Claude Max plans with other agent harnesses that don't use the same kinds of optimizations, the economics may no longer work out.
(But I also buy that they're going for horizontal control of the stack here and banning other agent harnesses was a competitive move to support that.)
They seem to have started rejecting 3rd party usage of the sub a few weeks ago, before Claw blew up.
By the way, does anyone know about the Agents SDK? Apparently you can use it with an auth token, is anyone doing that? Or is it likely to get your account in trouble as well?
I've had a similar experience with opencode, but I find that works better with my local models anyway.
(There probably is, but I found it very hard to make sense of the UI and how everything works. Hard to change models, no chat history etc.?)
> hitting that limit is within the terms of the agreement with Anthropic
It's not, because the agreement says you can only use CC.
Selling dollars for $.50 does that. It sounds like they have a business model issue to me.
Without knowing the numbers it's hard to tell if the business model for these AI providers actually works, and I suspect it probably doesn't at the moment, but selling an oversubscribed product with baked in usage assumptions is a functional business model in a lot of spaces (for varying definitions of functional, I suppose). I'm surprised this is so surprising to people.
Being a common business model and it being functional are two different things. I agree they are prevalent, but they are actively user hostile in nature. You are essentially saying that if people use your product at the advertised limit, then you will punish them. I get why the business does it, but it is an adversarial business model.
There are already many serious concerns about sharing code and information with 3rd parties, and those Chinese open models are dangerously close to destroying their entire value proposition.
The problem is, there's not a clear every-man value like Uber has. The stories I see of people finding value are sparse and seem to come from the POV of either technosexuals or already-strong developer whales leveraging the bootstrappy power.
If AI was seriously providing value, orgs like Microsoft wouldn't be pushing out versions of windows that can't restart.
It clearly is a niche product unlike Uber, but it's definitely being invested in like it is universal product.
It's within their capability to provision for higher usage by alternative clients. They just don't want to.
it's like Apple: you can use macOS only on our Macs, iOS only on iPhones, etc. but at least in the case of Apple, you pay (mostly) for the hardware while the software it comes with is "free" (as in free beer).
Could have just turned a blind eye.
(Edit due to rate-limiting: I see, thanks -- I wasn't aware there was more than one token type.)
That's not the product you buy when you buy a Claude Code token, though.
This confused me for a while, having two separate "products" which are sold differently, but can be used by the same tool.
If a company is going to automate our jobs, we shouldn't be giving them money and data to do so. They're using us to put ourselves out of work, and they're not giving us the keys.
I'm fine with non-local, open weights models. Not everything has to run on a local GPU, but it has to be something we can own.
I'd like a large, non-local Qwen3-Coder that I can launch in a RunPod or similar instance. I think on-demand non-local cloud compute can serve as a middle ground.
I can also imagine a dysfunctional future where developers spend half their time convincing their AI agents that the software they're writing is actually aligned with the model's set of values.
And yeah, I got three (for some reason) emails titled "Your account has been suspended" whose content said "An internal investigation of suspicious signals associated with your account indicates a violation of our Usage Policy. As a result, we have revoked your access to Claude.". There is a link to a Google Form which I filled out, but I don't expect to hear back.
I did nothing even remotely suspicious with my Anthropic subscription so I am reasonably sure this mirroring is what got me banned.
Edit: BTW I have since iterated on doing the same mirroring using OpenCode with Codex, then Codex with Codex and now Pi with GPT-5.2 (non-Codex) and OpenAI hasn't banned me yet and I don't think they will as they decided to explicitly support using your subscription with third party coding agents following Anthropic's crackdown on OpenCode.
I'm not so sure. It doesn't sound like you were circumventing any technical measures meant to enforce the ToS which I think places them in the wrong.
Unless I'm missing some obvious context (I don't use Mac and am unfamiliar with the Bun.spawn API) I don't understand how hooking a TUI up to a PTY and piping text around is remotely suspicious or even unusual. Would they ban you for using a custom terminal emulator? What about a custom fork of tmux? The entire thing sounds absurd to me. (I mean the entire OpenCode thing also seems absurd and wrong to me but at least that one is unambiguously against the ToS.)
It’d be cool if Anthropic were bound by their terms of use that you had to sign. Of course, they may well be broad enough to fire customers at will. Not that I suggest you expend any more time fighting this behemoth of a company though. Just sad that this is the state of the art.
* Subscription plans, which are (probably) subsidized and definitely oversubscribed (ie, 100% of subscribers could not use 100% of their tokens 100% of the time).
* Wholesale tokens, which are (probably) profitable.
If you try to use one product as the other product, it breaks their assumptions and business model.
I don't really see how this is weaponized malaise; capacity planning and some form of over-subscription is a widely accepted thing in every industry and product in the universe?
Also, this is more like "I sell a service called take a bike to the grocery store" with a clause in the contract saying "only ride the bike to the grocery store." I do this because I am assuming that most users will ride the bike to the grocery store 1 mile away a few times a week, so they will remain available, even though there is an off chance that some customers will ride laps to the store 24/7. However, I also sell a separate, more expensive service called Bikes By the Hour.
My customers suddenly start using the grocery store plan to ride to a pub 15 miles away, so I kick them off of the grocery store plan and make them buy Bikes By the Hour.
They could, of course, price your 10GB plan under the assumption that you would max out your connection 24 hours a day.
I fail to see how this would be advantageous to the vast majority of the customers.
OpenCode et al continue to work with my Max subscription.
What Anthropic blocked is using OpenCode with the Claude "individual plans" (like the $20/month Pro or $100/month Max plan), which Anthropic intends to be used only with the Claude Code client.
OpenCode had implemented some basic client spoofing so that this was working, but Anthropic updated to a more sophisticated client fingerprinting scheme which blocked OpenCode from using this individual plans.
I recommend Ghostty for Mac users. Alacritty probably works too.
{
"plugin": [
"opencode-anthropic-auth@latest"
]
}

Please list what capabilities you would like our local model to have and how you would like to have it served to you.
[1] a sovereign digital nation built on a national framework rather than a for-profit or even non-profit framework, will be available at https://stateofutopia.com (you can see some of my recent posts or comments here on HN.)
[2] https://www.youtube.com/live/0psQ2l4-USo?si=RVt2PhGy_A4nYFPi
It's that simple. Everyone else is trying to compete in other ways, and Anthropic is pushing to dominate the market.
They'll eventually lose their performance edge and suddenly they will be back to being cute and fluffy.
I've cancelled a Claude sub, but still have one.
I've tried all of the models available right now, and Claude Opus is by far the most capable.
I had an assertion failure triggered in a fairly complex open-source C library I was using, and Claude Opus not only found the cause, but wrote a self-contained reproduction code I could add to a GitHub issue. And it also added tests for that issue, and fixed the underlying issue.
I am sincerely impressed by the capabilities of Claude Opus. Too bad its usage is so expensive.
I wonder what they are up to.
You are doing that all the time. You just draw the line, arbitrarily.
It's like this old adage "Our brains are poor masters and great slaves". We are basically just wanting to survive and we've trained ourselves to follow the orders of our old corporate slave masters who are now failing us, and we are unfortunately out of fear paying and supporting anticompetitive behavior and our internal dissonance is stopping us from changing it (along with fear of survival and missing out and so forth).
The global marketing by the slave-master class isn't helping. We can draw a line however arbitrarily we'd like, though, and it's still better and more helpful than complaining "you drew a line arbitrarily" and not actually doing any of the hard, courageous work of drawing lines of any kind in the first place.
Hope they update the model page soon https://chat.qwen.ai/settings/model
Sorry, but we're talking about models as content now? There's almost always a better word than "content" if you're describing something that's in tech or online.
Granted these 80B models are probably optimized for H100/H200 which I do not have. Here's to hoping that OpenClaw compat. survives quantization
I would recommend trying llama.cpp's llama-server with models of increasing size until you hit the best quality / speed tradeoff with your hardware that you're willing to accept.
The Unsloth guides are a great place to start: https://unsloth.ai/docs/models/qwen3-coder-next#llama.cpp-tu...
one more thing, that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume that is the size I want?)... how do I pick which one to use?
IQ4_XS
Q4_K_S
Q4_1
IQ4_NL
MXFP4_MOE
Q4_0
Q4_K_M
Q4_K_XL

Does anyone have any experience with these, and is this release actually workable in practice?
They are usually as good as the flagship model from 12-18 months ago. Which may sound like a massive difference, because in some ways it is, but it's also fairly reasonable; you don't need to live on the bleeding edge.
Running this thing locally on my Spark with 4-bit quant I'm getting 30-35 tokens/sec in opencode but it doesn't feel any "stupider" than Haiku, that's for sure. Haiku can be dumb as a post. This thing is smarter than that.
It feels somewhere around Sonnet 4 level, and I am finding it genuinely useful at 4-bit even. Though I have paid subscriptions elsewhere, so I doubt I'll actually use it much.
I could see configuring OpenCode somehow to use paid Kimi 2.5 or Gemini for the planning/analysis & compaction, and this for the task execution. It seems entirely competent.
The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
It's one thing running the model without any context, but coding agents build it up close to the max and that slows down generation massively in my experience.
The instability of the tooling outside of the LLM is what keeps me from building anything on the cloud, because you're attaching your knowledge and work flow to a tool that can both change dramatically based on context, cache, and model changes and can arbitrarily raise prices as "adaptable whales" push the cost up.
It's akin to learning everything about Beanie Babies in the early 1990s and, right when you think you understand the value proposition, suddenly they're all worthless.
So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.
And then I expect the Big3, OpenAI/Google/Anthropic will try to clog the airspace at the same time, to get in front of the potential competition.
Compared to RISC core designs or IC optimization, the pace of AI innovation is slow and easy to follow.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
I'm currently using Qwen 2.5 16b, and it works really well.
I got stuff done with Sonnet 3.7 just fine, it did need a bunch of babysitting, but still it was a net positive to productivity. Now local models are at that level, closing up on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
That's the tyranny of comfort. Same for high end car, living in a big place, etc.
There's a good work around though: just don't try the luxury in the first place so you can stay happy with the 9 months delay.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
And at the end of the day, does it matter?