Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

139 points by cloudking 4 hours ago | 47 comments

horsawlarway43 minutes ago
For personal use, yes.
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
- agup792a few seconds ago
  That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.
- gonzalohm8 minutes ago
  Did you double the tokens per second by adding a second GPU or was the increase significantly less?
  - mirekrusin3 minutes ago
    You’re adding extra gpu for more vram, not speed.
bluejay238726 minutes ago
About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but it's enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UIs, as that seems to be the weakest element of working in Qwen. This isn't a recommendation because I don't think most people have an RTX 6000 lying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.
Other Notes: I have had to set the compact target to 75% on a 256k context window, as once the conversation length goes above 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though it's much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to outperform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
- bo102411 minutes ago
  Qwen3.5-122B is actually Qwen3.5-122B-A10B. The A10B means that this is a "mixture of experts" model where only 10B parameters are activated at a given time. Whereas Qwen3.6-27B is a "dense" model where all 27B parameters are activated all the time. So for many tasks, you'd expect the 27B dense model to be better than the 122B-A10B model.
- htrp12 minutes ago
  why 27b vs 35b? Is MoE that much worse for coding?
pierotofyan hour ago
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
- jacobgold30 minutes ago
  "Quality is like running edge models from 8-12 months ago."
  That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months ago (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
  - sbrother11 minutes ago
    I strongly agree on that being the release where these tools got good enough to substantially speed up my professional work. I have to admit I was super skeptical of AI coding until then.
  - pierotofy26 minutes ago
    I use it for work.
    jacobgold19 minutes ago
    That's cool if you prefer it, but it is hard to imagine it being a strictly rational choice when much better quality is available at a price that is small relative to the cost of an employee.
    lokara few seconds ago
    Won’t it depend on what you use it for? A less capable system might be fine for boilerplate, moderate re-factoring, etc. Not everyone is building whole features in one go.
    vector_spaces6 minutes ago
    Not all work requires every facet to be so sharply optimized, and there may be other constraints that are completely invisible to you. Some that were easy for me to imagine: the parent works in a heavily regulated industry, their IT team is slow-moving and paranoid and this is a safe, under-the-radar workaround, the output is "good enough" for their purposes and they find tinkering with it to be fun.
    Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
- trueno23 minutes ago
  i have a 128gb m4 max macbook pro i've been wanting to tinker with this stuff but genuinely never find the time. any mac users in here running similar to the above that can share their experience?
  i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
  - htrp10 minutes ago
    Use your ClaudeCode sub and tell it to set it up for you
- daveidol15 minutes ago
  Do you do your dev work on the windows machine (referenced in the docs), or do you remotely access it from a separate machine? I ask because I have a RTX 3090 kicking around in a gaming desktop, but I don't use it for any dev work (I use a Macbook Pro).
- atomicnumber342 minutes ago
  Same. I have no desire to use Claude at all anymore.
  - pierotofy34 minutes ago
    Yep. Screw Anthropic, CloseAI and all other rent seekers in this space.
- dheera25 minutes ago
  Am I doing something wrong or has ollama become shittified?
  I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
  I thought the whole POINT of ollama was not-cloud?
  - hoherd2 minutes ago
    I experienced the same situation a month or two ago. One of my friends sent me this article that was illuminating. https://sleepingrobots.com/dreams/stop-using-ollama/
  - toyg7 minutes ago
    Yes, you've nailed it. Ollama are desperately trying to pull a Cursor - like 3791 other projects in this space.
  - satvikpendem17 minutes ago
    Ollama is not recommended to be used. Use llama.cpp.
- lelandbatey29 minutes ago
  I use it, it's good, I get work done, but know that they really mean it when they say
  > "Quality is like running edge models from 8-12 months ago"
  Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
- dominotw30 minutes ago
  how much does the setup cost if i want to buy all the hardware now and increased power costs?
sosodevan hour ago
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
- argeean hour ago
  I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.
  I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
zaptheimpaler4 minutes ago
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
codinhoodan hour ago
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus is just not worth it right now. If it was, it would be disruptive enough to be in the news.
Not that I'm ruling out that someone has already solved this, just trying to Occam's-razor my way out of diving too deep down rabbit holes.
- jrm4an hour ago
  But you're pretty much measuring opportunity cost in tokens per second, no?
  I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by perceived quality of private model) actually means "better or more useful output."
  I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
  - codinhood7 minutes ago
    If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a model and not really the point I'm trying to make. I try to set things up and test it on my actual projects.
    What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
    Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
  - Rastonbury4 minutes ago
    I think they are referring to the opportunity cost of time saved on doing things a local model cannot do or fixing its mistakes against the cost of a subscription
arjie2 hours ago
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
- akersten3 minutes ago
  > I have 2x RTX Pro 6000 Blackwell
  Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
- leptonsan hour ago
  Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.
  - ux26647820 minutes ago
    Not nearly as much as you might think. 1.2 kW where I live translates to about $0.12/hr, and that's when running full clip. If you have a decent solar hookup, it's a small fraction of that on a sunny day.
    The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
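The electricity arithmetic in the reply above can be reproduced with a short sketch. The $0.10/kWh rate is only what the quoted figures imply (1.2 kW at about $0.12/hr), and the 8 hours a day over 30 days is a hypothetical usage pattern, not something stated in the thread.

```python
def hourly_cost(power_kw: float, price_per_kwh: float) -> float:
    """Cost in dollars of running a load of power_kw for one hour."""
    return power_kw * price_per_kwh


# 1.2 kW is from the comment; $0.10/kWh is the rate implied by "$0.12/hr".
per_hour = hourly_cost(power_kw=1.2, price_per_kwh=0.10)

# Hypothetical usage: 8 hours a day at full load for 30 days.
monthly = per_hour * 8 * 30

print(f"${per_hour:.2f}/hr, ${monthly:.2f}/month")
```

Idle draw, cooling, and local tariff structure are left out, so treat the result as a lower-effort estimate rather than a bill.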
mv45 minutes ago
I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.
Kostic38 minutes ago
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
stymaar43 minutes ago
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).
I have way too much VRAM for such a model, but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Claude occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
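For readers unsure what these two throughput numbers mean in practice, here is a rough sketch that turns them into wall-clock time for a single request. The speeds are the ones quoted above; the prompt and output sizes are hypothetical, and the model ignores prompt caching and the slowdown several commenters report at long context.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prompt_tps: float, gen_tps: float) -> float:
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    if prompt_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


# 800 tps and 50 tps are from the comment; the token counts are made up.
total = estimate_seconds(prompt_tokens=20_000, output_tokens=1_000,
                         prompt_tps=800, gen_tps=50)
print(f"{total:.0f} s")  # 25 s of prompt processing + 20 s of generation
```

With large agent prompts the prompt-processing term tends to dominate, which is why several posters report it separately from generation speed.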
HappySweeney2 hours ago
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. The cloud models all play with that like it's nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
NetOpWibby28 minutes ago
I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
- trueno22 minutes ago
  we keep moving the goalposts on when we're gonna be happy with local. first it was sonnet at home as the good enough, then opus, now it's the mysterious leading model that runs on infrastructure we can't feasibly have at home
acc_2972 hours ago
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the tics that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies
but perhaps one individual's prompt feedback just isn't going to ever be enough, I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc. and apparently these end up with bizarre behaviours, not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendencies.
- roliszan hour ago
  I'm interested in trying something similar. I was thinking to do this for my OpenClaw agent.
  About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
nfrankelan hour ago
I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
cuttysnarkan hour ago
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent (qwen3:14b), etc. don't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent that they were to produce only a manifest based on the task, and to pass that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
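The workflow described above can be sketched roughly as follows. This is not the poster's code: the `Step` type, `run_workflow`, the retry count, and the stub agents are all hypothetical, and plain Python functions stand in for the ollama model calls and the Playwright check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# An agent takes the previous step's output and returns new output.
Agent = Callable[[str], str]
# A checker returns an error message, or None when the output is clean.
Checker = Callable[[str], Optional[str]]


@dataclass
class Step:
    name: str
    agent: Agent
    max_retries: int = 2


def run_workflow(task: str, steps: List[Step], check: Checker) -> str:
    """Run steps in order; forward a step's output only once it passes check."""
    current = task
    for step in steps:
        prompt = current
        for _ in range(step.max_retries + 1):
            output = step.agent(prompt)
            error = check(output)
            if error is None:
                current = output  # error-free block moves to the next step
                break
            # Surface the error to the same agent that produced the block.
            prompt = f"{current}\n\nPrevious attempt failed: {error}"
        else:
            raise RuntimeError(f"{step.name} still failing: {error}")
    return current


# Stub agents standing in for model calls (hypothetical behavior).
def schema_agent(prompt: str) -> str:
    return "manifest: " + prompt.splitlines()[0]


calls = {"n": 0}


def coding_agent(prompt: str) -> str:
    calls["n"] += 1
    # Fail on the first attempt to exercise the retry path.
    if calls["n"] == 1:
        return "ERROR"
    return "code for [" + prompt.splitlines()[0] + "]"


def check(output: str) -> Optional[str]:
    return "contains ERROR" if "ERROR" in output else None


result = run_workflow(
    "todo app",
    [Step("schema", schema_agent), Step("coder", coding_agent)],
    check,
)
print(result)
```

Raising after the retry budget is one place a human could step in, which matches the poster's point about having places to intervene when a workflow is only partly successful.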
jmichaelsonan hour ago
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
BiraIgnacio38 minutes ago
I tried for a bit, with llama.cpp + Qwen + Mac Pro, but the results were very poor (both quality and speed). I considered investing in better hardware but, doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
boringg36 minutes ago
Will the AI labs always make sure there is at least a year's worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
ecshaferan hour ago
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
AH4oFVbPT4f817 minutes ago
Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
- xeonax3 minutes ago
  Whats .NET doing in between?
cheekygeekyan hour ago
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says DeepSeek is his model of choice for coding (he calls it "pretty GOOD"). He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
K0balt2 hours ago
Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
- kadobanan hour ago
  What tool do you use to drive things for you, out of curiosity?
- kandros2 hours ago
  I’d rather ask my butcher than Haiku for coding tasks
  - papichulo423 minutes ago
    Agreed on this. Anthropic has now changed the verbiage on the definitions of the models under `/model` to say that Opus is for everyday usage, and Sonnet is for routine tasks.
    There's apparently a reason Sonnet and Haiku have been left in previous version #s.
    Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
mitchell_han hour ago
Tried. The context windows just weren't big enough.
- deadbabean hour ago
  Prompt more directly instead of open ended.
- lysacean hour ago
  Got a similar result (my RTX 4070 only has 12 GB). I'm curious about whether 24/32 GB meaningfully improves this enough to make it useful.
  - tobyhinloopenan hour ago
    Try it on RAM and CPU.
    It’s slower but you can run them.
    lysace2 minutes ago
    Good idea for evaluating the models, thanks.
blurbleblurblean hour ago
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
- horsawlarway29 minutes ago
  Pi is decent.
  I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
  Pi is... just fine.
  It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
  If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
  Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)
  It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.
  And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)
  Which is something that all the other providers charge you api access rates for (ex - thousands a month).
- Insanityan hour ago
  Heard good things about pi.dev but haven’t tried it. It might take care of some of those missing features you mentioned.
  - bityard40 minutes ago
    pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.
    horsawlarway25 minutes ago
    I mean - the base experience is just fine, with perfectly reasonable built in tools for file access and editing, plus bash.
    But yes - it expands a lot if you're willing to play with it.
    I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
anubhav20036 minutes ago
Yes, llama.cpp, qwen27b, 35b, claude code. Llama-cpp-manager for managing llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager)
dabinatan hour ago
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
gigatexal4 minutes ago
I tried to. I just couldn't get over how loud it made my otherwise whisper-quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adapting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my $20/month Claude subscription is enough and I find myself using that more and more.
hegdeezy40 minutes ago
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
anonymousiaman hour ago
This was posted shortly after your Ask HN post:
My Homelab AI Dev Platform
https://news.ycombinator.com/item?id=48542433
Lwerewolfan hour ago
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cwd and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e. the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
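The sandbox described in this comment might look roughly like the command built below. It is a sketch under assumptions, not the poster's actual setup: the namespace name `llm-only`, the project path, the `/tmp` tmpfs location, and the `pi` command name are hypothetical, and the firewall rules inside the network namespace (the part that restricts traffic to one IP and port) are not shown. The `bwrap` and `ip netns exec` options used are standard ones.

```python
import os
import shlex
from typing import List


def sandbox_argv(netns: str, workdir: str, agent_cmd: List[str]) -> List[str]:
    """Build an `ip netns exec ... bwrap ...` command line for a coding agent."""
    pi_dir = os.path.join(os.path.expanduser("~"), ".pi")
    return [
        # Enter a pre-configured network namespace (usually needs root).
        "ip", "netns", "exec", netns,
        "bwrap",
        "--ro-bind", "/", "/",        # whole filesystem read-only
        "--bind", pi_dir, pi_dir,     # agent state stays writable
        "--bind", workdir, workdir,   # project directory stays writable
        "--tmpfs", "/tmp",            # private scratch space
        "--dev", "/dev",
        "--proc", "/proc",
        "--unshare-all",              # new user/pid/ipc/uts/cgroup/net namespaces
        "--share-net",                # but keep the namespace we were started in
        "--die-with-parent",
        "--chdir", workdir,
        *agent_cmd,
    ]


argv = sandbox_argv("llm-only", "/home/me/project", ["pi"])
print(shlex.join(argv))
```

Later bind mounts override earlier ones, which is why the read-only root comes first and the writable paths follow it.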
wuschelan hour ago
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
jwr37 minutes ago
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.
Sure, you can get the local models to generate plausible-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
tumetab12 hours ago
Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significantly lower than the cloud offering.
Also, the lack of enterprise tooling to help select an appropriate model and tooling to run a local LLM does not help.
anubhav20035 minutes ago
Yes, llama.cpp, qwen 27b and 35b, llama-cpp-manager for managing model configs.(https://github.com/anubhavgupta/llama-cpp-manager)
ryandrakean hour ago
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
- riazrizvi35 minutes ago
  All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.
- porkloin18 minutes ago
  I have good results with this setup:
  Hardware:
  - GPU: AMD 7900xtx, 24gb vram
  - CPU: AMD 5950x, AM4
  - RAM: 64gb DDR4 3600
  Software:
  - OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
  - Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
  - Network: tailscale
  - Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
  - LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
  - Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
  Models:
  - Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
  - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
  - gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
  Flags (specific for Qwen 27b, since that's primary model):
  - `-ngl 99` offload all layers to GPU
  - `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
  - `-np 1` single slot (no parallel request handling)
  - `--no-context-shift` error instead of silently sliding the context window when full
  - `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
  - `-b 2048` logical batch size (tokens per submission)
  - `-ub 1024` physical micro-batch (per GPU pass)
  - `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
  - `-fa on` flash attention
  - `--spec-type draft-mtp` use the model's built-in MTP as the draft model
  - `--spec-draft-n-max 3` propose up to 3 draft tokens per step
  - `--spec-draft-n-min 0` allow zero drafts if confidence is low
  - `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
  - `--reasoning-format deepseek` parse <think> blocks in proper format
  - `--chat-template-kwargs '{"enable_thinking": true}'` turns Qwen's thinking mode on by default (clients can override)
  - `--jinja` use the GGUF's Jinja chat template
  - `--temp 0.6` moderate randomness (Qwen recommended value for coding)
  - `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
  - `--top-k 20` top-20 candidates (Qwen recommended value for coding)
  - `--min-p 0.0` disabled (Qwen recommended value for coding)
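  For anyone who wants the flags above as one command line, here is a rough sketch that assembles them into an argument list. The flags are copied verbatim from the list; the `llama-server` binary name, the `-m` flag, and the model path are my assumptions and are not stated in the post. Whether a particular llama.cpp build accepts every flag is not checked here.

```python
import shlex

# Hypothetical location of the GGUF file.
MODEL_PATH = "/models/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf"

# Flags as listed in the post, grouped by purpose.
OFFLOAD_AND_CONTEXT = ["-ngl", "99", "-c", "80000", "-np", "1", "--no-context-shift"]
BATCH_AND_CACHE = [
    "--cache-reuse", "256", "-b", "2048", "-ub", "1024",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0", "-fa", "on",
]
SPECULATIVE = [
    "--spec-type", "draft-mtp",
    "--spec-draft-n-max", "3", "--spec-draft-n-min", "0",
    "--spec-draft-type-k", "q8_0", "--spec-draft-type-v", "q8_0",
]
TEMPLATE = [
    "--reasoning-format", "deepseek",
    "--chat-template-kwargs", '{"enable_thinking": true}',
    "--jinja",
]
SAMPLING = ["--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.0"]


def build_argv(model_path: str) -> list[str]:
    """Return the full argument vector (binary name and -m are assumptions)."""
    return [
        "llama-server", "-m", model_path,
        *OFFLOAD_AND_CONTEXT, *BATCH_AND_CACHE, *SPECULATIVE, *TEMPLATE, *SAMPLING,
    ]


argv = build_argv(MODEL_PATH)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

  Keeping the flags as a list rather than one long string avoids quoting problems with the JSON passed to `--chat-template-kwargs`.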
  Performance (27b, primary model):
  - ~65t/s for token generation
  - ~600 t/s for prompt processing.
  - If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
  - ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
  I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
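  To put the performance numbers in perspective, a back-of-the-envelope sketch of how long one turn takes. The three rates come from the figures above; the prompt and output sizes are made-up examples, and the estimate ignores prompt caching, so it is a worst case for uncached prompts.

```python
PROMPT_TPS = 600.0    # prompt processing, tokens/s (from the post)
GEN_TPS = 65.0        # token generation, tokens/s (from the post)
COLD_START_S = 30.0   # model load after a swap or idle unload (from the post)


def turn_seconds(prompt_tokens: int, output_tokens: int, cold: bool = False) -> float:
    """Estimate wall-clock seconds for one request, with no prompt cache hits."""
    seconds = prompt_tokens / PROMPT_TPS + output_tokens / GEN_TPS
    return seconds + (COLD_START_S if cold else 0.0)


# Hypothetical turn: 20k tokens of uncached prompt, 500 tokens of output.
warm = turn_seconds(20_000, 500)
cold = turn_seconds(20_000, 500, cold=True)
# Processing a completely full 80K context from scratch.
full_prefill = 80_000 / PROMPT_TPS

print(f"warm turn: {warm:.1f}s, cold turn: {cold:.1f}s, full prefill: {full_prefill:.1f}s")
```

  With these example sizes the prompt processing dominates, which is why the prompt cache and a smaller system prompt matter more than raw generation speed.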
  CLI/Harness:
  - Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
  - Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
  - Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
  A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that it's the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
  This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
  Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
_davide_2 hours ago
I used to mix remote and local MiniMax 2.7 (Q3) on my Strix Halo. It ran at 30 t/s tg and 220 t/s pp... it was painfully slow, but it was a good feeling that I could stay offline. Unfortunately M3, which is at opus .8 levels, is 460B parameters and doesn't even fit in 128GB of memory, let alone with a big context. Strix Halo feels like a toy for AI purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
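A quick sketch of the arithmetic behind "doesn't even fit": weight memory is roughly parameter count times bits per weight. This assumes a flat bits-per-weight figure, which is a simplification (real quants mix bit widths), and it ignores KV cache and OS overhead entirely.

```python
MEMORY_GB = 128.0  # unified memory size mentioned in the comment


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a flat bits-per-weight quantization."""
    return params_billion * bits_per_weight / 8.0


for bits in (4, 3, 2):
    size = weight_gb(460, bits)
    verdict = "fits" if size <= MEMORY_GB else "does not fit"
    print(f"460B at {bits} bits/weight: {size:.1f} GB ({verdict} in {MEMORY_GB:.0f} GB)")
```

Even where the weights alone squeeze under the limit, very little is left over for context.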
- sosodevan hour ago
  My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.
  I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
SkitterKherpian hour ago
So far it has always felt like the next version of the local models will be the one that is just good enough.
Razengan2 hours ago
Related: Are there any viable distributed AI models?
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
- SimianSci3 minutes ago
  This is unlikely to happen in any meaningful fashion for quite some time.
  (TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
  Token generation demands so much from a single GPU that it often saturates the bandwidth of consumer-grade interconnects like PCIe. That implies distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
  To give an example: when we split a model's compute between two separate cards in a single workstation, we don't end up with 2x the compute bandwidth for the model. Instead the increase is something small like 20%, depending on the model, because the interconnect (PCIe on consumer hardware) quickly becomes saturated with data being copied between the two GPUs and turns into a bottleneck. And remember that this happens locally over PCIe, which caps out at around 20-35 GB/s depending on the generation of the motherboard.
  Model performance is very much tied to having the fastest, highest-bandwidth single card available, so as to keep data transfers to a minimum, because the sheer volume of data a model needs to run is immense. I simply can't imagine how slow and unusable a model would be if the copy operations its compute requires had to go over the internet. Network speeds are not reliably distributed across the globe, and their unreliable nature would demand extra overhead for data verification.
  The dream of distributed AI is a ways off.
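  A rough sketch of the gap being described, comparing how long one copy takes over PCIe versus a home internet link. The PCIe figures are the 20-35 GB/s range quoted above; the 100 Mbit/s link and the 10 MB payload are assumptions picked for illustration, and latency is ignored, which flatters the network case.

```python
# Bandwidth in GB/s. The last entry is an assumed home connection (100 Mbit/s).
LINKS_GB_PER_S = {
    "PCIe, low end of quoted range": 20.0,
    "PCIe, high end of quoted range": 35.0,
    "home internet, 100 Mbit/s (assumed)": 0.0125,
}


def transfer_ms(payload_mb: float, gb_per_s: float) -> float:
    """Milliseconds to move payload_mb megabytes at gb_per_s, ignoring latency."""
    return payload_mb / 1000.0 / gb_per_s * 1000.0


PAYLOAD_MB = 10.0  # hypothetical amount of data copied between devices per step

for name, bandwidth in LINKS_GB_PER_S.items():
    print(f"{name}: {transfer_ms(PAYLOAD_MB, bandwidth):.2f} ms per copy")
```

  Under these assumptions the same copy is over a thousand times slower across the internet than across the slowest PCIe figure, before any retries or verification.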
- joshuamoyersan hour ago
  I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
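  A small sketch of what "memory bandwidth limited" means for decoding. It uses the common rule of thumb that each generated token reads every active weight once and does about 2 FLOPs per weight; all the numbers in the example calls are hypothetical.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on tokens/s when every token must stream all active weights."""
    return bandwidth_gb_s / active_weights_gb


def arithmetic_intensity(bytes_per_weight: float, batch_size: int = 1) -> float:
    """Approximate FLOPs per byte of weights read: ~2 FLOPs per weight per token,
    with the weights read once and shared across the whole batch."""
    return 2.0 * batch_size / bytes_per_weight


# Hypothetical card: 1000 GB/s memory bandwidth, 16 GB of active weights.
print(decode_ceiling_tps(1000.0, 16.0))            # tokens/s ceiling
print(arithmetic_intensity(0.5))                   # 4-bit weights, single request
print(arithmetic_intensity(0.5, batch_size=8))     # same weights, 8 requests batched
```

  Intensity only rises with batch size, which is why a single user on a slow distributed pipeline gets the worst of both worlds.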
fortysevenan hour ago
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
system2an hour ago
Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
christkv2 hours ago
Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.
dude2507112 hours ago
Yes, running a local model on a natural wetware substrate here.
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
- jasongillan hour ago
  I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.
- HPsquared2 hours ago
  I personally get about 50 tokens per hour.
kertoip_12 hours ago
Just attach OpenRouter to your coding agent tool and try it yourself. All relevant open weight models are there. Every person has different needs and expectations.
dada216an hour ago
Local? No. Via opencode Go subscription using GLM mainly? Yes. I still use Gemini/Claude/GPT via API from OpenRouter for adjacent tasks; I would say $20 per month max in API token costs.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200