Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance; that quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something ~>= Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.
However, there isn't much of a memory increase from running multiple sessions in parallel with one model. It's an HTTP server, and other than some caching, basically stateless.
llama.cpp uses a unified KV cache that is shared between requests (whether they happen in parallel or not). As new requests come in, the server first evicts branches that are no longer referenced, then falls back to evicting the least recently used entries, and so on.
If you come back to a session that's been evicted, its prompt just gets reprocessed from scratch. This only really hurts on very long-context sessions, but it can still hurt.
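Roughly the shape of it, as I understand it (a toy Python sketch of the eviction order, not llama.cpp's actual code; all names here are made up):

    from collections import OrderedDict

    class ToyKVCache:
        # Toy model of the eviction policy: drop branches nothing
        # references anymore first, then fall back to LRU.
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()  # branch_id -> refcount

        def use(self, branch_id):
            # Touching a branch marks it most recently used.
            self.entries[branch_id] = self.entries.pop(branch_id)

        def insert(self, branch_id, refcount=1):
            while len(self.entries) >= self.capacity:
                self._evict_one()
            self.entries[branch_id] = refcount

        def _evict_one(self):
            # Pass 1: evict any branch no request references anymore.
            for bid in list(self.entries):
                if self.entries[bid] == 0:
                    del self.entries[bid]
                    return
            # Pass 2: evict the least recently used entry (front of dict).
            self.entries.popitem(last=False)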
So one way to reduce such evictions (and shrink the KV cache significantly as a bonus) is to reduce the number of KV cache checkpoints.
Checkpoints let you branch a session at any point without recomputing it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting ctx-checkpoints to 0 or 1 will save tons of VRAM and let more sessions stay resident. This is especially true for models with very large checkpoints (such as Gemma 4).
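Something like this when launching the server (a sketch; the flag name is my reading of recent llama.cpp builds, so check llama-server --help on yours, and the model path is hypothetical):

    import subprocess

    # Sketch: launch llama-server with checkpointing dialed down.
    subprocess.run([
        "llama-server",
        "-m", "gemma.gguf",          # hypothetical model path
        "-c", "65536",               # context size
        "--ctx-checkpoints", "1",    # fewer branch checkpoints -> less VRAM
    ])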
Local AI only makes sense for a handful of use cases:
- Privacy
- Constant churning on tokens
- Latency
- Availability
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or when the API cost is too high to bear (subscriptions will all run out if you churn 24/7). I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs (the cost is just electricity). If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price per request limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...
If you've got a 1M token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. I don't have any benchmarks to know for sure, though.
Compaction doesn't save them money; it just makes it possible for you to continue a session. Compact a session too many times and, besides the model basically ceasing to be useful, you eventually can't do anything else in the session because all the context is taken up by compaction notes. But if you never compact, the session soon becomes unusable anyway because it can't output any more tokens. You can disable compaction in those agents if you want to see the difference.
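For anyone unfamiliar, one compaction pass is roughly this (a hand-wavy Python sketch; summarize is a hypothetical stand-in for whatever model call the agent actually makes, and real agents are fancier):

    def summarize(messages):
        # Stand-in for the agent's model call; assume it returns a
        # short text summary of the given turns.
        return "...summary of %d earlier turns..." % len(messages)

    def compact(messages, keep_recent=10):
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        note = {"role": "user",
                "content": "Summary of earlier conversation: " + summarize(old)}
        return [note] + recent

    # Each pass replaces old turns with a note, so after enough passes
    # the context is mostly stacked compaction notes.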
Also, using a lot of context can make the model perform poorly, so compaction can improve results. If you have a much larger context size, you have more headroom before the model starts to degrade (as it grows closer to the max context size). A larger context also lets you handle larger documents or reason over more data without breaking it up into subtasks. Eventually we want models' context to get much bigger so we can do more things in a session. (Some research is being done to see if we can get rid of the limit entirely.)
Anyways, I don't know the specifics well enough to argue with you on anything, but there is a cost for input tokens, and you see/pay it when you use the API directly or through OpenRouter. Maybe you've looked at the leaked source for Claude Code and can tell me definitively otherwise, but Anthropic's and OpenAI's incentives for when to compact are not always aligned with the user's, depending on the pricing plan.
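Back-of-the-envelope, with a made-up but plausible input price (not an actual Anthropic/OpenAI rate):

    # Illustrative only: the price below is an assumption.
    price_per_m_input = 3.00      # $ per 1M input tokens (assumed)
    context_tokens = 150_000      # resent on every turn late in a session
    turns = 50
    cost = context_tokens * turns * price_per_m_input / 1_000_000
    print(f"${cost:.2f}")         # $22.50 just to keep re-reading context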
Subscription plans are the "first hit is free" plans. Real pricing once subscriptions are phased out in a year or two is gonna be orders of magnitude more.
As for why: why would you not? Sitting around waiting for a single assistant is an inefficient use of time; I tend to have more like 4-10 instances running in parallel.
I also do not run 10 agents at the same time. There's no way I could keep up with the volume of work from doing that in any meaningful way.
You don't run 10 agents to get more volume of work. You run 10 agents to get better quality work.
I'd imagine plenty of people have a problem with trusting fly-by-night inference providers, or model owners with opt-out policies [1] [2] about training on your data, but who would be more than happy to send data to EC2, or even to the same models on Amazon Bedrock.
[1]: https://github.blog/news-insights/company-news/updates-to-gi...
[2]: https://help.openai.com/en/articles/5722486-how-your-data-is...
The new ones running on a 16GB M1 are maybe GPT-4 level (with decent performance, to be fair).
I wonder if it's possible to make some hyper-overtuned model that does nothing but, say, program in Python, and get SOTA-ish performance in that narrow task.
For most of my questions, an 8-9B model works great. The upshot is not having ChatGPT/Meta sell my data or target me with random thoughts later.
Qwen3.6 is pretty amazing for a 27B model, but it's not hard to run into its limits. With a Radeon R9700 and Unsloth's 6-bit quantization, I get ~20 TPS and 110k of context, so it can do a fair bit quickly.
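To put ~20 TPS in perspective (decode only; prompt processing is a separate, usually much faster number):

    # Decode-only generation time at the speed above.
    tps = 20                      # tokens/second, from above
    for n in (500, 2_000, 10_000):
        print(f"{n} tokens: ~{n / tps / 60:.1f} min")
    # -> 500 tokens: ~0.4 min, 2000: ~1.7 min, 10000: ~8.3 min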
Not sure what Claude Opus or Sonnet run at. I know that when it goes offline, it's 0 tokens per second.
Fortunately, I don't have to pick one or the other; instead I run Qwen 3.6 35B A3B. It's a bit slow with my 8 GB GPU (I'm in the process of getting a bigger one), but again, to me the choice isn't "what's the best I can get", it's "what's the best local I can get".
Cost is not a reason to go local.
I also wonder what quantization you're using? If you haven't tried other quants, I really would.
You can get work done with them if you have a harness that can drive outcomes without needing feedback (I've been building a TDD red-to-green agent harness lately that is very effective when given a good plan upfront). So if you can stand waiting a few days for results that would only take hours with a model deployed on frontier Nvidia hardware, you can get results this way.
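The shape of such a harness is pretty simple (a minimal sketch; ask_model_for_patch and apply_patch are hypothetical stand-ins for your model call and patch application):

    import subprocess

    def run_tests():
        # Red/green signal: nonzero exit code means failing tests.
        r = subprocess.run(["pytest", "-x", "-q"],
                           capture_output=True, text=True)
        return r.returncode == 0, r.stdout + r.stderr

    def red_to_green(plan, ask_model_for_patch, apply_patch, max_iters=20):
        # Loop: run tests, feed failures + the plan to the model,
        # apply its patch, repeat until green or out of attempts.
        for _ in range(max_iters):
            green, output = run_tests()
            if green:
                return True
            patch = ask_model_for_patch(plan, output)
            apply_patch(patch)
        return False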
"./claude-2.1.126-linux-x64
Welcome to Claude Code v2.1.126
Unable to connect to Anthropic services
Failed to connect to api.anthropic.com: ECONNREFUSED
Please check your internet connection and network settings.
Note: Claude Code might not be available in your country. Check supported countries at https://anthropic.com/supported-countries"
Let me also add that most services that claim to be private will still connect to the internet. LM Studio and many others will try to phone home. I don't remember a single one that doesn't connect to its servers and send some kind of information.