It's a lot of money to be spending for a single 9 of reliability.
I'm seriously over the current Claude experience. After seemingly fixing my 4.6 usage by disabling adaptive thinking and moving to max effort, it seems that the release of 4.7 has broken that workflow, and I'm 99% certain that disabling adaptive thinking does nothing even on 4.6 now. Just egregious errors in 2 days this week after coming back from vacation.
If you don’t mind an opinionated harness that asks for a pretty specific workflow, but one that works well, use OpenCode.
If you want to spread your wings and feel the sweet kiss of freedom, use Pi.
I started developing my own coding agent after using Pi for a couple of months, so I'm curious what you don't like about Pi.
To save others a click (though the article is worth reading):
He also mentions that Pi has no subagents by default.
Largely though, I'm happy with Claude Code w/ IDE integration, so I don't feel the need to migrate. Nonetheless I'm curious.
https://support.claude.com/en/articles/9797531-what-is-the-e...
Local models sound great until you realize you don't get a lot of the features that we implicitly expect from hosted models. Many things would require additional investment in operations and setup to get to a comparable system. We ended up wanting things that would have required us to roll our own memory system, model harnesses, compliance tooling, and security controls. It was possible for us to invest in this, but it would have required additional investment in hiring or training to get us to a state comparable to the hosted options.
Eventually, I had to recommend against the project, as it was more likely to be an investment in the leading team's resume than an actual investment in our organization.
Your last paragraph hints at retention struggles, which complicates the issue.
But was vendor mitigation not part of the evaluation? I get that most companies view governance and compliance as a pay-to-play issue, but there has always been an issue with rapidly changing areas and single-source suppliers.
I admit to having my own preferences and being almost completely ignorant about what your needs are, but I have seen the value in having a rabbit to pull out of the hat.
If employee retention doesn't allow for the departure of individuals without a complete loss of institutional knowledge, I guess my position wouldn't hold.
But during the rise of cloud computing I introduced an OpenStack install in our sandbox, not because I thought that we would stay on a private cloud but because it allowed our team to pull back the covers and understand what our cloud vendor was doing.
It was an adoption accelerator that enabled us to choose a vendor that was appropriate and to avoid the long tail of implementation.
It was valuable as a pivot when AMD killed SeaMicro on short notice, and the full cloud migration period was dramatically shortened.
I have a dozen other examples, but it is like stock options: volatility and uncertainty dramatically increase the value of keeping your options open.
We will have vendors fold, and a single-source-only story couples your org to the success of that vendor.
IMHO there is a huge difference between tying your success to an Oracle, who may be 'safe' if expensive as a captive customer, and doing the same in uncertain markets.
Would you be willing (or able) to share more?
Better to take the risk for most things. If the worst case happens and you have to migrate, you migrate. Otherwise you risk overengineering upfront and guaranteeing reduced productivity rather than merely risking it.
That's not local models vs hosted models, that's using the enterprise services from Anthropic. Any local LLM inference engine such as vLLM gives you an OpenAI-compatible API with the exact same features as a hosted model.
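To make that concrete, here's a minimal sketch of pointing the standard OpenAI client at a local vLLM server; the model name and port are illustrative assumptions on my part, not a recommendation:

    # Minimal sketch: talk to a local vLLM server through its OpenAI-compatible API.
    # Assumes a server was started with something like:
    #   vllm serve Qwen/Qwen2.5-Coder-32B-Instruct --port 8000
    # (model choice and port are illustrative, not from this thread)
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
        api_key="not-needed-locally",         # vLLM ignores the key unless --api-key is set
    )

    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct",
        messages=[{"role": "user", "content": "Explain what this function does: ..."}],
    )
    print(resp.choices[0].message.content)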
I'm not sure what your use case is, but I personally found Anthropic's offerings lacking and inferior to open source or custom-built solutions. I have yet to see any "memory" system that's better than markdown files or search, and harnesses for agentic AIs are a dime a dozen.
That level of hardware, if the performance were enough, is a much smaller investment and gamble.
Either way, I understand the decision. Your product isn't in locally hosted LLMs, so why fuss. That said, when I see a million-plus in external spend I start wondering about the options. Not saying you did the wrong thing; I think you did the right thing, but things seem to be changing on the local-model front, and quite rapidly.
The people who don't really know what they're doing (or don't care) need the full power of the SOTA models, those with experience can provide enough context and instruction to make even small local models work.
Look at how other companies, like AWS and Cloudflare, are suffering massive outages due to LLMs too. Two companies that used to be the best in the industry at uptime have suddenly faltered quite quickly.
Companies that have even worse standards will quickly realize how problematic these tools are. Hopefully before a recession, because this industry seems to be allergic to profitable businesses, and leaders that have been around since ZIRP have shown zero intelligence in navigating these times.
We did, however, determine that at least one of them (and perhaps all) would have been easily caught by AI code reviewers, had AI code reviewers been in use. So now we mandate that. And honestly, I love it; the AI reviewer spots all sorts of things that humans would probably miss.
(We also fixed a number of problems around configuration that would roll out globally too fast, leaving no time to notice errors and stop a bad rollout, as well as cases where services being down actually made it hard to revert the change... should be in a much better place now. But again, none of that had to do with LLMs.)
Is that true? At least one of them seemed to involve LLM-written code from what I saw. (Not to say that human error wasn't _also_ a contributing factor, but I wouldn't say it had _nothing_ to do with LLMs).
> We did, however, determine that at least one of them (and perhaps all) would have been easily caught by AI code reviewers, had AI code reviewers been in use. So now we mandate that. And honestly, I love it; the AI reviewer spots all sorts of things that humans would probably miss.
The reviewer is decent, but the false positive rate is substantial, and the false negative rate is definitely nonzero. Not that you would know that the way our genius CTO talks about it...
Not sure how much of a productivity gain it is at 2.5 million per year?
This is the brutal reality; even with the crazy reliability issues, demand is still far outstripping supply at the current price.
What remains to be seen is whether that demand sustains in the long run at that price point or flattens out and proves to be highly elastic, given that there are many other providers catching up pretty fast.
Out of curiosity, do you actually use it 24/7? The world doesn't collapse every time o365 goes down... (which is also pretty often)
You will also find that many problems in the harder sciences do not get easier by throwing more bodies at them. Comments like these remind me that some project managers think they'd be able to deliver a baby in 1 month if they simply had 9 women.
I'd have to disagree. There's a narrow band in the middle where that's true, but once you exceed that, your personal inputs and outputs matter less and less, and the contributions you make to the overall workplace, and how well you enable those around you, make a larger part of why you're compensated.
Even as an IC, the more you're able to mentor and elevate the people around you, the more your compensation will grow (if you're in the right place, and thus already at the right earnings bracket)
That's not how successful (software, in this case) teams are made.
What local alternative could replace your Anthropic use? I have found none. I don't think many have, which is why most of us pay Anthropic, rather than using one of the numerous, far cheaper, cloud services that host "local" class models.
Most of us are paying for access to proprietary SOTA models, rather than hosting.
And yet they will continue to spend wheelbarrows full of money with Anthropic because they want so badly to reach the point where they can fire you.
Doesn't seem to us to be wheelbarrows of money when you consider the average AWS/Azure bill.
> the increases that we have seen suggest a better ROI than if we had hired 12 developers.
You can’t argue “we were able to get away with not hiring more developers” and also say you aren’t replacing labor.
Morally I trend towards your side of things, but it’s also important to be realistic about what you’re actually doing. Money is going towards Anthropic and not towards new hires. That’s a replacement of labor. It doesn’t matter what the end goal was.
Hardly baseless when people have been gloating about how programming as a job is ending any day now for the last year at least.
> Doesn't seem to us to be wheelbarrows of money when you consider the average AWS/Azure bill.
You didn’t mention the size of the company so yeah.
I’m glad your leadership isn’t trying to fire everyone. But in case you live under a rock, tech layoffs are at all-time highs. Companies are rewarded by the public markets for laying off workers.
Simultaneously we have AI industry leaders warning of an employment apocalypse once AGI is achieved.
And you think it’s baseless. Have some class bro.
Besides, Codex wasn't always the answer.
That's silly. It's a JavaScript app, they are more or less open source by design. There was no secret sauce in Claude Code.
But glad my team is staying nimble and has multi-model (Anthropic, Codex, Gemini), multi-modal (desktop, CLI/TUI, web) dev tooling.
As our actual coding skills collectively atrophy, we'll either need to switch tools or go for a walk when the LLM is down.
In the cloud era I advised against a multi-cloud strategy, as the effort to impact just wasn't there. But perhaps this is different in the LLM era, where the cost of switching is pretty darn low.
Some of the comments here mention their monthly spend, and it’s eye watering.
LLM requests that do not call tools do not need anything external by definition.
No central server, nothing, they can even survive without the context cache.
All you need is to load (and only once!) the read-only, immutable model weights from an S3-like source on startup.
If it takes 4 servers to process a request, then you can group them 4 by 4, and then send a request to each group (sharding).
Copy-paste the exact same setup XXX times and there you have your highly parallelizable service (until you run out of money).
It's very doable; any serious SRE can find a way to set up "larger than one card" models like Kimi or DeepSeek (unquantized) if they have a tightly-coupled HPC cluster (or a pair of very, very beefy servers). If you run out of servers, then again that's a money problem, not an architectural problem (and modern datacenters are already scalable).
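A rough sketch of that replication idea (the endpoint names and the round-robin policy here are illustrative assumptions, not how any provider actually routes):

    # Stateless router spreading requests across identical inference groups.
    # Each URL is the front door of one group of tightly-coupled servers
    # (e.g. 4 nodes serving one replica of the model behind a single endpoint).
    import itertools
    import requests

    GROUPS = [
        "http://inference-group-0:8000/v1/chat/completions",
        "http://inference-group-1:8000/v1/chat/completions",
        "http://inference-group-2:8000/v1/chat/completions",
    ]
    _next_group = itertools.cycle(GROUPS)

    def route(payload: dict) -> dict:
        """Send one request to the next replica group; no shared state needed."""
        url = next(_next_group)
        return requests.post(url, json=payload, timeout=300).json()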
Take the best SRE, but no budget, and there is no solution.
So inference is the easy part.
With Codex or Claude Code, if a request takes a lot of time or has slow cold-start latency, it's considered very acceptable.
Some users would probably not even see the difference if a request takes 2 minutes versus 3 minutes.
The real difficult part is to have context caching and external tools, because now you are depending on services that might be lagging.
Executing code, browsing the web, all of that is tricky to scale because they are very unreliable (tends to timeout, requires large cache of web pages, circumventing captchas, etc).
These are traditional scaling problems, but they are more difficult because all these pieces are fragile and queues can snowball easily.
Hardware capacity constraints are going to be the big one.
Effective caching is another; I bet if you start hitting cold caches the whole thing's going to degrade rapidly.
The ground is probably shifting pretty rapidly.
Power users are trying to get the most out of their subscriptions and so are hammering you as fast as they possibly can. See Ralph loops.
Harnesses are evolving pretty rapidly, as are new alternative harnesses. That makes the load patterns less predictable and harder to cache.
The demand is increasing both from more customers and from each user as they figure out more effective workflows.
Users are pretty sensitive to model quality changes. You probably want smart routing, but users want the best model all the time.
Models keep getting bigger and bigger.
On top of that, they are probably hiring and onboarding more, so system complexity and codebase complexity are growing.
I can't even send them an angry message because clicking "Get help" does nothing.
Posting with a fresh account because I'm not supposed to share these details, for obvious reasons. If you want help on setting this up, just reply with a way to reach you.
Can you share what models you run and find best performing for this setup? That would help a lot. I already run a smaller AI server in the office, but only 32B models fit there. I already have experience optimizing inference; I'm just interested in what models you think are great for 8xH100 for coding. I'll figure out the details of how to fit it :)
We're also looking into how to do some secure cost sharing with this so that all people need to pay for are what it costs for us to run everything! We're just planning on reserving at least 51% of the capacity for us and the rest for everyone else.
I actually respect this a ton, good work.
Anyway, this is the internet and skepticism is warranted :D.
If you haven't done so already, finetune the model on all your company's code that you can get your hands on. This is one of the great advantages that you get when running local models. I like the style of the generated code much better now, I have to rewrite much less, and my prompts can be shorter too. But maybe these already are the "tweaks" that you mentioned.
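For anyone who hasn't done this before, a very rough sketch of what a LoRA-style finetune on an internal code dump can look like; the base model, file paths, and hyperparameters are placeholders, not my exact setup:

    # Rough sketch: LoRA finetune of a local model on a flat text dump of company code.
    # Everything named here (model, paths, hyperparameters) is illustrative.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "Qwen/Qwen2.5-Coder-7B"  # hypothetical base model
    tok = AutoTokenizer.from_pretrained(base)
    tok.pad_token = tok.pad_token or tok.eos_token
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(base),
        LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )

    # One flat text export of your repos, tokenized into fixed-length chunks.
    ds = load_dataset("text", data_files={"train": "company_code.txt"})["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="lora-out",
                               per_device_train_batch_size=1,
                               num_train_epochs=1,
                               learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()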
Sorry if this is a stupid question, I've never finetuned or trained a LLM.
They should ask Codex now that Claude Code is down.
I'm looking into how to structure my work to run some autonomous-safe jobs overnight to take advantage of it.
What are you doing with the authentication servers? This isn't the first downtime I've seen caused by that.
Good thing I checked Hacker News first
Oh no wait... the outage is with AI itself, so how can AI help? Allow me to re-evaluate.
Fublutenuating...
Yes, let's ask AI!
Oh no wait... the outage is with AI itself, I already correctly identified this above.
Bubbluating...
It seems you will have to rely on your engineering skills to solve this problem yourself, ie, you're cooked! I will auto-renew your subscription to ensure you can be sure you'll have access to AI to solve this problem if it ever comes back online.
No!
Comboculating...
I apologize for the misunderstanding, I have deleted your project. I am sorry, would you like me to restart everything from scratch?
Clearly, half of Anthropic should have subscriptions to OpenAI or Mistral or whatever China sells.
Have telegram set up and plotting to take over the world
Start doing post mortems then!
At the very least, them using any off the shelf service that's shitting the bed would inform others to stay away from it - like an IAM solution, or maybe a particular DB in a specific configuration backing whatever they've written, or a given architecture for a given scale.
Right now it's a complete black box that sometimes goes down, and we don't get much information about why it's so much less stable than other options (hey, if they just came out and said "We're growing 10x faster than we anticipated and systems X, Y, and Z are not architected for that," that'd also be useful signal).
Or, who knows, maybe it's just bad deploys - seems like it's back for me and claude.ai UI looks a bit different hmmm.
> I'm working with Claude Code on session aaaaaaaa-bbbb-1223-3445-abcdefabcdef which I'd like to hand-off to you, do you know how to read the session, my input and Claude's output so we can resume where I left off?
gpt-5.5, medium effort. "Resumed" session fully in under 2 minutes. Outages like today's are so common that I've now got the time to re-evaluate Codex every other day.
Who is We? I thought software engineers were going to be redundant and AI could do it all itself? (not to take anything away from Claude code + Claude both of which I love)
I'll just go for a walk outside.
And I don't mean "if I can't access Claude to do my work", I mean, just in general - I'll just ping claude.ai from time to time and use Claude's breaks as a break reminder.
Why should AI get a breather and not us?
Reminds me of the early days of World of Warcraft, when servers went down frequently because Blizzard couldn't keep up with all the load. Everyone was frustrated but of course nobody stopped playing.
So there was a recent article I read which said that Claude is now trading at a trillion-dollar (yes, with a T) valuation in private markets.
We are definitely creating corporations and people which depend on AI companies themselves, and the reliability of these tools is certainly a question worth asking. I am seeing quite a lot of downtime in products like GitHub and Claude being shown on Hacker News multiple times.
Is there a life cycle of enshittification for such products which grow too valuable? What are (are there?) some practical lessons for such scalability that these trillion-dollar companies are missing, or is it just a dose of reality that such massive corporations can't compete on downtime with even my $7/yr VPS?
My question is: is this an engineering roadblock, with its limits rooted in reality, or a management/enterprise roadblock to low downtime?
But seriously: while I don't use Claude, this issue of perceived unreliability seems to be approaching the point of existential risk for Anthropic. What's the theory about why they're struggling? Compute capacity? Load? Lack of focus on SRE?
Put it another way: is their downtime due to something fundamental about serving inference, or just bad engineering choices? Given their resources, it seems astonishing.
Luckily, the Qwen3.6 35B A3B local LLM also works fine when Claude is offline.