Is this a symptom of the same phenomenon behind the deluge of disposable JavaScript frameworks of just ten years ago? Is it peer pressure, fear of missing out? At its root, I suspect so; of course I would imagine it's rare for the C-suite to have ever mandated the usage of a specific language or framework, and LLMs represent an unprecedented lever of power to have an even bigger shot at first mover's advantage, from a business perspective. (Yes, I am aware of how "good enough" local models have become for many.)
I don't really have anything useful nor actionable to say here regarding this dialling back of capability to deal with capacity issues. Are there any indications of shops or individual contributors with contingency plans on the table for dialling back LLM usage in kind to mitigate these unknowns? I know the calculus is such that potential (and frequently realised) gains heavily outweigh the risks of going all in, but, in the grander scheme of time and circumstance, long term commitments are starting to be more apparently risky. I am purposefully trying to avoid "begging the question" here; if instead of LLMs, this were some other tool or service, reactions to these events would have been far more pragmatic, with less of a reticence to invest time on in-house solutions when dealing with flaky vendors.
It seems like every LLM thread for the past couple years is full of posts saying that the latest hot AI tool/approach has made them unbelievably more productive, followed by others saying they found that same thing underwhelming.
I don't think many of you have legitimately tried Claude Code, or maybe you're holding it wrong.
I'm getting 10x the work done. I'm operating at all layers of the stack with a speed and rapidity I've never had before.
And before anyone accuses me of being some "vibe coder", I've built five nines active-active money rails that move billions of dollars a day at 50kqps+, amongst lots of other hard hitting platform engineering work. Serious senior engineering for over a decade.
This isn't just a "cool technology". We've exited the punch card phase. And that is hard or impossible to come back from.
If you're not seeing these same successes, I legitimately think you're using it wrong.
I honestly don't like subscription services, hyperscaler concentration of power, or the fact I can't run Opus locally. But it doesn't matter - the tool exists in the shape it does, and I have to consume it in the way that it's presented. I hope for a different offering that is more democratic and open, but right now the market hasn't provided that.
It's as if you got access to fiber or broadband and were asked to go back to ISDN/dial up.
I just don’t see how I could export 10x the work and have it properly validated by peers at this point in time. I may be able to generate code 10-20x faster, but there are nuances that only a human can reason about in my particular sector.
When I do code, it's almost always something novel that I don't know how I'm going to implement until I code a few pieces and see how they fit together. If it's a fairly routine feature based on an existing pattern, I assign it to one of the other devs.
In my experience, the people who 10X their output with Claude Code fit one of two categories:
1. They're not really taking the time to understand the code they're submitting. They might do a skim over the output and see that it looks reasonable and passes tests, but they aren't taking time to understand the code as if they were pair programming. Only when it breaks and the LLM can't patch it up quickly do they go in and fully understand the code.
2. They moved very slowly before Claude Code. I've had some coworkers who would take 2-3 days to get a simple PR out because, to be frank, their work days weren't full of a lot of work. Every time they'd run into a question they'd stop and then bumble around for a few hours until they could talk to the ticket creator about it. They'd get tired of working on a task by 2PM and then save the rest of the work for tomorrow. They'd get an idea and decide to rewrite the PR the next day, and on and on with distractions. When they start using Claude Code the LLM doesn't have the same holdups, so now every time where they were getting stuck or tired before is replaced by an LLM powering through to some solution. Their cognitive load is reduced so they're no longer freezing up during the day. They aren't really becoming 10X engineers like they think, but really just catching up to normal pace
Another commenter mentioned that Docker, git, etc. were all tools that greatly enhanced productivity and coding agents are just another tool that does that. I would agree, but argue that it's more impactful than all of those tools combined.
[1] https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
I spend a lot of time reviewing any code that comes out of Claude Code. Even using Opus 4.6 with max effort there is almost always something that needs to be changed, often dramatically.
I can see how people go down the path of thinking "Wow, this code compiles and passes my tests! Ship it!" and start handing trust over to Opus, but I've already seen what this turns into 6 months down the road: Projects get mired down in so much complexity and LLM spaghetti that the codebase becomes fragile. Everyone is sidetracked restructuring messy code from the past, then fighting bugs that appear in the change.
I can believe some of the more recent studies showing LLMs can accelerate work by circa 20% (1.2X) because that's on the same order of magnitude that I and others are seeing with careful use.
When someone comes out and claims 10X more output, I simply cannot believe they're doing careful engineering work instead of just shipping the output after a cursory glance.
I can use the agent to scaffold a lot of test/demo frameworks around the pieces I'm working on pretty cleanly and have the agent fill in. I still spend a lot of time validating the tests and the code being completed though.
The errors I tend to get from the agent are roughly similar to what I might see from a developer/team that works remotely... you still need to verify. The difference is the turn around seems to be minutes over days. You're also able to observe over simply review... When I see a bad path, I can usually abort/cancel, revert back to the last commit and try again with more planning.
After I solved entrepreneurship I decided to retire and I now spend my days reading HN, posting on topics about AI.
"I gotta be present." Me: Reenacting the Malcolm Reynolds too many responses meme.
I mostly believe you. I have seen hints of what you are talking about.
But often times I feel like I’m on the right track but I’m actually just spinning when wheels and the AI is just happily going along with it.
Or I’m getting too deep on something and I’m caught up in the loop, becoming ungrounded from the reality of the code and the specific problem.
If I notice that and am not too tired, I can reel it back in and re-ground things. Take a step back and make sure we are on reasonable path.
But I’m realizing it can be surprisingly difficult to catch that loop early sometimes. At least for me.
I’ve also done some pretty awesome shit with it that either would have never happened or taken far longer without AI — easily 5x-10x in many cases. It’s all quite fascinating.
Much to learn. This idea is forming for me that developing good “AI discipline” is incredibly important.
P.s. sometimes I also get this weird feeling of “AI exhaustion”. Where the thought of sending another prompt feels quite painful. The last week I’ve felt that a lot.
P.p.s. And then of course this doesn’t even touch on maintaining code quality over time. The “after” part when the LLM implements something. There are lots of good patterns and approaches for handling this, but it’s a distinct phase of the process with lots of complexities and nuances. And it’s oh-so-temping to skip or postpone. More so if the AI output is larger — exactly when you need it most.
In all seriousness though, writing code, or even sitting down and properly architecting things, have never been bottlenecks for me. It has either been artificial deadlines preventing me from writing proper unit tests, or the requirement for code review from people on my team who don't even work on the same codebase as I do on a daily basis. I have often stated and stand by the assertion that I develop at the speed of my own understanding, and I think that is a good virtue to carry forth that I think will stand the test of time and bring about the best organisational outcomes. It's just a matter of finding the right place that values this approach.
Edit for context: My team is an ops team that needed a couple developers; I was picked to implement some internal tooling. The deadlines I was given for the initial development are tied directly to my performance evaluation. My boss has only ever been a manager for almost two years. He has only ever had development headcount for less than a year. He has never been on a development team himself. The man does not take breaks and micromanages at every opportunity he gets. He is paranoid for his job, thinking he is going to be imminently replaced by our (cheaper) EU counterparts. His management style and verbal admonitions reflect this; he frequently projects these insecurities onto others, using unnecessarily accusatory speech. I am not the only developer on my team who has had such interactions with him. I have screenshots of conversations with him that I felt necessary to present to a therapist. This degree of time pressure is entirely unprecedented in my 20 year career. Yes, this is a dysfunctional environment.
I have never experienced this, and it sounds remarkably dysfunctional to me.
I've tried everything I can to cope and am not sure I will be willing to return to that team once I am past my medical leave.
I struggle to believe that a ton of seemingly intelligent software engineers are too dumb to figure out how to use Claude code to get reliable results, it seems much more likely to me that it can do well at isolated tasks or new projects but fails when pointed at large complex code bases because it just... is a token predictor lol.
But yeah spinning up a green fields project in an extensively solved area (ledgers) is going to be something an AI shines at.
It isn't like we don't use this stuff also, I ask Cursor to do things 20x a day and it does something I don't like 50% of the time. Even things like pasting an error message it struggles with. How do I reconcile my actual daily experience with hype messages I see online?
Many software devs work in teams on large projects where LLMs have a more nuanced value. I myself mostly work on a large project inside a large organization. Spitting out lines of code is practically never a bottleneck for me. Running a suite of agents to generate out a ton of code for my coworkers to review doesn't really solve a problem that I have. I still use Claude in other ways and find it useful, but I'm certainly not 10x more productive with it.
I couldn't disagree with this more. It's impressive at building demos, but asking it to build the foundation for a long-term project has been disastrous in my experience.
When you have an established project and you're asking it to color between the lines it can do that well (most of the time), but when you give it a blank canvas and a lot of autonomy it will likely end up generating crap code at a staggering pace. It becomes a constant fight against entropy where every mess you don't clean up immediately gets picked up as "the way things should be done" the next time.
Before someone asks, this is my experience with both Claude Code (Sonnet/Opus 4.6) and Codex (GPT 5.4).
So it's not that they're too stupid. There are various motivations for this: clinging on to familiarity, resistance to what feels like yet another tool, anti-AI koolaid, earnestly underwhelmed but don't understand how much better it can be, reacting to what they perceive to be incessant cheerleading, etc.
It's kind of like anti-Javascript posts on HN 10+ years ago. These people weren't too stupid to understand how you could steelman Node.js, they just weren't curious enough to ask, and maybe it turned out they hadn't even used Javascript since "DHTML" was a term except to do $(".box").toggle().
I wish there were more curiosity on HN.
Hypothetically, you have a simple slice out of bounds error because a function is getting an empty string so it does something like: `""[5]`.
Opus will add a bunch of length & nil checks to "fix" this, but the actual issue is the string should never be empty. The nil checks are just papering over a deeper issue, like you probably need a schema level check for minimum string length.
At that point do you just tell it like "no delete all that, the string should never be empty" and let it figure that out, or do I basically need to pseudo code "add a check for empty strings to this file on line 145", or do I just YOLO and know the issue is gone now so it is no longer my problem?
My bigger point is how does an LLM know that this seemingly small problem is indicative of some larger failure, like lets say this string is a `user.username` which means users can set their name to empty which means an entire migration is probably necessary. All the AI is going to do is smoosh the error messages and kick the can.
Then work on making sure the LLM has all the info it needs. In this example it sounds like perhaps your hypothetical data model would need to be better typed and/or documented.
But yeah as of today it won't pick up on smells as you do, at least not without extra skills/prompting. You'll find that comforting or annoying depending on where you stand...
`the error is on line #145 fix it with XYZ and add a check that no string should ever be blank`
It's the randomness that is frustrating, and that the fix would be quicker to manually input that drives me crazy. I fear that all the "rules" I add to claude.md is wasting my available tokens it won't have enough room to process my request.
I think Claude makes me faster, but the struggle is always centered around retaining own context and reviewing code fully. Reviewing code fully to make sure it’s correct and the way I want it, retaining my own context to speed up reviews and not get lost.
I firmly believe people who are seeing massive gains are simply ignoring x% lines of code. There’s an argument to be made for that being acceptable, but it’s a risk analysis problem currently. Not one I subscribe to.
You get a better solution but also a plan file that you can review. And, also important, have another agent review. I've found that Codex is really good at reviewing plans.
I have an AGENTS.md prompt that explains that plan file review involves ranking the top findings by severity, explaining the impact, and recommending a fix to each one. And finally recommend a simpler directional pivot if one exists for the plan.
So, start the plan in Claude Code, type "Review this plan: <path>" in Codex (or another Claude Code agent), and cycle the findings back into Claude Code to refine the plan. When the plan is updated, write "Plan updated" to the reviewer agent.
You should get much better results with this capable of much better arch-level changes rather than narrow topical solutions.
If that's still not working sufficiently for you, maybe you could use more support, like a type-system and more goals in AGENTS.md?
For new features, I spend a bit of time thinking, and I can usually break it down in smaller tasks that are easy to code and verify. No need to wrangle with Plan mode and a big markdown file.
I can usually get things one-shotted by that point if I bother with the agent.
2. I architecturally describe every change I want made. I don't leave it up to the LLM to guess. My prompts might be overkill, but they result in 70-80ish% correctness in one shot. (I haven't measured this, and I'm actually curious.) I'll paste in file paths, method names, struct definitions and ask Claude for concrete changes. I'll expand "plumb foo field through the query and API layers" into as much detail as necessary. My prompts can be several paragraphs in length.
3. I don't attempt an entire change set or PR with a single prompt. I work iteratively as I would naturally work, just at a higher level and with greater and broader scope. You get a sense of what granularity and scope Claude can be effective at after a while.
You can't one shot stuff. You have to work iteratively. A single PR might be multiple round trips of incremental change. It's like being a "film director" or "pair programmer" writing code. I have exacting specifications and directions.
The power is in how fast these changes can be made and how closely they map to your expectations. And also in how little it drains your energy and focus.
This also gives me a chance to code review at every change, which means by the time I review the final PR, I've read the change set multiple times.
Otherwise you should switch to haskal since it makes logic errors and bugs mathematically impossible.
Seemingly is doing the heavy lifting here. If you read enough comment threads on HN, it will become obvious why they aren’t getting results.
They're not dumb, but I'm not surprised they're struggling.
A developer's mindset has to change when adding AI into the mix, and many developers either can’t or won’t do that. Developers whose commits that look something like "Fixed some bugs" probably aren’t going to take the time to write a decent prompt either.
Whenever there's a technology shift, there are always people who can't or won't adapt. And let's be honest, there are folks whose agenda (consciously or not) is to keep the status quo and "prove" that AI is a bad thing.
No wonder we're seeing wildly different stories about the effectiveness of coding agents.
One is that they tried AI-based coding a year or two ago, came to the IMHO completely correct at that time conclusion that it was nearly useless, and have not tried it since then to see that the situation has changed. To which the solution is, try it again. It changed a lot.
The other are those who have incorporated into their personal identity that they hate AI and will never use it. I have seen people do things like fire AI at a task they have good reasons to believe it will fail at, and when it does, project that out to all tasks without letting themselves consciously realize that picking a bad task on purpose skews the deck.
To those people my solution is to encourage them to hold on to their skepticism. I try to hold on to it as well despite the incredible cognitive temptation not to. It is very useful. But at the same time... yeah, there was a step change in the past year or so. It has gotten a lot more useful...
... but a lot of that utility is in ways that don't obviate skilled senior coding skills. It likes to write scripting code without strong types. Since the last time I wrote that, I have in fact used it in a situation where there were enough strong types that it spontaneously originated some, but it still tends to write scripting code out of that context no matter what language it is working in. It is good at very straight-line solutions to code but I rarely see it suggest using databases, or event sourcing, or a message bus, or any of a lot of other things... it has a lot of Not Invented Here syndrome where it instead bashes out some minimal solution that passes the unit tests with flying colors but can't be deployed at scale. No matter how much documentation a project has it often ends up duplicating code just because the context window is only so large and it doesn't necessarily know where the duplicated code might be. There's all sorts of ways it still needs help to produce good output.
I also wonder how many people are failing to prompt it enough. Some of my prompts are basically "take this and do that and write a function to log the error", but a lot of my prompts are a screen or two of relevant context of the project, what it is we are trying to do, why the obvious solution doesn't work, here's some other code to look at, here's the relevant bugs and some Wiki documentation on the planning of the project, we should use {event sourcing/immutable trees/stored procedures/whatever}, interact with me for questions before starting anything. This is not a complete explanation of what they are doing anymore, but there's still a lot of ways in which what an LLM can really do is style transfer... it is just taking "take this and do that and write a function to log the error" and style-transforming that into source code. If you want it to do something interesting it really helps to give it enough information in the first place for the "style transfer" to get a hold of and do something with. Don't feel silly "explaining it to a computer", you're giving the function enough data to operate on.
Need some help selling these notepad apps, do you have a prompt for that?
I'm surprised nobody thought of it before me but basically the LLM's are trained on the internet and I just had it spit back out everything.
It's running in parallel so I can validate it, which of course I'm using LLM's to do that.
Once it's ready I will put it on the market, but get this, my internet will be cheaper than the current internet. I'll probably just make it one cheaper, like if the current internet costs, for example, 7, I'll make my internet cost 6.
The challenge now is how to plan architectures and codebases to get really big and really scale, without AI slop creating hidden tech debt.
Foundations of the code must be very solid, and the architecture from the start has to be right. But even redoing the architecture becomes so much faster now...
I'm just curious, why do you "have to"? Don't get me wrong, I'm making the same choice myself too, realizing a bunch of global drawbacks because of my local/personal preference, but I won't claim I have to, it's a choice I'm making because I'm lazy.
I could pay API prices for the same models, but aside from paying much more for the same result that doesn't seem helpful
I could pay a 4-5 figure sum for hardware to run a far inferior open model
I could pay a six figure sum for hardware to run an open model that's only a couple months behind in capability (or a 4-5 figure sum to run the same model at a snail's pace)
I could pay API costs to semi-trustworthy inference provider to run one of those open models
None of those seem like great alternatives. If I want cutting-edge coding performance then a subscription is the most reasonable option
Note that this applies mostly to coding. For many other tasks local models or paid inference on open models is very reasonable. But for coding that last bit of performance matters
I'm given a tool that lets me 10x "provide value".
My personal preferences and tastes literally do not matter.
What is “using it right”? You wrote claims, but explain nothing about your process. Anything not reproducible is either luck or lie.
You sound like a pro wrestler. I'd like to know what "hard-hitting" engineering work is. Hydraulic hammers?
It's also like.... difficult to honestly and accurately measure. And account for whether or not you're getting lucky based on your underlying dependencies (servers, etc) not crashing as much as advertised, or if it's actually five nines. Or whether you've run it for a month and gotten <30s of measure downtime and declared victory, vs run it for three years with copious software updates.
I always assume most people claiming five nines are just not measuring it correctly, or have not exercised the full set of things that will go wrong over a long enough period of time (dc failures, network partitions, config errors, bad network switches that drop only UDP traffic on certain ports, erroneous ACL changes, bad software updates, etc etc)
Maybe they did it all correct though, in which case, yea, seems hard hitting to me.
Here's a reason not in your list.
Short version: A kind of peer pressure, but from above. In some circles I'm told a developer must have AI skills on their resume now, and those probably need to be with well known subscription services, or they substantially reduce their employment prospects.
Multiple people I know who are employers have recently, without prompting, told me they no longer hire developers who don't use AI in their workflow.
One of them told me all the employers they know think "seniors" fall into two camps, those who are embracing AI and therefore nimble and adaptive, and those who are avoiding it and therefore too backward-looking, stuck-in-their-ways to be a good hire for the future. So if they don't see signs of AI usage on a senior dev's resume now, that's an automatic discard. For devs I know laid off from an R&D company where AI was not permitted for development (for IP/confidentiality reasons), that's unfair as they were certainly not backward-looking people, but the market is not fair.
Another "business leader" employer I met recently told me his devs are divided into those who are embracing AI and those who aren't, said he finds software feature development "so slow!", and said if it wasn't for employment law he'd fire all his devs who aren't choosing to use AI. I assume he was joking, but it was interesting to hear it said out loud without prompting.
I've been to several business leadership type meetups in recent months, and it seems to be simply assumed that everyone is using AI for almost everything worth talking about. I don't think they really are, so it's interesting to watch that narrative playing out.
Why does it sound like you're on drugs? I know that sounds extremely rude, but I can't think of any other reasonable comparison for that language.
It's hard to take these kinds of endorsements seriously when they're written so hyperbolically, in terms of the same cliches, and focused on entirely on how it makes you feel rather than what it does.
People claim that DoorDash and other similar apps are about efficiency, but I suspect a large portion is also a desire to remove human interaction. LLMs are the same. Or, in actuality, to create a simulacrum of human interaction that is satisfying enough.
Imagine being an Uber driver and suddenly have to switch to a rickshaw for several hours.
This has basically been what all of Silicon Valley sounds like to me for a few years now.
They are known for abusing many psycho-stimulants out there. The stupid “manifesto” Marc Andreessen put out a while back sounded like adderall-produced drivel more than a coherent political manifesto.
This is similar to how we have already found hacks in our evolutionary programming to directly deliver high amounts of flavor without nutrition, and we've been working on ever more complex means of delivering social stimulation without the need for other human (one of the key appeals of AI for many people, as well).
Of course these are all the ravings of a crank and should be ignored.
What I don’t understand, are the people who let it go over night or with whole “agent teams” working on software. I have no idea how they trust any of it.
Code is notation, just like music sheets, or food recipes. If your interaction with anyone else is with the end result only (the software), the. The code does not matter. But for collaboration, it does. When it’s badly written, that just increase everyone burden.
It’s like forcing everyone to learn a symphony with the record instead of the sheets. And often a badly recorded version.
Do you think that is impossible? There are plenty of people who enjoy composing music on things like trackers, with no intent of ever playing said music on an instrument.
I love coding, but I also like making things, and the two are in conflict: When I write code for the sake of writing code, I am meticulous and look for perfection. When I make things, I want to move as fast as possible, because it is the end-product that matters.
There is also a hidden presumption in what you've written that 1) the code will be badly written. Sometimes it is, but that is the case for people to, but often it is better than what I would produce (say, when needing to produce something in a language I'm not familiar enough with), 2) and that the collaboration will be with people manually working on the code. That is increasingly often not true.
I struggle to understand that comparison. Code is notation, you can’t write code for the sake of writing code. You have a problem and you instruct the computer how to do it. And for the sake of your collaborator and your futher self, you take care of how you write that. There’s no real distinction IMO.
> There is also a hidden presumption in what you've written that 1) the code will be badly written
The computers does not really care about what programming language you’re using and the name of your variables and other indentifiers. People do. You can have correct code (decompiled assembly or minified JavaScript) and no one will wants to collaborate on that.
Code is often the most precise explanation of some process. By being formal, it’s a truthful representation of the process. Specs and documentations can describe truth, but they do not embody it.
You can always collaborate with markdown files. But eventually someone will have to look at the code and understand what it does, because that’s the truth that matters. Anything else is prayers and hope. And if you’ve never cared about maintainability and quality of the code, it will probably be an arduous process.
Isn't this almost certainly against ToS, at least if you're using "plans" (as opposed to paying per-token)?
I still use more traditional approach for finding bugs and other issues in my code, but the agentic workflow doesn't give me any net value.
As an example, a long term goal at the employer I work for is exactly this: run LLMs locally. There's a big infrastructure backlog through, so it's waiting on those things, and hopefully we'll see good local models by then that can do what Claude Sonnet or GPT-5.3-Codex can do today.
It's why we pay stupid amounts for takeout when it's a button away, it's why we accept the issues that come with online dating rather than breaking the ice outside, it's why there's been decades scams that claim to get you abs without effort...
LLMs are the ultimate friction removal. They can remove gaps or mechanical work that regular programming can, but more importantly they can think for you.
I'm convinced this human pattern is as dangerous as addiction. But it's so much harder to fight against, because who's going to be in favor of doing things with more effort rather than less? The whole point of capitalism is supposed to be that it rewards efficiency.
Aw hell. You found my vice and my own cognitive dissonance here. If I want to truly stand by my convictions, I should probably cook more and log off. Waiting for signs that the tides are turning and that people are beginning to value a slower, more methodical approach again isn't doing anything in the current moment to stave off the genuine feelings of dread that have honestly led to some suicidal ideation.
(this is serious and not sarcasm, by the way)
By which I mean, it's likely you're not the only one feeling that dread. We're due for a counter movement, and it's a matter of time to see it flower.
It would be cool to run SOTA models on my own hardware but I can't. Hence, the subscription.
We're paying for servers that sit idle at night, you don't find enough sysadmins for the current problems, the open source models aren't as strong as closed source, providing context (as in googling) means you hook everything up to the internet anyway, where do you find the power and the cooling systems and the space, what do you do with the GPUs after 3 years?
Suddenly that $500/month/user seems like a steal.
Maybe in 5 years we'll have an open weights model that is in the "good enough" category that I can run on a RTX 9000 for 15k dollars or whatever.
Lately though the RAM crisis is continuing and making things like this more unfeasible. But you can still use a lot of smaller models for coding and testing tasks.
Planning tasks I'd use a cloud hosted one, for now, because gemma4 isn't there yet and because the GPU prices are still quite insane.
The cool and fun part is that with ollama and vllm you can just build your own agentic environment IDE, give it the tools you like, and make the workflow however you like. And it isn't even that hard to do, it just needs a lot of tweaking and prompt fiddling.
And on top of that: Use kiwix to selfhost Wikipedia, stackoverflow and devdocs. Give the LLM a tool to use the search and read the pages, and your productivity is skyrocketing pretty quickly. No need anymore to have internet, and a cheap Intel NUC is good enough for self-hosting a lot of containers already.
Source: I am building my own offline agentic environment for Golang [1] which is pretty experimental but sometimes it's also working.
The LLM bit though, personally, is just not for me.
0 as of this writing, it's noticeable. Lots of "should I continue?" And "you should run this command if you want to see that information." Roadblocks that I hadn't seen in a year+
But as it stands, the more likely reason is capacity crunch caused by a chips shortage and demand heavily outpacing supply. You vibe coding reason is based on as much vibes as their code probably is.
I recently vibe-translated a simple project from Javascript to C, where Javascript was producing 30fps, and the first C version produced 1 frame every 20 seconds. After some time trying to get the AI to optimize it, I arrived at 1fps from the C project. Not a win, but the AI did produce working C code.
I have no doubt that if I had done this myself (which I will do soon), with the appropriate level of care, it would have been 30fps or more.
That means they are going to be far more constrained infrastructurally than some of the competition. I think this is some of the constraints that we are seeing.
They don't have compute because they didn't play the game and get the good rates a couple of years ago, and are now forced to work with third-rate providers. That's not a strategy.
I would take everything he says with a huge grain of salt.
[0] “We’re buying a lot. We’re buying a hell of a lot. We’re buying an amount that’s comparable to what the biggest players in the game are buying.”
“Profitability is this kind of weird thing in this field. I don’t think in this field profitability is actually a measure of spending down versus investing in the business.”
[1] “You don’t just serve the current models and never train another model, because then you don’t have any demand because you’ll fall behind.”
So he's not spending so they can be profitable, AND spending as much as the biggest players are spending, AND not really looking at profit as a measure of anything? K.
they're looking to IPO in 2028 vs 2030 for OpenAI, who have raised more than double the funds
so they're willing to play fast and loose with the terms and conditions of existing customers trying to make it happen
those pockets must be drying up really fast
Codex shines really well at what I call "hard problems." You set thinking high, and you just let it throw raw power at the problem. Whereas, Claude Code is better at your average day-to-day "write me code" tasks.
So the difference is kind of nuanced. You kind of need to use both a while to get a real sense of it.
They’re still doing subscriptions: https://developers.openai.com/codex/pricing
There was a headline saying they were, and the actual article showed they were doing nothingbof the sort.
If you read HN headlines, and don't even bother to click into the comments and see everyone calling out the headline as bogus, you might think something like your statement is true.
Edit: Looks like it still works with subs, they just measure usage per token instead of per message.
Before a Subscription was the cheapest way to gain Codex usage, but now they've essentially having API and Subscription pricing match (e.g. $200 sub = $200 in API Codex usage).
The only value of a subscription now is that you get the web version of ChatGPT "free." In terms of raw Codex usage, you could just as easily buy API usage.
edit: This is currently rolled out for Enterprise, but is coming to Pro/Plus soon. The people below saying "I haven't had this issue" haven't yet*.
I don't think it's made out like that, I'm on the ChatGPT Pro plan for personal usage, and for a client I'm using the OpenAI API, both almost only using GPT 5.4 xhigh, done pretty much 50/50 work on client/personal projects, and clients API usage is up to 400 USD right now after a week of work, and ChatGPT Pro limit has 61% left, resets tomorrow.
Still seems to me you'd get a heck more out of the subscription than API credits.
In the future, open models and cheaper inference could cover the loss-leading strategies we see today.
They just rolled it out for new subscribers and existing ones will be getting it in the "coming weeks." Enterprise already got hit with this from my understanding.
Day 1: 2
Day 2: 3
Day 3: 1
Not sure how I can hit such limits so quickly with such low scores on its own chart.
Pentagon: No
OpenAI: We are okay if the line is merely a suggestion and we encourage you not to cross it!
Pentagon: Yes we pick that option
That has led to a significant number of people switching over from openai, or at least stating they were going to do so.
I have cancelled my subscription last week, I'll see them when they fix this nonesense
For some context, they added 2x Palantir or .75x Shopify or .68x Adobe annual revenue in March alone.
Fwiw there are worse delays from second tier providers like moonshot's kimik2.5 that are also popular for agentic use.
Vibe coding doesn't automatically mean lower quality. My codebase quality and overall app experience has improved since I started using agents to code. You can leverage AI to test as well as write new code.
> I assume most of their outages is related to this insane scaling and lack of available compute.
>
> Vibe coding doesn't automatically mean lower quality
Scalability is a factor of smart/practical architectural decisions. Scalability doesn't happen for free and isn't emergent (the exact opposite is true) unless it is explicitly designed for. Problem is that ceding more of the decision making to the agent means that there's less intentionality in the design and likely a contributor to scaling pains.You are talking about software scaling patterns, Anthropic is running into hardware limitations because they are maxing out entire datacenters. That's not an architectural decision it's a financial gamble to front-run tens of billions in capacity ahead of demand.
> What exactly are emergent features when vibe coding?
Regression to the mean. See the other HN thread[0]The LLM has no concept of "taste" on its own.
Scalability, in particular, is a problem that goes beyond the code itself and also includes decisions that happen outside of the codebase. Infrastructure and "platform" in particular has a big impact on how to scale an application and dataset.
[0] https://dornsife.usc.edu/news/stories/ai-may-be-making-us-th...
Personally I write something like 80-90% of my code with agents now but after they finish up, it's critical that you spin up another agent to clean up the code that the first one wrote.
Looking at their code it's clear they do not do this (or do this enough). Like the main file being something like 4000 LOC with 10 different functions all jammed in the same file. And this sort of pattern is all over the place in the code.
It’s great to buy dollars for a penny, but the guy selling em is going to want to charge a dollar eventually…
Do you feel there is enough visibility and stability around the "Prompt -> API token usage" connection to make a reliable estimate as to what using the API may end up costing?
Personally, it feels like paying for Netflix based on "data usage" without having anyway for me to know ahead of time how much data any given episode or movie will end up using, because Netflix is constantly changing the quality/compression/etc on the fly.
I agree that ex ante it’s tough, and they could benefit from some mode of estimation.
Perhaps we can give tasks sizes, like T shirts? Or a group of claudes can spend the first 1M tokens assigning point values to the prospective tasks?
Take the response on another post about Claude Code.
https://news.ycombinator.com/item?id=47664442
This reads like even if you had a rough idea today about what usage might look like, a change deployed tomorrow could have a major impact on usage. And you wouldn't know it until after you were already using it.
Now we’re going to find out what these tools are really worth.
Of course, I have no idea how MS is justifying the Copilot pricing. I can't imagine any world in which it is sustainable, so I'm trying to get as much as I can out of it now before they jack up prices.
It works out even if some customers are able to eat a lot, because people on average have a certain limit. The limits of computers are much higher.
If an hour of an excellent developer's time is worth $X, isn't that the upper bound of what the AI companies can charge? If hiring a person is better value than paying for an AI, then do that.
They can charge whatever they want, I think many people like to make business decisions based on relative predictability or at least be more aware that there's a risk. If they want it to be "some weeks you have lots of usage, some weeks less, and it depends on X factors, or even random factors" then people could make a more informed choice. I think now it's basically incredibly vague and that works while it's relatively predictable, and starts to fail when it's not, for those that wanted the implied predictability.
I'm not sure how businesses budget for llm APIs, as they seem wildly unpredictable to me and super expensive, but maybe I'm missing something about it.
So I noticed the model is purposefully coming with dumb ideas or running around in circles and only when you tell it that they are trying to defraud you, they suddenly come back with a right solution.
1. Me not wanting that for context management reasons
2. It burning tokens on an expensive model.
Literally a conversation that I just had:
* ME: "Have sonnet background agent do X"
* Opus: "Agent failed, I'll do it myself"
* Me: "No, have a background agent do it"
* Opus: Proceeds to do it in the foreground
* Flips keyboard
This has completely broken my workflows. I'm stuck waiting for Opus to monitor a basic task and destroy my context.
We'll see AI chat replace Google, we'll see companies adopting AI in high-value areas, and we'll see local models like Gemma 4 get used heavily.
AI winter will see a disappearance of the clickbait headlines about everyone losing their jobs. Literally nobody is making those statements taking into account that pricing to this point is way less than the profit maximizing level.
At my workplace we have been sticking with older versions, and now stick to the stable release channel.
Is Microsoft (one of the largest companies in the world) really a victim of brand death?
prompts. tool calling quirks. evals. auth. retries. all the weird failure modes your team already paid to learn.
The rest of the organisation, which is not software development or IT related, mainly uses GPT models. I just wish I hadn't taught risk management about claude code so they weren't wasting MY tokens.
Obviously in hindsight it would be unfair to Anthropic to judge them on an unstable day so I'l leave those complaints aside but I hit the session limit way too fast. I planned out 3 tasks and it couldn't finish the first plan completely, for that implementation task it has seen a grand total of 1 build log and hasn't even run any tests which already caused it to enter in the red territory of the context circle.
It was even asking me during planning which endpoints the new feature should use to hook into the existing system, codex would never ask this and just simply look these up during planning and whenever it encounters ambiguity it would either ask straight away or put it as an open question. I have to wonder if they're limiting this behavior due trying to keep the context as small as possible and preventing even earlier session limits.
Maybe codex's limits are not sustainable in the long run and I'm very spoiled by the limits but at this point CC(sonnet) and Codex(5.4) are simply not in the same league when comparing both 20 dollar subscriptions.
I will also clearly state that the value both these tools provide at these price points are absolutely worth it, it's just that codex's value/money ratio is much better.
CC is a better implementation and seems to be fairly economic with token usage. That is the really the only defining point and, I suspect, Anthropic are going to have a lot of trouble staying relevant with all the product issues.
They were far ahead for a brief period in November/December which is driving the hype cycle that now appears to be collapsing the company.
You have to test at least every month, things are moving quickly. Stepfun is releasing soon and seems to have an Opus-level model with more efficient architecture.
One example is I have a multi-stage distillation/knowledge extraction script for taking a Discord channel and answering questions. I have a hardcoded 5k message test set where I set up 20 questions myself based on analyzing it.
In my harness Minimax wasn't even getting half of them right, whereas Sonnet was 100%. Granted this isn't code, but my usage on pi felt about the same.
What are you using to drive the Chinese models in order to evaluate this? OpenCode?
Some of Claude Code's features, like remote sessions, are far more important than the underlying model for my productivity.
CC tool usage is also significantly ahead imo (doesn't negate the price but it is something). I have seen issues with heavy thinking models (like Minimax) and client implementations with poor tool usage (like Cline).
CC has had a period over the last six months of delivering significant value...but, of course, you can just use CC with OpenRouter.
I keep coming back to it because I can run it as a manager for the smaller tasks.
Agentic workflows do have a place in well-defined, structured tasks...but I don't think that is what most people are trying to do with it.
There was constant drama with CC. Degradation, low reliability, harness conspiring against you, and etc – these things are not new. Its burgeoning popularity has only made it worse. Anthropic is always doing something to shoot themselves in the foot.
The harness does cool things, don't get me wrong. But it comes with a ton of papercuts that don't belong in a professional product.
Free and local.
Unless they meant "all code that needs to be written has already been written" so their mission is to prevent any new code from being written via a kind of a bait and switch?
I think Anthropics model has conflict of interest. They seem to have nerfed the models so that it takes more iterations to get the result (and spend more money) than it used to where e.g. Opus would get something right first time.
I’ve been toying around at home with it and I’ve been fine with its output mostly (in a Java project ofc), but I’ve run into a few consistent problems
- The thing always trips up validating its work. It consistently tries to use powershell in a WSL environment I don’t have it installed in. It also seems to struggle with relative/absolute paths when running commands.
- Pricing makes no sense to me, but Jetbrains offering seems to have its own layer of abstraction in “credits” that just seem so opaque.
Then again, I mostly use this stuff for implementing tedious utilities/features. I’m not doing entity agent written and still do a lot of hand tweaks to code, because it’s still faster to just do it myself sometimes. Mostly all from all from the IDE still.
Not worth the money now, will be canceling unless fixed soon.
Maybe you should consider....local models instead?
I doubt even the core engineers know how to begin debugging that spaghetti code.
I have no idea how people are hitting the limits so fast.
Hit the weekly limit on my 20x plan last week trying to do a full front end rewrite of a giant enterprise web app, 600+ html templates, plus validating every single one with playwright.