The models were mostly GPT-5 and Claude Sonnet 4. The study was too early to catch the 5.x Codex or Claude 4.5 models (bar one mention of Sonnet 4.5).
This is notable because a lot of academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.
This is a recurring argument which I don't understand. Doesn't it simply mean that whatever conclusion they drew was valid then? The research process is about approximating a better description of a phenomenon in order to understand it. It's not about providing a definitive answer. Being "an entire model generation" behind would matter if fundamental problems had been solved (e.g. no more hallucinations), but if it's just incremental change, then most likely the conclusions remain correct. Which fundamental change (I don't think labeling newer models as "better" is sufficient) do you believe invalidates their conclusions in this specific context?
Just the jump from Sonnet 3.5 to 3.7 to 4.5, and now Opus 4.5, has been pretty massive in terms of holistic reasoning, deep knowledge, and better procedural and architectural adherence.
GPT-5 Pro convinced me to pay $200/mo for an OpenAI subscription. The regular 5.2 models, and 5.2 Codex, are leagues better than GPT-4 when it comes to solving problems procedurally, using tools, and discussing scientific, mathematical, philosophical, and engineering problems in depth.
Models have increasingly long context windows, especially some Google models. OpenAI has released very good image models, and great editing-focused image models have appeared across the board. Predictably better multimodal inference is unlocking many cool near-term possibilities.
Additionally, we have seen some incredible open source and open weight models released this year. Some fully commercially viable without restriction. And more and more smaller TTS/STT projects are in active development, with a few notable releases this year.
Honestly, the landscape at the end of the year is impressive. There has been great work all over the place, almost too much to keep up with. I'm very interested in the Genie models and a few others.
For an idea:
At the beginning of the year, I was mildly successful at getting coding models to make changes in some of my codebases, but the more esoteric problems were out of reach. Progress in general was deliberate and required a lot of manual intervention.
By comparison, in the last week I've prototyped six applications, each at a level that would have taken me days to weeks on my own, often developing multiple at the same time: monitoring agentic workflows and intervening only when necessary, relying on long preproduction phases with architectural discussions and development of documentation, requirements, SDDs... and on detailed code review and refactoring processes to ensure adherence to constraints. I'm morphing from a very busy solo developer into a very busy product manager.
I don't really agree. Aside from how it handled frontend code, changes in Sonnet did not truly impact my overall productivity (from Sonnet 3.7 to 4 to 4.5; I did not try 3.5). Opus 4.5/Codex 5.2 are when the changes truly happened for me (and I'm still a bit distrustful of Codex 5.2, but I use it basically to help me during PRs).
I don't doubt that the models have got better, but you can go back two or three years and find people saying the exact same stuff about the latest models back then.
Results are getting worse and less accurate, hell, I even had Claude drop some Chinese into a response out of the blue one day.
We now have extremely large context windows, we now have memory, we now have recall, we now can put an agent to the task for 24 hours.
Certainly some scientists are just absurdly efficient, and maybe all 28 involved teams were too, but that's still a lot.
Personally speaking, this gives me second thoughts about their dedication to truly accurately measuring something as notoriously tricky as corporate SWE performance. Any number of cut corners in a novel & empirical study like this would be hard to notice from the final product, especially for casual readers…TBH, the clickbait title doesn’t help either!
I don’t have a specific critique on why 4 months is definitely too short to do it right tho. Just vibe-reviewing, I guess ;)
Off your intuition, do you think the same study with Codex 5.2 and Opus 4.5 would see even better results?
If people are really set in their ways, maybe they won't try anything beyond what old models can do, and won't notice a difference, but who's had time to get set in their ways with this stuff?
It’s still hit or miss. The product “worked” when I tested it as a black box, but the code had a lot of rot in it already.
Maybe that stuff no longer matters. Maybe it does. Time will tell.
It's fine if you want Claude to design your APIs without any input, but you'll have less control, and when you dig down into the weeds you'll realise it's created a mess.
I like to take both a top-down and bottom-up approach - design the low level API with Claude fleshing out how it's supposed to work, then design the high level functionality, and then tell it to stop implementing when it hits a problem reconciling the two and the lower level API needs revision.
At least for things I'd like to stand the test of time; if it's just a throwaway script or tool I care much less, as long as it gets the job done.
The latest models and harnesses can crunch on difficult problems for hours at a time and get to working solutions. Nothing could do that back in ~March.
I shared some examples in this comment: https://news.ycombinator.com/item?id=46436885
Every single example you gave is in hobby-project territory: relatively self-contained, maintainable by 3-4 devs max, within 1k-10k lines of code. I've been successfully using coding agents to create such projects for the past year and it's great, I love it.
However, lots of us here work on codebases that are 100x, 1000x the size of these projects you and Karpathy are talking about. Years of domain specific code. From personal experience, coding agents simply don't work at that scale the same way they do for hobby projects. Over the past year or two, I did not see any significant improvement from any of the newest models.
Building a slightly bigger hobby project is not even close to making these agents work at industrial scale.
The problem is that everyone working on those more serious projects knows that and treats LLMs accordingly, but the people that come from the web space come in with the expectation that they can replicate the success they have in their domain just as easily, when oftentimes you need to have some domain knowledge.
I think the difference simply comes down to the sheer volume of training material, i.e. web projects on github. Most "engineers" are actually just framework consumers and within those frameworks llms work great.
Again, that codebase is millions of lines of Python code, and frankly the agents weren't as good then as they are now. I carefully used globbing rules in Cursor to surface coding and testing standards. I had a rule that functioned the way people use agents.md now, which was put on every prompt. That honestly got me a lot more mileage than you'd think. A lot of what you get out of these tools comes down to how you use them and how good your developer experience is. If professional software engineers have to think hard about how to navigate and iterate on different parts of your code, then an LLM will find it doubly difficult.
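For a rough idea, a scoped rule in today's Cursor rules format might look something like this (file name and contents are illustrative, not my actual rules):

    # .cursor/rules/python-tests.mdc
    ---
    description: Testing standards for Python code
    globs: tests/**/*.py
    alwaysApply: false
    ---
    - Use pytest, not unittest; prefer fixtures over setUp methods.
    - Every new module gets a matching test file.

The always-on rule (the agents.md stand-in) is the same idea with alwaysApply: true and no globs, so it rides along on every prompt.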
Those crunching hard problems will still review what's produced in search of issues.
(This helps me understand better the people who are confused/annoyed/dismissive about it, because I remember how dismissive people were about Node, about Docker, about Postgres, about Linux when those things were new too. So many arguments where people would passionately insist all those things were irredeemably stupid and only suitable for toy/hobby projects.)
It takes about 6 months to figure out how to get LaTeX to position figures where you want them, and then another 6 months to fight with reviewers.
I've seen people unable to work at average speed on small features suddenly reach above-average output through an LLM CLI, and I could sense the pride in them. Which is at odds with my experience of work... I love to dig down, know a lot, model and find abstractions on my own. There, an LLM will 1) not understand how my brain works, 2) produce something workable but that requires me to stretch mentally... and most of the time I leave numb. In the last month I've seen many people expressing similar views.
ps: thanks everybody for the answers, interesting to read your pov
I am a well above-average engineer when it comes to speed at completing work well, whether that's typing speed or comprehension speed, and still these tools have felt like giving me a jetpack for my mind. I can get things done in weeks that would have taken me months before, and that opens up space to consider new areas I wouldn't even have bothered exploring before, because I would not have had the time to execute on them well.
1. I do love getting into the details of code, but I don't mind having an LLM handle boilerplate.
2. There isn't a binary between having an LLM generate all the code and writing it all myself.
3. I still do most of the design work because LLMs often make questionable design decisions.
4. Sometimes I simply want a program to solve a problem (outcome-focused) over a project to work on (craft-focused). Sometimes I need a small program in order to focus on the larger project, and being able to delegate that work has made it more enjoyable.
My usual thought is that boilerplate tells me, by existing, where the system is most flawed.
I do like the idea of having a tool that quickly patches the problem while also forcing me to think about its presence.
> There isn't a binary between having an LLM generate all the code and writing it all myself. I still do most of the design work because LLMs often make questionable design decisions.
One workflow that makes sense to me is to have the LLM commit on a branch; fix simple issues myself instead of trying to get the LLM to do it (with all the worry of context poisoning); refactor on the same branch; merge; and then repeat for the next feature, starting more or less from scratch except for the agent config (CLAUDE.md etc.).
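Concretely (branch names and commit messages here are just illustrative), what I picture is something like:

    # fresh branch per feature; the agent works and commits here
    git switch -c feature/foo
    # ...agent session runs, committing as it goes...
    # simple fixes and refactors by hand, on the same branch
    git add -p && git commit -m "fix: edge case in foo"
    # fold into main, then start the next feature clean
    git switch main && git merge --squash feature/foo
    git commit -m "feat: foo"

Does that sound about right? Maybe you do something less formal?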
> Sometimes I simply want a program to solve a problem (outcome-focused) over a project to work on (craft-focused). Sometimes I need a small program in order to focus on the larger project, and being able to delegate that work has made it more enjoyable.
Yeah, that sounds about right.
I've realized that a lot of my coding sits on this personal-satisfaction-vs-utility matrix, and LLMs let me focus a lot more energy on high-satisfaction projects.
As a (self-reported) craft-and-decomposition lover, I wouldn't call the process "fast".
Certainly it's much faster than if I were trying to take the same approach without the same skills; and certainly I could slow it down with over-engineering. (And "deep" absolutely fits.) But the people I've known that I'd characterize as strongly "outcome-only", were certainly capable of sustaining some pretty high delta-LoC per day.
At least in the olden days[1] you could write code for days before compiling, which reduced the pain. Long compilation times have always been awful, but they are less frustrating when you can defer them until the next blue moon. LLMs don't (yet) seem to be able to handle that. If you feed them more than small amounts of code at a time, they quickly go off the rails.
With that said, while you could write large amounts of code and defer compilation until the next blue moon, being able to do that is a skill. Even in C++, juniors seem to like to write a few lines of code and then compile the results to make sure they are on the right track. I expect that is the group of people who feel most at home with LLMs: spending a few minutes writing code and then waiting on compilation isn't abnormal for them.
But presumably the tooling will improve with time.
Businesses are currently willing to accept that lack of productivity as an investment into figuring out how to tame the tools. There is a lot of hope that all the problems can be solved if we keep trying to solve them. And, in fairness, we have gotten a lot closer than we were just a year or so ago towards that end, so the optimism currently remains strong. However, that cannot go on forever. At some point the investment has to prove itself, else the plug will be pulled.
And yes, it may ultimately be a dead end. Absolutely. It wouldn't be the first failure in software development.
As a manager/tech-lead, I've kind of been a tech priest for some time.
With no gossip, rivalry or backstabbing. Super polite and patient, which is very inspiring.
We're also brutally churning them, "laying off" the previous latest model once the new latest is available.
Flow is effortless, and it is rejuvenating.
I believe:
While communication can be satisfying, it’s not as rejuvenating as resting in our own Being and simply allowing the action to unfold without mental contraction.
Flow states.
When the right level of challenge and capability align, you become intimate with the problem. The boundaries between me and the problem dissolve, and creativity springs forth. I emerge satisfied. Nourished.
Thinking is a skill that is reinforced by reading, designing and writing code. When you outsource your thinking to an LLM your ability to think doesn’t magically improve…it degrades.
And if you let the AI too loose, as when you try to vibe-code an entirely new program, you end up in the situation where in 1 day you have a good prototype, and then you can easily spend 5 times as long sorting out the many issues and refactoring in order to have it scale to the next features.
Long iteration cycles are taxing
But it does feel less fulfilling I suppose.
That is, unquestionably, how it ought to be. However, the mainstream – regrettably – has devolved into a well-worn and intellectually stagnant trajectory, wherein senior developers are not merely encouraged but expected to abandon coding altogether, ascending instead into roles such as engineering managers (no offence – good engineering managers are important; it is the quality that has been diluted across the board), platform overseers (a new term for stage gatekeepers), or so-called solution architects (the ones who are imbued with compliance and governance and do not venture out past that).
In this model, none of these roles is expected – and in some lamentable cases, each is explicitly forbidden[0] – to engage directly with code. The result is a sterile detachment from the very systems they are charged with overseeing.
Worse still, the industry actively incentivises ill-considered career leaps – for instance, elevating a developer with limited engineering depth into the position of a solution designer or architect. The outcome is as predictable as it is corrosive: individuals who can neither design nor architect.
The number of organisations in which expert-level coding proficiency remains the norm at senior or very senior levels has dwindled substantially over the past couple of decades – job ads explicitly call out management experience and knowledge of architectural frameworks of vacuous or limited usefulness (TOGAF and the like). There do remain rare islands in an ever-expanding ocean of managerial abstraction where architects who write code, not incessantly but when need be, are still recognised as invaluable. Yet their presence is scarce.
The lamentable state of affairs has led to a piquant situation on the job market. In recent years, headhunters have started complaining about being unable to find an actually highly proficient, experienced, and, most importantly, technical architect. One's loss is another one's gain, or at least an opportunity, of course.
[0] Speaking from firsthand experience of watching a solution architect quit their job to run a bakery (yes) because the head of architecture they reported to explicitly demanded the architect quit coding. The architect did quit, albeit in a different way.
Strongly suspect this is simply less efficient than doing it yourself if you have enough expertise.
> Number of Survey Respondents
> Building apps 53
> Testing 1
I think this sums up everybody's complaints about AI-generated code: don't ask me to be the one to review work you didn't even check.
We're in the midst of another abstraction level becoming the working layer - and that's not a small layer jump but a jump to a completely different plane. And I think once again, we'll benefit from getting tools that help us specify the high level concepts we intend, and ways to enforce that the generated code is correct - not necessarily fast or efficient but at least correct - same as compilers do. And this lift is happening on a much more accelerated timeline.
The problem of ensuring correctness of the generated code across all the layers we're now skipping is going to be the crux of how we manage to leverage LLM/agentic coding.
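One partial version of that enforcement exists today: write the spec as executable properties and let a checker hammer the generated code. A minimal sketch using Python's hypothesis library, where the rle_encode/rle_decode pair is a hypothetical stand-in for LLM-generated code we don't fully trust:

    from hypothesis import given, strategies as st

    # Pretend these two functions were LLM-generated; we only trust the spec.
    def rle_encode(s: str) -> list:
        runs = []
        for ch in s:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def rle_decode(runs: list) -> str:
        return "".join(ch * n for ch, n in runs)

    # The spec layer: encoding then decoding must be the identity on any text.
    @given(st.text())
    def test_roundtrip(s):
        assert rle_decode(rle_encode(s)) == s

It's not a compiler-grade guarantee, but it's the same shape: we state intent at a higher level and mechanically check the layer below.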
Maybe Cursor is Turbo Pascal.
I think open source is the single most important productivity boost to our industry that's ever existed. Automated testing is a close second.
Google, Facebook, many others would not have existed without open source to build on.
And those giants and others like them that were enabled by open source employed a TON of people, at competitive rates that greatly increased our salaries.
I've seen VB namedropped frequently, but I feel like I've yet to see a proper discussion of why it seems like nothing can match its productivity and ease of use for simple desktop apps. Like, what even is the modern approach for a simple GUI program? Is Electron really the best we can do?
MS Access is another retro classic of sorts that, despite having a lot of flaws, has never really been replaced; it seems like nothing has risen to fill its niche other than SaaS webapps like Airtable.
This is a nice video on why Electron is the best you might be able to do.
My memory of it is very fuzzy, but I recall VB being literally drag-and-drop, and yet still being able to make... well, acceptable UIs. I was able to figure it out just fine in middle school.
In comparison, here's Electron's getting started page: https://www.electronjs.org/docs/latest/ The "quick start" is two different languages across three different files. The amount of technologies and buzzwords flying around is crazy, HTML, JS, CSS, Electron, Node, DOM, Chromium, random `charset` and `http-equiv` boilerplate... I have to imagine it'd be rather demoralizing as a beginner. I think there's a large group of "nontechnical" users out there (usually derided by us tech bros as "Excel programmers" or such) that can perfectly understand the actual logic of programming, but are put off by the amount of buzzwords and moving parts involved, and I don't blame them at all.
(And sure, don't want to go in too hard on the nostalgia. 2000s software was full of buzzwords and insane syntax too, we've improved a lot. But it had some upsides.)
It just feels like we lost the plot at some point when we're all using GUI-based computers, but there's no simple, singular, default path to making a desktop GUI app anymore on... any, I think, of the popular desktop OSes?
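Probably the closest thing to a default left standing is Python's built-in tkinter (one file, standard library only, no build step), though it is nobody's idea of drag-and-drop:

    import tkinter as tk

    # A complete desktop app: one file, no toolchain, no dependencies.
    root = tk.Tk()
    root.title("Hello")
    tk.Label(root, text="Hello, world").pack(padx=20, pady=10)
    tk.Button(root, text="Quit", command=root.destroy).pack(pady=10)
    root.mainloop()

Compare that with Electron's two languages across three files and it's clear how much of the VB experience we gave up.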
If that were true, wouldn't we all be using VB today?
everything old is new again
What one can make with VB6 (final release in 1998) is very far from what one can make with modern stacks. (My efficiency at building LEGO structures is unbelievable! I put the real civil engineers to shame.)
Perhaps you mean that you can go from idea to working (in the world and expectations of 1998) very quickly. If so, that probably felt awesome. But we live in 2025. Would you reach for VB6 now? How much credit does VB6 deserve? Also think about how 1998 was a simpler time, with lower expectations in many ways.
Will I grant advantages to certain aspects of VB6? Sure. Could some lessons be applicable today? Probably. But just like historians say, don't make the mistake of ignoring context when you compare things from different eras.
With AI, at least locally, I'm seeing the opposite now - less hiring, less wage pressure, and in social circles a lot less status when I mention I'm a SWE (almost sympathy for my lot, versus respect only 5 years ago). While I don't care about the status aspect (though I do care about my ability to earn money), some do.
At least locally, inflation-adjusted SWE wages in my city bought more, and were higher relative to other professions, in the '90s-2000s than onwards (ex big tech). Partly because the difficulty and low-level knowledge required meant only very skilled people could participate.
I mean, this seems like a pretty big thing to leave out, no? That's where all the crazy high salaries were!
Also, there are still legacy places that more or less build software like it's 1999. I get the impression that embedded, automotive, and such still rely a lot on proprietary tools, finicky manual processes, low level languages (obviously), etc. But those are notorious for being annoying and not very well paid.
Making software easier and more abstract has allowed less technical people into the profession, allowed easier outsourcing, meant more competition/interview prep to filter out people (even if the skills are not used in the job at all), more material for AI to train on, etc. To the parent comment's point I don't think it has boosted salaries and/or conditions on average for the SWE - in the long run (10 years +) it could be argued that economically the opposite has occurred.
Otherwise people would have realized they can charge 3x as much by being 5x as productive with better tools while you're writing your code in notepad for maximum ROI, and you would have either adjusted or gone out of business.
Increased productivity isn't a choice, it's a result of competition. And that's a good thing overall, even if it sucks for some developers who now have to actually work for the first time in decades. But it's good for society at large, because more things can be done.
Also, your notion of "better tools" may not have happened, or happened more slowly, without open source, AI, etc., which would most probably have meant higher salaries for longer. That's where I disagree with the parent poster's claim of higher salaries - AI seems to be a great recent example of "better tools" disrupting the premium SWEs enjoy rather than improving their salaries. Whether that's fair or not is a different debate.
I was just doubting the parent comment's notion that "open source software" and "automated testing" create higher salaries. Economically, efficiency usually (with some exceptional cases) creates lower salaries for the people who are made more efficient, all else being equal, and the value shifts from them to either consumers or employers.
I'd generally agree with that where safety is concerned (e.g. industrial control systems), but we manage that by certifying the manufacturer, not the individual developer. Otherwise I think it's harmful to society, even if beneficial to the individuals; but there are a lot of things falling in that bucket, and they're usually not the things we strive for at a societal level.
In my experience, getting better and faster has always translated into being paid more. I don't know that there's a direct relationship to specific tools, but I'm pretty sure that the mainstreaming of software development has caused the huge inflation of total comp that you see in many companies. If it was slow and there's only this handful of people that can do it, but they're not really adding a huge amount of value, you wouldn't be seeing that kind of multiplier vs the average job.
Should we be trying to put the genie back in the bottle? If not, what exactly are you suggesting?
Even if we all agreed to stop using AI tools today, what about the rest of world? Will everybody agree to stop using it? Do you think that is even a remote possibility?
Software Devs not so much.
There is a huge difference between the two and they are not interchangeable.
Your take is this meme https://knowyourmeme.com/memes/dig-the-fucking-hole.
Hence why, in the same thread, you have some developer who claims that Claude writes 99% of their code and another developer who finds it totally useless. And of course others who are somewhere in the middle.
It's fine to be a skeptic. Or to have tried out these tools and found that they do not work well for your particular use case at this moment in time. But you shouldn't assume that people who do get value out of them are not as good at the job as you are, or are dumber than you are, or slower than you are. That's just not a good practice and is also rude.
I'm just saying that since there is such a wide range of experiences with the same tools, it's probably likely that developers vary on their evaluations of the output.
Too many people are invested in AI's success to have a balanced conversation. Things will return to normal after a market shakeout of a few larger AI companies.
Owning the infrastructure and enshittifying it (ads) once enough products are based on AI.
It's the same chokehold Amazon has on its vendors.
This has nothing to do with deployment. I never talked about deployment.
I. Don't. Care.
I don't even care about those debates outside. Debates about whether LLMs work and will replace programmers? Say they do; OK, so what?
I simply have too much fun programming. I am just a mere fullstack business-line programmer, a generic, random, replaceable dude; you can find us a dime a dozen.
I do use LLMs as a Stack Overflow/docs replacement, but I always write all my code by hand.
If you want to replace me, replace me. I'll go to companies that need me. If there are no companies that need my skill, fine, then I'll just do this as a hobby, and probably flip burgers outside to make a living.
I don't care about your LLM, I don't care about your agent, I probably don't even care about the job prospects for that matter if I have to be forced to use tools that I don't like and to use workflows I don't like. You can go ahead find others who are willing to do it for you.
As for me, I simply have too much fun programming. Now if you excuse me, I need to go have fun.
I'd at least be more likely to get a boost in impact and ability to affect decision making, maybe.
(1) already have enough money to survive without working, or
(2) don't realize how hard of a life it would be to "flip burgers" to make a living in 2026.
We live very good lives as software developers. Don't be a fool and think you could just "flip burgers" and be fine.
I also did dry cleaning, cleaning service, deli, delivery guy, etc.
Yup I now have enough money to survive without working.
But I also am very low maintenance, thanks to my early life being raised in harsh conditions.
I am not scared to go back flipping burgers again.
it's the part where you don't have to work that matters
or something like that
not sure why so many people feel like factoring fun into what job you want to take is so unthinkable, or that it's just a false dichotomy between the ideal job and unemployment
Like I said, I am just a generic replaceable dime a dozen programmer dude.
a job isn't supposed to be fun; it's nice when it is, but it shouldn't be what drives decisions
I meant it can be your (not necessarily your employer's) driving decision in life.
Of course, you'd need to suffer for it. That's what having tradeoffs means.
you can definitely choose not to participate and give the opportunity to someone who is happy to use AI and still have fun with it.
but that doesn't mean you can't (or shouldn't) work around it
how do you imagine such a conversation playing out? i'm curious
in a past job I did tell a boss that I wasn't going to be doing the whole tickets/estimates/schedule tetris thing, and that actually worked out... because the leaders I worked with understood the value of being flexible and trusting their lead engineers
It could even have been picked up in pretraining and then rewarded during RLHF when the output domain was being refined; I haven't used enough LLMs before post-training to know at what step it usually becomes noticeable.
I'm in the back-and-forth camp. I expect a lot of interesting UX to develop here. I built https://github.com/backnotprop/plannotator over the weekend to give me a better way to review & collaborate around plans - all while natively integrated into the coding-agent harness.
"I’m on disability, but agents let me code again and be more productive than ever (in a 25+ year career). - S22"
Once the Social Security Administration learns this, there goes the disability benefit...
I've seen this with code generation tools - developers who treat AI suggestions as magic often struggle when the output doesn't work or introduces subtle bugs. The professionals who succeed are those who understand what the AI is doing, validate the output rigorously, and maintain clear mental models of their system.
This becomes especially important for code quality and technical debt. If you're just accepting AI-generated code without understanding architectural implications, you're building a maintenance nightmare. Control means being able to reason about tradeoffs, not just getting something that "works" in the moment.
Out of curiosity, if I wanted to set up cscope for a bunch of small projects, say dozens of prototypes each in their own directory, would it be useful? Too broad?
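(For concreteness, the setup I have in mind is roughly this, run once per prototype directory; these are standard cscope flags:)

    # list the sources, then build an indexed database
    find . -name '*.c' -o -name '*.h' > cscope.files
    cscope -b -q -k
    # -b: build the database only, -q: extra index for fast lookups,
    # -k: "kernel mode", don't index /usr/include

One database per directory keeps queries scoped; a single combined database over all dozens of prototypes seems like it would mostly return noise.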
As with every new tech there's a hell of a lot of noise (plugins, skills, hooks, MCP, LSP - to quote Karpathy) but most of it can just be disregarded. No one is "behind" - it's all very easy to use.
So essentially what this means is a declarative system for programming overall system behavior.
Do it in the way that makes you feel happy, or conforms to organizational standards.
Well
There are many contexts in which programming a computer well is not important.
Not a statistically significant sample size.