"In the METR study, developers predicted AI would make them 24% faster before starting. After finishing 19% slower, they still believed they'd been 20% faster."
I hadn't heard of this study before. It seems like it's been mentioned on HN before but didn't get much traction.
Most people who cite it clearly didn't read as far as the table where METR themselves say:
> We do not provide evidence that:
> 1) AI systems do not currently speed up many or most software developers. Clarification: We do not claim that our developers or repositories represent a majority or plurality of software development work
> 2) AI systems do not speed up individuals or groups in domains other than software development. Clarification: We only study software development
> 3) AI systems in the near future will not speed up developers in our exact setting. Clarification: Progress is difficult to predict, and there has been substantial AI progress over the past five years [3]
> 4) There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting. Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
Their study still shows something interesting, and quite surprising. But if you choose to extrapolate from this specific setting and say coding assistants don't work in general, that's not scientific, and you need to be careful.
I think the study should probably decrease your prior that AI assistants actually speed up development, even if developers using AI tell you otherwise. The fact that it feels faster when it is actually slower is super interesting.
Being armed with that knowledge is useful when thinking about my own productivity, as I know that there's a risk of me over-estimating the impact of this stuff.
But then I look at https://github.com/simonw which currently lists 530 commits over 46 repositories for the month of December, which is the month I started using Opus 4.5 in Claude Code. That looks pretty credible to me!
Simon - you are an outlier in the sense that basically your job is to play with LLMs. You don't have stakeholders with requirements that they themselves don't understand, you don't have to go to meetings, deal with a team, shout at people, do PRs etc., etc. The whole SDLC/process of SWE is compressed for you.
I liked the way they did that study and I would be interested to see an updated version with new tools.
I'm not particularly sceptical myself and my guess is that using Opus 4.5 would probably have produced a different result to the one in the original study.
It's surprising that it manages the majority of the test cases but not all of them. That's not a very human-like result. I would expect humans to be bimodal with some people getting stuck earlier and the rest completing everything. Fractal intelligence strikes again I guess?
Do you think the way you specified the task at such a high level made it easier for Claude? I would probably have tried to be much more specific, for example by translating on a file-by-file or function-by-function basis. But I've no idea if this is a good approach. I'm really tempted to try this now! Very inspiring.
Absolutely. The trick I've found works best for these longer tasks is to give it an existing test suite and a goal to get those tests to pass, see also: https://simonwillison.net/2025/Dec/15/porting-justhtml/
In this case ripping off the MicroQuickJS test suite was the big unlock.
I have a WebAssembly runtime demo I need to publish where I used the WebAssembly specification itself, which it turns out has a comprehensive test suite built in as well.
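To make the pattern concrete, the loop is roughly the following. This is just a sketch: the pytest invocation and the ask_agent helper are placeholder stand-ins for however you actually drive the agent, not the exact commands used.

    import subprocess

    def run_tests() -> str:
        """Run the existing test suite against the port; return failure output, or '' if green."""
        result = subprocess.run(
            ["pytest", "tests/", "-q"],  # placeholder: point this at the borrowed test suite
            capture_output=True, text=True,
        )
        return "" if result.returncode == 0 else result.stdout + result.stderr

    def ask_agent(prompt: str) -> None:
        """Hypothetical stand-in for whatever coding agent you use
        (Claude Code, Codex CLI, ...). Here it just prints the prompt."""
        print(prompt)

    for _ in range(10):  # cap the number of agent rounds
        failures = run_tests()
        if not failures:
            print("All tests pass - the port is done.")
            break
        ask_agent("Make these tests pass without changing the tests:\n" + failures)

The point is that the borrowed test suite, not the agent's own judgement, defines "done".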
It (along with the hundreds of billions in investments hinging on it) explains the legions of people online who passionately defend their "system". Every gambler has a "system", and they usually earnestly believe it is helping them.
Some people even write popular (and profitable!) blogs about playing slot machines where they share their tips and tricks.
It reminds me of global warming, where on one side of the debate there were some scientists with very little money running experiments, and on the other side there were some ridiculously wealthy corporations publicly poking holes in those experiments while secretly knowing, since the 1960s, that they were valid.
1. There are bajillions of dollars in incentives for a study declaring "Insane Improvements", so we should expect a bunch to finish being funded, launched, and released... Yet we don't see many.
2. There is comparatively no money (and little fame) behind a study saying "This Is Hot Air", so even a few seem significant.
The worst part is reading a PR and catching a reintroduced bug that was fixed a few commits ago. The first time, I almost lost my cool at work and said a negative thing to a coworker.
This would be my advice to juniors (and I mean, basically, devs who don't yet understand the underlying business/architecture): use the AI to explain how stuff works, and maybe generate basic functions, but write code logic/algorithms yourself until you are sure you understand what you're doing and why. Work and reflect on the data structures yourself, even if they were generated by the AI, and ask for alternatives. Always ask for alternatives; it helps understanding. You might not see huge productivity gains from AI, but you will improve first, and then productivity will improve very fast: from your brain first, then from AI.
Losing your cool is never a good idea, but this is absolutely a time when you should give negative feedback to that coworker.
Feedback is what reviews are for; in this case, this aspect of the feedback should neither be positive nor neutral.
Now the question is:
is the AI providing solutions smarter than the developer using it might have produced?
And perhaps more importantly, how much time does it take for the AI to write the code and for a human to debug it, even if both are producing equally smart solutions?
* Force the AI to write tests for everything. Ensure those tests function. Writing boring unit tests used to be arduous. Now the machine can do it for you. There's no excuse for a code regression making its way into a PR, because you actually ran the tests before you did the commit, right? Right? RIGHT?
* Force the AI to write documentation and properly comment code, then (this is the tricky part) you actually read what it said it was doing and ensure that this is what you wanted it to do before you commit.
Just doing these two things will vastly improve the quality and prevent most of the dumb regressions that are common with AI generated code. Even if you're too busy/lazy to read every line of code the AI outputs just ensuring that it passes the tests and that the comments/docs describe the behavior you asked for will get you 90% of the way there.
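If you want to make the "run the tests before you commit" part non-optional, a git pre-commit hook works. A minimal sketch, assuming a pytest suite (save it as .git/hooks/pre-commit and mark it executable); swap in whatever test runner your project actually uses:

    #!/usr/bin/env python3
    """Minimal pre-commit hook: refuse the commit if the test suite fails."""
    import subprocess
    import sys

    result = subprocess.run(["pytest", "-q"])  # assumption: tests run via pytest
    if result.returncode != 0:
        print("Tests failed - commit aborted. Fix the regression before committing.")
        sys.exit(1)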
The irony is that when the company laid him off due to COVID, the actual velocity of the team increased.
I agree with the idea, I do it too, but you need to make sure the tests don't just validate the incorrect behavior, and that the code is not updated to pass the tests in a way that actually "misses the point".
I've had this happen to me on one or two tests every time
To give some further advice to juniors: if somebody is telling you writing unit tests is boring, they haven’t learned how to write good tests. There appears to be a large intersection between devs who think testing is a dull task and devs who see a self proclaimed speed up from AI. I don’t think this is a coincidence.
Writing useful tests is just as important as writing app code, and should be reviewed with equal scrutiny.
For some reason Gemini seems to be worse at it than Claude lately. Since mostly moving to Gemini 3, I've had it go back and change the tests rather than fixing the bug on what seems like a regular basis. It's like it's gotten smart enough to "cheat" more. You really do still have to pay attention that the tests are valid.
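One cheap guard, just a sketch of the idea rather than anything built into the tools, is to check whether the agent's working-tree diff touches test files at all before you trust a green run:

    import subprocess

    def changed_test_files() -> list[str]:
        """List files changed relative to HEAD that look like tests."""
        diff = subprocess.run(
            ["git", "diff", "--name-only", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return [
            path for path in diff.stdout.splitlines()
            if "test" in path.lower()  # crude heuristic; adjust to your repo layout
        ]

    touched = changed_test_files()
    if touched:
        print("The agent modified test files - review these before trusting the green run:")
        for path in touched:
            print("  ", path)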
I think for users this _feels_ incredibly powerful, however this also has its own pitfalls: Any topic which you're incompetent at is one which you're also unequipped to successfully review.
I think there are some other productivity pitfalls for LLMs:
- Employees use it to give their boss emails / summaries / etc. in the language and style their boss wants. This makes their boss happy, but doesn't actually change productivity at all, since the exercise was a waste of time in the first place.
- Employees send more emails, and summarize more emails. They look busier, but they're not actually writing the emails or really reading them. The email volume has increased, however the emails themselves were probably a waste of time in the first place.
- There is more work to review all around and much of it is of poor quality.
I think these issues play a smaller part than some of the general issues raised (eg: poor quality code / lack of code reviews / etc.) but are still worth noting.
This is the average software developer's experience of LLMs
This is completely orthogonal to productivity gains for full time professional developers.
However in my experience, the issue with AI is the potential hidden cost down the road. We either have to:
1. Code review the AI-generated code line by line, at the time it is generated, to ensure it's exactly what you'd have produced yourself, or
2. Pay an unknown amount of tech debt down the road when it inevitably wasn't what you'd have done yourself and isn't extensible, scalable, well-written code.
To get an accurate productivity metric you’d have to somehow quantify the debt and “interest” vs some alternative. I don’t think that’s possible to do, so we’re probably just going to keep digging deeper.
Have you considered having AI code review the AI code before handing it off to a human? I've been experimenting with having Claude work on some code and commit it, and then having Codex review the changes in the most recent git commit, then eyeballing the recommendations and either having Codex work the changes, or giving them back to Claude. That has seemed to be quite effective so far.
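Concretely, the handoff can be as simple as pulling the last commit out of git and turning it into a review prompt for the second model. A sketch; the final step is just printed here because the actual reviewer invocation depends on your setup:

    import subprocess

    def last_commit() -> str:
        """Grab the most recent commit (message + patch) that the first agent produced."""
        result = subprocess.run(
            ["git", "show", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    def review_prompt(diff: str) -> str:
        return (
            "Review this commit for bugs, regressions, and tests that were "
            "weakened just to pass. Reply with a list of concrete issues:\n\n" + diff
        )

    # Pipe this into whichever reviewing agent you use (Codex, another Claude
    # session, etc.); here it is only printed.
    print(review_prompt(last_commit()))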
Maybe it's turtles all the way down?
Garage Duo can out-compete corporate because there is less overhead. But Garage Duo can't possibly match corporate's sheer output of work.
I think that the reason LLMs don't work as well in a corporate environment with large codebases and complex business logic, but do work well in greenfield projects, is linked to the amount of context the agents can maintain.
Many types of corporate overhead can be reduced using an LLM. Especially following "well meant but inefficient" process around JIRA tickets, testing evidence, code review, documentation etc.
There have been methods to reduce overhead available throughout the history of our industry. Unfortunately, almost every time, it involves using productive tools that would in some way reduce the head count required to do large projects.
The way this works is that you eventually have to work with languages like Lisp, Perl, or Prolog, and then someone comes up with a theory that programming must be optimised mostly for beginners and power tooling must be avoided. Now you are forced to use verbose languages, where writing, maintaining and troubleshooting take a lot of people.
The thing is, this time around we have a way to make code by asking an AI tool questions. So you get the same effect, but now with languages like JS and Python.
AI won't give you much productivity if the problem you're challenged with is the human problem. That could happen both to startups and enterprises.
The job of anyone developing an application framework, whether that's off the shelf or in-house, is to reduce the amount of boilerplate any individual developer needs to write to an absolute bare minimum. The ultimate win isn't to get "AI to write all your boilerplate." It's to not need to write boilerplate at all.
Expect to see more “replace rather than repair” projects springing up
Complex legacy refactoring + Systems with poor documentation or unusual patterns + Architectural decisions requiring deep context: These go hand in hand. LLMs are really good at pulling these older systems apart, documenting, then refactoring them, tests and all. Exacerbated by poor documentation of domain expectations. Get your experts in a room weekly and record their rambling ideas and history of the system. Synthesize with an LLM against existing codebase. You'll get to 80% system comprehension in a matter of months.
Novel problem-solving with high stakes: This is the true bottleneck, and where engineers can shine. Risk assessment and recombination of ideas, with rapid prototyping.
Force the LLM to follow a workflow, have it do TDD, use task lists, have it write implementation plans.
LLMs are great coders but subpar developers; help them be a good developer and you will see massive returns.
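As one concrete version of "have it do TDD": write (or at least read and approve) the failing tests yourself first, then hand implementation to the model with the rule that the test file is off limits. A sketch, where slug.py and slugify are made-up names for whatever the agent is being asked to build:

    # test_slug.py - the human-owned spec. The agent's job is to create slug.py
    # so these pass, without editing this file. (slugify and its behaviour are
    # hypothetical examples, not anyone's real project.)
    from slug import slugify  # fails until the agent writes the module; that's the point

    def test_basic():
        assert slugify("Hello, World!") == "hello-world"

    def test_collapses_whitespace_and_punctuation():
        assert slugify("  A --- B  ") == "a-b"

    def test_empty_input():
        assert slugify("") == ""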
I did not get the impression from this that LLMs were great coders. They would frequently miss stuff, make mistakes and often just ignore the instructions I gave them.
Sometimes they would get it right, but not often enough. The agentic coding loop still slowed me down overall. Perhaps if I were more junior it would have been a net boost.
If you go the pure subjective route, I’ve found that people conflate “speed” or “productivity” with “ease.”
I don’t think I can do the approach justice here, but the short version is that they have the developer estimate how long a change will take, then randomly assign that task to be completed with AI or normally and measure how long it actually takes. Afterwards, they compare the differences in the ratios of estimates to actuals.
This gets around the problem of developer estimates being inaccurate by comparing ratios.
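A toy version of the arithmetic, with made-up numbers purely to show the mechanics (these are not the study's data): take actual/estimate per task, then compare the typical ratio in the AI-allowed group against the control group, which cancels out any shared estimation bias.

    from statistics import geometric_mean

    # Hypothetical (estimate_hours, actual_hours) pairs - NOT real data from the study.
    with_ai    = [(2.0, 2.6), (1.0, 1.3), (4.0, 4.5), (3.0, 3.8)]
    without_ai = [(2.0, 2.1), (1.0, 1.1), (4.0, 4.2), (3.0, 3.1)]

    def typical_overrun(tasks):
        """Geometric mean of actual/estimate: >1 means tasks ran over their estimates."""
        return geometric_mean([actual / estimate for estimate, actual in tasks])

    ai = typical_overrun(with_ai)
    control = typical_overrun(without_ai)

    # Both groups share the same estimation bias, so the ratio of the two ratios
    # isolates the effect of using AI on completion time.
    print(f"With AI, tasks took about {ai / control:.2f}x as long as without.")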
We have a lot of useless work being done, and AI is absolutely going to be a 10x speed up for this kind of work.
- corporate
WHY CANT OUR DEVICES RUN TECHNOLOGIES ??????
- also corporate
In programming we've often embraced spending time to learn new tools. The AI tools are just another set of tools, and they're rapidly changing as well.
I've been experimenting seriously with the tools for ~3 years now, and I'm still learning a lot about their use. Just this past weekend I started using a whole new workflow, and it one-shotted building a PWA that implements a fully-featured calorie tracking app (social features, pre-populating foods from online databases, weight tracking and graphing, avatars); it's on par with many I've used in the past that cost $30+/year.
Someone just starting out at chat.openai.com isn't going to get close to this. You absolutely have to spend time learning the tooling for it to be at all effective.
edit: a lot of articles like this have been popping up recently to say "LLMs aren't as good as we hyped them up to be, but they still increase developer productivity by 10-15%".
I think that is a big lie.
I do not think LLMs have been shown to increase developer productivity in any capacity.
Frankly, I think LLMs drastically degrade developer performance.
LLMs make people stupider.
A program is a series of instructions that tell a computer how to perform a task. The specifics of the language aren't as important as the ability to use them to get the machine to perform the tasks instructed.
We can now use English as that language, which allows more people than ever to program. English isn't as expressive as Python wielded by an expert, yet. It will be. This is bad for people who used to leverage the difficulty of the task to their own advantage, but good for everyone else.
Also, keep in mind that today's LLMs are the worst they'll ever be. They will continue to improve, and you will stagnate if you don't learn to use the new tools effectively.
I've been going the other way, learning the old tools, the old algorithms. Specifically teaching myself graphics and mastering the C language. Tons of new grads know how to use Unity, how many know how to throw triangles directly onto the GPU at the theoretical limit of performance? Not many!
Understanding a "deeper" abstraction layer is almost always to your advantage, even if you seldom use it in your career. It just gives you a glimpse behind the curtain.
That said, you have to also learn the new tools unless you tend to be a one man band. You'll find that employers don't want esoteric knowledge or all-knowing wizards who can see the matrix. Mostly, they just want a team member who can cooperate with other folks to get things done in whatever tool they can find enough skilled folks to use.
This is the first technology in my career where the promoters feel the need to threaten everyone who expresses any sort of criticism, skepticism, or experience to the contrary.
It is very odd. I do not care for it.
this hostile marketing scheme is the reason for my hostile opposition to LLMs and LLM idiots.
LLMs do not make you smarter or a more effective developer.
You are a sucker if you buy into the hype.
Have you considered a career in plumbing? Their technology moves at a much slower rate and does not require you to learn new things.
There's a debate to be had about what any given new technology is good for and how to use it because they all market themselves as the best thing since sliced bread. Fine. I use Sonnet all the time as a research tool, it's kind of great. I've also tried lots of stuff that doesn't work.
But the attitude towards everyone who isn't an AI MAXIMALIST does not persuade anyone or contribute to this debate in any useful way.
Anyway if I get kicked out of the industry for being a heretic I think I'll go open an Italian restaurant. That could be fun.
Fair enough. It's reasonable to debate it, and I'll agree that it's almost certainly overhyped at the moment.
That said, folks like the GP who say that "LLMs do not make you smarter or a more effective developer" are just plain wrong. They've either never used a decent one, or have never learned to use one effectively and they're blaming the tool instead of learning.
I know people with ZERO programming experience who have produced working code that they use every day. They literally went from 0% effective to 100% effective. Arguing that it didn't happen for them (and the thousands of others just like them) is just factually incorrect. It's not even debatable to anyone who is being honest with themselves.
It's fair to say that if you're already a senior dev it doesn't make you super-dev™, but I doubt anyone is claiming that. For "real devs" they're claiming relatively modest improvements, and those are very real.
> Anyway if I get kicked out of the industry for being a heretic I think I'll go open an Italian restaurant.
I doubt anyone will kick you out for having a differing opinion. They'll more likely kick you out for being less productive than the folks who learned to use the new tools effectively.
Either way, the world can always use another Italian restaurant, or another plumber. :)
If they can't, did they really do it in the first place?
Are they actually literate in the programming languages they're using?
Here's where our opinions differ - I think replacing that Figma person with AI prompts will negatively affect the product in a way that is noticeable to the end user and affects their experience.
It does of course depend what kind of product you're making, but I'd say most of the time this holds.
To give but one example, effectively all of the >$300B mobile app market. Or all enterprise software that can't run on Electron. Or any company that cares about image/branding across their products, which is every single company past a certain size (and don't come at me with "but hot AI startup uses Shadcn and are valued at X trillion").
Could people write scientific code without python? If they can't, did they really do it in the first place?
Could people write code without use after free bugs without using a GC'd language? If they can't, did they really do it in the first place?
Could people make a website without WYSIWYG editor? If they can't, did they really make a website?
I think graduates of these programs are far, far worse software developers than they were in the recent past.
edit: I think you mean "irrelevant", not "irreverent". That being said, my response is an expansion of the point made in my comment that you replied to.
But this subthread is about interns who did not study CS, and are able to create advanced UIs using LLMs in the short time they had left to finish their project.
That being said, I half agree but I think we see things differently. Based on what I've seen, the "illiterate" are those who would have otherwise dropped out or done a poor job previously. Now instead of exiting the field, or slowly shipping code they didn't understand (because that has always been a thing) they are shovelling more slop.
That's a problem, but it's at most gotten worse rather than come out of thin air.
But, there are still competent software engineers and I have seen with my own eyes how AI usage makes them more productive.
Similarly, some of those "illiterate" are those who now have the ability to make small apps for themselves to solve a problem they would not be able to before, and I argue that's a good thing.
Ultimately, people care about the solution to their problems, not the code. If (following the original anecdote) someone with an LLM can build a UI for their project, I frankly don't think it matters whether they understood the code. The UI is there, it works, and they can get on with the thing that is actually important: using the UI for their bigger goal.
Would you agree that LLMs make developers stupider?
edit: answer my question
Looking at the brief history of their account, I don't think anything they are saying or asking is in remotely good faith.
As a comment reader this exchange with Simon translates directly to "no, but you have forced me to try and misdirect because I can't reply in good faith to an expert who has forgotten more about LLMs than I'll ever know".
just write the code
Developers can exist in a small team, solo, or a large enterprise, all with their own mandates and cultures, so just saying LLMs increase/decrease productivity is reductive.
Have a feeling I'm being trolled tho.
I think LLM addicts are particularly susceptible to flattery.
There are a lot of sad people who have developed parasocial relationships with ChatGPT, but that's entirely a separate issue from whether agents are a good tool for software engineering.
They don't emerge looking credible, either.
I ran a three month experiment with two of our projects, one Django and the other embedded C and ARM assembler. You start with "oh wow, that's cool!" and not too long after that you end up in hell. I used both ChatGPT and Cursor for this.
The only way to use LLMs effectively was to carefully select small chunks of code to work on, have it write the code, and then manually integrate it into the codebase after carefully checking it and ensuring it didn't want to destroy 10 other files. In other words, use a very tight leash.
I'm about to run a six month LLM experiment now. This time it will be Verilog FPGA code (starting with an existing project). We'll see how that goes.
My conclusion at this instant in time is that LLMs are useful if you are knowledgeable and capable in the domain they are being applied to. If you are not, shit show potential is high.