The gauge broke: devs felt 20% faster with AI, measured 19% slower (intrepidkarthi.com)

70 points by intrepidkarthi 5 hours ago | 27 comments

ianbutler4 hours ago
2025 is such old news that this just isn't relevant.
METR already redid the study at a later date and now finds a likely 18% speedup
"For the subset of the original developers who participated in the later study, we now estimate a speedup of -18% with a confidence interval between -38% and +9%" (note their use of - and + here could be slightly confusing but they do mean 18% faster per the post)
https://metr.org/blog/2026-02-24-uplift-update/
- dofm4 hours ago
  Their followup study essentially says the followup study itself is possibly broken because developers will now not participate in some of the non-AI tasks and because the study pays less.
  I would not, at all, suggest that this second study corrects or debunks the first.
  Instead what it shows (if anything, i.e. if you can even put aside the regrettable choice to change the payment level, which affects applicant recruitment) is that the mindset shift has already happened: developers now don’t want to attempt some tasks without AI.
  What that tells you is not (with any confidence at least) that they are faster, but perhaps that we are beyond the point that this can be meaningfully measured. AI could still be making developers slower, but developers aren’t going to be willing or perhaps able to help you find out.
  Basically the job is different now.
  What this does for me, perhaps, is vindicate my feelings. I can do agentic coding; I have learned the principles and some tools and I could learn more. But if this study is really reflective of how other developers feel now, I am done.
  - keeda2 hours ago
    The original study itself had at least one developer who later revealed that he had filtered out tasks he preferred not to do without AI: https://xcancel.com/ruben_bloom/status/1943536052037390531 -- given the N was 16, and he seems to have been one of the more AI-experienced devs, and we don't know if the other devs did this, the results of the first study itself could be questioned.
    dofm2 hours ago
    I am not at all suggesting the first study is good, or that I believe its conclusions.
    (Or that the failure of the second study validates the conclusions of the first.)
    I am just saying that people here who think the second study overturned, debunked or corrected the findings of the first are explicitly wrong, because even its authors admit it is a broken study.
    It would take a non-broken study to do that, and it may not actually be possible anymore, which is perhaps the most useful finding of the second study.
- torginus4 hours ago
  Either way, it's not a dramatic improvement. Thankfully I work in an environment with little bureaucracy, so my time is actually spent doing technical work.
  I do think AI has been a huge boon to productivity in many ways, but looking at feature timelines, I think it's pretty clear the 'critical shortest path' of key features hasn't been sped up by that much.
- js84 hours ago
  They also say "Wider adoption of AI has made it more difficult to measure task-level productivity"
  I think there is a simple reason for that. If you automate something, you make the measurable/predictable thing faster. So the hard to measure/predict part of the job will take more share of the time, and overall difficulty to measure/predict goes up.
  I think this is what happened with Agile Scrum - as developers became more productive (for unrelated reasons, two main sources of SW developer productivity before AI were compilers and open source), the bureaucracy (amount of meetings) increased, because the ratio of hard to measure vs easy to measure went up. Bureaucracy is hard to measure, so it went up (as a share of work). I expect this to only get worse with more automation, such as AI. So I predict an increase in share of bureaucracy compared to pre-AI world.
  Either way, IMHO main point is automation has the opposite effect on human job predictability, it lowers it. Tasks we can easily automate are those that are easy to predict.
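The share argument above can be put as a small Amdahl-style sketch. The function name and the 70/30 split are illustrative assumptions, not figures from METR or from the comment; the only claim is the arithmetic: speed up the measurable part and the hard-to-measure part takes a larger share of what remains.

```python
def hard_share(measurable_fraction, speedup):
    """Share of total time spent on hard-to-measure work after automation.

    measurable_fraction: share of time that was easy to measure before (0..1).
    speedup: factor by which only the measurable part gets faster (> 0).
    Both inputs are hypothetical; nothing here comes from a real dataset.
    """
    if not 0.0 <= measurable_fraction <= 1.0:
        raise ValueError("measurable_fraction must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    hard = 1.0 - measurable_fraction          # unchanged by automation
    easy = measurable_fraction / speedup      # shrinks as automation improves
    return hard / (hard + easy)


if __name__ == "__main__":
    # Assumed starting point: 70% of the job is easy to measure.
    for factor in (1.0, 2.0, 10.0):
        print(factor, round(hard_share(0.7, factor), 3))
```

Under these assumptions the hard-to-measure share only grows as the speedup grows, which is the point being made about predictability.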
  - nikau4 hours ago
    I've held this stance on agile for a long time - it coincided with mainstream adoption of ssds, windows with memory protection and google search - all of which sped delivery despite agile, not because of agile.
- verzali4 hours ago
  That post literally says the results are unreliable...
  - lars5124 hours ago
    ...and in particular it says that one of the reasons is that developers are refusing to participate in the non-AI branch, and when they do, changing what tasks they select to those where AI would be less useful.
    Overall this suggests to them that the current speedup is likely greater than what the study could measure.
    dofm4 hours ago
    It might suggest that but they can’t back it up because the study is broken and perhaps forever unrepeatable.
    Like, what people are saying is, “That old study was wrong! They did a new broken study that overturned it!”
- krige4 hours ago
  Even not touching the laughable sample size for both studies - almost halved sample size between 2025 and 2026? Sounds like a massive selection bias, and not in the way they're implying.
- markbao4 hours ago
  And to be specific, the METR study was using the Cursor harness with Claude Sonnet 3.5/3.7, along with other models of that era of the participant’s choosing.
  Which is ancient at this point, and half a year older than the November 2025 inflection point when agentic coding got really good.
  The original article is from August 2025, and the overall message to not trust ‘how it feels’ and rather measure outcomes seems right to me despite the outdated figures. On my team at least, we are seeing a noticeable inflection in work shipped with AI according to Weave.
loveparade4 hours ago
These studies are meaningless because speedup is heavily dependent on the kind of work you're doing. No doubt that you can do mechanical refactors 100x faster with AI, and also no doubt that using AI will be slower for tasks where it's less about writing code and more about context/world knowledge or building understanding. Averaging across these tasks doesn't make sense because everyone's work consists of a different distribution of tasks.
A frontend dev doing tailwind integration for his day job is gonna see very different speedups than someone working in a niche scientific codebase. Taking the average makes about as much sense as taking the average of the speedup from calculators for a mathematician, a farmer, and an elementary school student.
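A rough sketch of why the average hides so much: overall speedup is a time-weighted combination of per-task speedups, so the same per-task numbers give different results for different task mixes. The 100x figure is the one from the comment above; the 0.8x figure and both task mixes are made-up assumptions for illustration only.

```python
def overall_speedup(task_shares, task_speedups):
    """Overall speedup for one developer.

    task_shares: fraction of pre-AI time spent on each task type (sums to 1).
    task_speedups: per-task speedup factor (values below 1 mean slower).
    """
    total = sum(task_shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("task shares must sum to 1")
    # New total time, relative to an old total time of 1.
    new_time = sum(share / task_speedups[task] for task, share in task_shares.items())
    return 1.0 / new_time


# Hypothetical per-task speedups: 100x from the comment, 0.8x is an assumption.
SPEEDUPS = {"mechanical_refactor": 100.0, "context_heavy": 0.8}

# Hypothetical task mixes for two different jobs.
frontend_mix = {"mechanical_refactor": 0.6, "context_heavy": 0.4}
scientific_mix = {"mechanical_refactor": 0.05, "context_heavy": 0.95}

if __name__ == "__main__":
    print(overall_speedup(frontend_mix, SPEEDUPS))
    print(overall_speedup(scientific_mix, SPEEDUPS))
```

With these invented numbers one developer comes out faster overall and the other slower, even though the tool behaves identically per task.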
- Cthulhu_3 hours ago
  And even if the refactor is 100x faster (which is unlikely) - would that work have been done at all without AI? What would the effect be on total speed / throughput if it was or was not done? These things get more and more difficult to measure.
  What I'm really wondering is how many extra tasks are being done that wouldn't have been done without AI, and whether those actually have a payoff. That is, 100x faster versus not doing the work at all.
  I'm hoping some enterprises that collect metrics on e.g. time-to-market, customer satisfaction, revenue, costs, etc will release an authoritative report some day.
- rob744 hours ago
  To add to that, the only way to reliably measure speedup would be to give the same developer the same task twice, first without AI, then with AI, and the developer should have no previous knowledge (or the same level of previous knowledge) about the task each time he starts - which is inherently impossible. I didn't read the study, but from the article it looks like they compared the actual speed to prior estimates, and we all know how reliable those are?
- Zababa3 hours ago
  >and also no doubt that using AI will be slower for tasks where it's less about writing code and more about context/world knowledge or building understanding
  This isn't true in my experience, AI is great at gathering context through slack, repositories, emails, web pages. For building understanding too, provided you use it well.
- dakolli4 hours ago
  I have found llms to be utterly useless for frontend (tailwind included).
  That is, unless you're building a single page app/landing page that is the typical center column with a hero and below that a 3x3 feature grid with those same 3 colors that all the sloppers show off.
  I'm not a frontend dev, but these statements are starting to get outright disrespectful to those that are. Do you people understand how much "world", customer and product knowledge is required to design and implement great UX/UI?
  I promise you are not going to be able to translate all this internalized understanding to an LLM and have it do your "tailwind integration". It actually sucks at all frontend outside of the 3 types of page layouts it understands: shitty landing pages, generic dashboards and shitty blog layouts.
  Y'all yearn for slop though so maybe everything will just become shit anyways.
  - loveparade4 hours ago
    Fair point, I was more trying to make a statement about the amount of training data available, not the "difficulty" of the task. I just used Tailwind as an example because it is so ubiquitous with so much training data for LLMs to learn from, while any niche application doesn't have that.
    Zababa2 hours ago
    Training data has stopped being a good predictor of LLM abilities ever since they started doing heavy RL runs. I'm not sure how much corporate dashboards/I can't believe it's not excel stuff were in the training data, I guess not that many considering that stuff is almost always corporate and kept inside companies, and LLMs are still great at it, good enough to make people that used excel and used it well daily for 10+ years stop using it for lots of stuff.
  - Zababa2 hours ago
    This isn't my experience at all, LLMs do graphs and more complex excel-like web pages very well. They also do dashboards very well. They even seem to do 3D stuff with three.js like video games pretty well too, although I haven't tried that myself. Maybe they can't do something "great", but they sure can do good enough in most cases, and good enough is already better than most websites.
    >I'm not a frontend dev, but these statements are starting to get outright disrespectful to those that are.
    Agree here, especially considering that usually "niche scientific codebases" have terrible code so you don't need a super smart model to get a good boost in software engineering.
  - weitendorf3 hours ago
    I ran into this with my first attempt at building a static site generator "for agents" last year. https://statue.dev
    I got very frustrated with LLMs and their inability to apply good taste or maintain consistent design languages, and put the project on ice. But I decided to double down on more tooling and learn as much about frontend as I could because I also realized that frontend itself - the problem domain, the engineering culture (or lack thereof), the historical baggage, the sheer size of the frontend api/language surface was part of the problem. And also there was/is a lack of good LLM and agent-oriented tooling that was a much deeper problem than I expected initially.
    I originally thought I would just create skills/workflows and apis for generating sites from templates, but the problem is moreso that you need an entirely different kind of harness and development process for frontend, which doesn't really exist yet. Claude design is probably the most familiar gesture in that direction for most people but I think it's only scratching the surface. Our own "agentic playwright" is https://github.com/accretional/chromerpc/tree/main/chrome-pr... - IMO this kind of tool (both ours and Claude Design) is a major win for removing the largest, most frustrating frontend LLM painpoints (having a human doing QA and prodding the model to fix obviously-wrong outputs).
    But the bigger problem is that the webdev tooling ecosystem is FUCKING AWFUL, and there are too many different ways to do something even using the actual base browser apis, let alone all the random ass low-quality tools and cargoculting that seeps into the models' way of working and thinking. That's not to say that tools like React are bad, necessarily, but that there is so much pre-LLM slop and churn and low quality/inconsistent work in the frontend ecosystem that you really need to be MUCH more knowledgeable about the way browsers and the web actually work than the median frontend developer (especially the ones participating in the endless hype flavor of the months, generating all the noise that defines the engineering culture) to effectively use them. Or even better, if you know enough you can also NOT reach for them because you're able to just implement it via raw html/css/browser primitives instead of through 2000 node packages.
    To be clear, I'm not saying frontend development is slop, but that it has a very high skill ceiling and requires a lot of very particular/thorny knowledge to be good at. I think the reason AI frontend looks so much like slop is that it hasn't been RLed against the actual web-standards in a way that lets it learn how to actually build good sites, it just has the median frontend engineer archetype from its pretraining and then some kind of RLVR to get it to produce workable, not-fucked-up code (the 3x3 grid, the slop hero, the unnecessary blinking green buttons, etc.). And also, for LLMs, maybe engaging with the webdev tool ecosystem beyond the core infrastructure layer and base apis/languages is more trouble than it's worth, because they often optimize for "I want a particular kind of UX and don't know how to implement it directly, but I do know how to find a package and call it, then prod it into working".
    LLMs need something more like a browser-harness, a meta-design system, per-design-language component management tooling, and a non-slop build system. They also generally need much better support/more sophisticated UX for hierarchical iframes, CSP, etc. which is a space that is not very well-explored despite its potential, because most frontend devs find it too hard or complicated.
    People are already starting to build these and I think we'll get there in the next year. The hardest piece of the puzzle is figuring out how to structure RL training envs to learn frontend directly against web standards, because web standards are very complex and high-surface; but this is also the most promising because it's how you get Mythos-like superhuman performance. We have a project to build some of the base domain modeling/search tooling needed for frontend RL, eg https://github.com/accretional/proto-css, but it's early days. You should definitely try agentic browser tooling if you haven't yet because it makes a huge difference in getting existing LLMs to be more effective at frontend, and automating most of the debugging. It's what allows us to eg fully automate creating gifs of models interacting with our site in the context of a user journey when we run tests: https://github.com/accretional/proto-css/tree/main/chrome-te...
shaky-carrousel4 hours ago
There was a study that people using the keyboard instead of the mouse felt they were working faster but in fact they were working slower. A perception thing. Users were more engaged when using a keyboard.
- shawabawa34 hours ago
  I'm convinced this is what causes people to feel productive with vim
  - 100ms4 hours ago
    Triggered by both of these comments.. interaction mode dictates a style of thinking. I have to use a mouse, I'm forced to use my eyes, which also means I probably have to use a massive screen. I have to pay attention to some hyperactive Intellisense-like feature, I'm forced to remove my attention from the problem.
    It's like saying you're convinced people reporting they feel more productive in a mauve-coloured room are liars, or those that drive automatic vs manual. Maybe they just find mauve a restful colour?
    weitendorf2 hours ago
    I think most people who strongly identify with tools like vim do so out of a sense of identity-building to "be the kind of developer who is good at vim" / embody some kind of aesthetic or in-group signal moreso than an actual desire to be more effective at getting work done.
    As long as you don't have some kind of stochastic or >5s impediment taking you out of a state of flow, most developers' productivity is going to be vastly more influenced by their knowledge, understanding, and ability to focus on the problem they are working on than the marginal difference in time it requires to perform some navigation or editing task. Which is not to say that vim is bad or that you shouldn't use it, but that it's just a text editor and if you get triggered by someone not liking it or thinking it's more trouble than it's worth, it might be worth taking a step back and thinking about why it's something that triggers an emotional/defensive response, rather than the kind of reaction you'd have to someone liking strawberry more than vanilla.
    100ms2 hours ago
    > I think most people who strongly identify with tools like vim do so out of a sense of identity-building to "be the kind of developer who is good at vim" / embody some kind of aesthetic or in-group signal moreso than an actual desire to be more effective at getting work done.
    This is the exact same sweeping inferential leap as the original comment. I happen to think people who drive red cars do so only because they want to incite a sense of danger and potency in their road opponents, people who wear boots obviously want to identify with Ukrainians on the front lines and any claims it helps with their flat feet are obvious rubbish.
    Tooling and language obsession is boring and borderline offensive to anyone who has been around for a few years. Imagine walking into someone's workplace and demanding they replace their well worn chair, would you do it? Imagine insisting someone use vim because their IDE didn't have a natural pipe-through-shell-command function.
    Avshalom3 hours ago
    >> I'm forced to use my eyes, which also means I probably have to use a massive screen. I have to pay attention to some hyperactive Intellisense-like feature
    What the hell are you talking about?
    100ms2 hours ago
    I can quite easily (and often do) use a basic editor while staring at the wall. I've yet to use an IDE where there wasn't some idiotic race between keystrokes and whatever random latency language server just told it to insert parens or a newline after you already typed them, assuming the text is even visible on a 13" screen buried in sidebars and "essential" extensions. They're full attention tools which is a completely different mode of work than is otherwise possible.
    It's not to say an IDE isn't a useful thing, they just have their place like anything else. I personally find autocompletion useful for a couple of weeks going into a new language or project after which it's very often more a distraction than a productivity enhancer. Same goes with e.g. Git integration. I wouldn't presume to say a Git integration user simply needs to learn Git in much the same way I wouldn't expect someone to tell me that I can't use Git just because I don't use the IDE Git integration. They're just tools
  - dakolli4 hours ago
    people that use vim motions/shortcuts/keyboard workflows are more productive, this is undeniable..
    Shitty-kitty4 hours ago
    Vim makes some slow and incredibly tedious tasks, fast and efficient. Having said that, all those key-presses to switch modes do add up.
    Rekindle80904 hours ago
    [dead]
  - globular-toast3 hours ago
    Do you have evidence to show they are not, in fact, more productive? It's easy to find things where a mouse is comically slow compared to vim. But any kind of religious adherence to tools isn't going to be the most efficient way. Like with just about anything you need to find a balance. An extremely large set of tools is just as bad as an extremely small set.
- weitendorf2 hours ago
  There are multiple developer subcultures (nominally for productivity but mostly hobbyists) pretty much exclusively motivated by installing and configuring complex, visually-dense, high-learning-curve tooling and editor setups driven by the same psychology.
  Besides the cultural association between keyboard navigation and complex tools with being a 1337 h4xx0r, I think there is something to be said about the process of tinkering with and learning how one's own tools work, or more generally experimenting with new, "interesting" ways of working than the default choice (which is around where AI was at the time of this study), and being more engaged and thus more knowledgeable about one's own work or problem domain, even if the overhead ends up being a poor investment time-wise upfront. Personally, if it took me 20% longer to accomplish something but I understood it 95%, vs 75% if I had done it the "fast way", I would almost always take the 20% latency hit, with the expectation that more knowledge/exposure to different tools and techniques would have much better ROI over time than marginally faster delivery.
  There's a certain kind of developer (much moreso the kind all-in on AI in early 2025) who thinks that AI is really smart and knowledgeable and assigns a high degree of confidence/deference to its responses, the same way you might to a venerated subject matter expert or wikipedia/stackoverflow/google search result. To this person involving the AI lends more credibility/confidence in their work and their own understanding of it (vs if they just uncritically copied code off stackoverflow). Better understanding this kind of user made me realize that the quality signals and mental models people build around productivity can vary immensely even within the same profession or team. Productivity-hacking is a lot more about vibes and identity-construction/tribal affiliations than most people would like to admit.
- Cthulhu_3 hours ago
  I'm sure this is the case; in both cases, the amount of "things happening" is higher (keystrokes, perceived actions per second, tokens used, words generated by LLM while it's "thinking", back and forth between LLM and user, etc) - but does it translate to measurable end results? tickets closed, features delivered, money earned vs cost (personnel + AI + hosting + other costs), etc.
  As always this is hard to measure, or will take a while longer yet to draw conclusions from.
jdkoeck4 hours ago
> The honest counter, and it matters here more than usual. This is most likely the dip in a J-curve, not the destination.
Oh, the irony of this post being AI-generated.
- titanomachy4 hours ago
  I don’t understand how these things still write so annoyingly. Eliminating just a handful of tells would make a big difference.
michaelt4 hours ago
One thing I've noticed with generative AI is it's now easier than ever to write more lines of code.
Before, a backend guy asked to add an intranet page would make an austere page: bare html with barely any styling or javascript. Today, the same guy given the same task can turn in something with styling, javascript, internationalisation, interactive form validation, progress spinner, minification build stage, linting, maybe even automated browser tests.
And I have to code review it. Now the bottleneck of writing the code has been removed, I now find code review is the bottleneck - and a bottleneck facing much higher flow must either let more through, or start applying back pressure.
Sometimes I think an evil genie granted my wish for better tested code by trying to drown me in it.
- globular-toast3 hours ago
  I mean they could have turned in one of the countless HTML "boilerplate" projects before LLMs too. It hasn't been necessary to start completely from scratch for quite a while now. I'm surprised any professional web developer wouldn't maintain their own boilerplate as templates that they can quickly roll off the production line. Tools like cookiecutter[0] have been available for a long time. Sometimes I feel like LLMs are just allowing people to somewhat catch up to where others were decades ago.
  [0] https://github.com/cookiecutter/cookiecutter
onion2k4 hours ago
Generation got cheap. Verification got expensive.
That proves AI is capable of doing one part of the software engineering process. The 16 devs in the study trusted AI to write the code. Once we trust AI to do the verification as well we'll realise the gains we feel we're getting now. Essentially we're intentionally going slower on the second half because the trust is missing.
Alternatively, rather than trusting AI to do the validation, we could follow the vibe-coder approach by skipping the validation entirely, and trust that the generation stage is good enough not to need it. Historically that's come with some small downsides, like the code being a broken mess of security holes, but with time AI might fix that.
- beaugunderson3 hours ago
  > Once we trust AI to do the verification as well we'll realise the gains we feel we're getting now.
  I built a UAT agent on top of claude-agent-sdk, it uses Playwright and can spin up a preview instance for PRs we open. It uses its knowledge of the code to create a test plan, runs that test plan, and takes screenshots as it does so. On a recent PR I made a change to our MFA implementation and assumed that I would need to test it myself since MFA requires setup (and a TOTP device!) but no, the agent drove Playwright to change the instance configuration to enable MFA, created a user with MFA enabled, wrote the tiniest (handful of lines) Python TOTP and used that as the device and successfully verified that MFA still worked as expected. I did not expect that ability, and I wrote it! It was also exhaustive in its probe of functionality, way more than I had planned to do, even though I had planned to do a lot given the critical nature of MFA.
  Total token price for the UAT run: $5.69.
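For anyone wondering how a "handful of lines" Python TOTP device is possible: below is a minimal sketch of standard TOTP (RFC 6238, built on HOTP from RFC 4226) using only the standard library. It is not the code the agent wrote, which isn't shown in the comment; the function name and defaults (SHA-1, 30-second step, 6 digits) are assumptions matching the common authenticator-app setup.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Return the TOTP code for a base32 secret at Unix time `at` (default: now)."""
    key = base64.b32decode(secret_b32, casefold=True)
    # Number of `step`-second intervals since the Unix epoch.
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): low nibble of the last byte picks the offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # Shared secret from the RFC 6238 test vectors (ASCII "12345678901234567890").
    demo_secret = base64.b32encode(b"12345678901234567890").decode()
    print(totp(demo_secret))
```

A test harness would feed the same base32 secret it registered during MFA setup into `totp` and submit the result as the second factor.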
  - weitendorf2 hours ago
    I built a very similar tool recently mentioned elsewhere in these comments. I think with the current state of LLMs, harnesses, and related tooling, being able to create or setup self-eval tooling is the biggest differentiator between merely using LLMs to write code vs realizing true 10x productivity wins.
    I'm curious whether this is something LLMs are eventually going to be good enough at doing, or something the average developer knows when and how to do, or if this is going to be something that's too specialized or difficult for most developers and maybe the next generation of developer tooling products. Now that we're several months into Claude Code crossing the threshold of legitimacy and adoption, I've been surprised at how few projects or developers are doing this yet.
    To a certain extent now all you need to do is ask Claude Code for browser automation workflows and CUJ tests in your repo, and ye shall receive, but probably something that just uses base playwright. It would be even better if you could ask to install or use a self-eval tool that already did everything you needed it to do and also knew how to specify/setup automations. I'm assuming the level of agency or mental overhead of embarking on a browser automation side quest is beyond what most developers are used to in the course of their regular work, even though it's not really as hard as it sounds now. If so then self-eval tooling could be a very promising new product category to sell to enterprises.
    BTW if you have a link to your project I'd be interested in checking it out! $5.69 for a UAT run sounds very high to me based on how many tokens it typically takes for agents to create automations or steer my similar project, but it could be that your test workloads are much more exhaustive or high-dimensionality than mine are. This is what a basic "go to amazon.com and search for a product, then take a screenshot" automation looks like for ours: https://github.com/accretional/chromerpc/blob/main/recipes/s.... And this is our interactive/dynamic remote steering mode: https://github.com/accretional/chromerpc/blob/main/chrome-pr.... I decided to implement against the Chrome Devtools Protocol (one layer under Playwright) and use grpc service reflection to allow agents to dynamically discover/describe the entire chrome devtools api surface. I just started working on a way to gather traces and monitor/manage the automation run internals because I think there's a ton of opportunity in this problem space for orchestration and RL
- hnlmorg4 hours ago
  I wouldn’t even trust experienced developers to merge code without peer review.
  - onion2k4 hours ago
    I would, but that's mostly because I don't trust PRs to catch real problems. Someone reviewing the changes in a codebase is never going to spot an architectural or code design issue, and those are the real ones you need to care about. In my experience 95% of everything that's caught by a human PR review could have been caught by a linter or a formatter before the PR was opened.
    If you trust your team to care about quality then PRs aren't necessary, and if you don't then why are you trusting them to catch problems in PR reviews?
    IanCal3 hours ago
    Outside of the benefit of some extra documentation around changes and having more than one person see what’s happened there are a few main safety parts here:
    * It can be easier to see a problem in something that another person wrote, you’re not clouded by what you intended to write
    * different skill sets mean two sets of eyes broadens the kind of problems that can be found (benefits from selecting good reviewers, maybe I’m tagged because I know the llm APIs and performances better by someone who has done refactoring to improve, say, internal caching that’s more their thing)
    * just chance. There’s some chance you spot an issue, some chance someone else does. Combined it’s better.
    * I disagree they can’t find architectural or design issues. You see repeated changes of the same kind, or tying together things that shouldn’t be, etc.
    But yes, many things could be caught before opening. Lots I catch as I’m explaining the change, like rubber ducking. I quite like AI code reviews things for this, there’s a whole back and forth that can be avoided once you get past basic linting/test level things. Save the human time for understanding the higher level issues.
  - roenxi4 hours ago
    It is getting to the point where I wouldn't trust experienced developers to merge code without AI review. The latest generation of models are getting pretty good.
    IanCal3 hours ago
    Definitely for the time and cost too.
    If a dev costs $1k/day that’s $30 to spend 15 minutes looking at a pr. How much review does that get you in tokens? I’d wager you’d easily find lots of low level issues that are beyond basic linting with that reliably, and I feel like latest gen models can do really good work.
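The back-of-envelope figure above can be checked in a few lines. The 8-hour working day is an assumption (the comment doesn't state one), and the function name is made up for the sketch.

```python
def review_cost(day_rate, minutes, hours_per_day=8.0):
    """Cost of `minutes` of a developer's time, given a daily rate.

    hours_per_day is an assumed working-day length, not a figure from the thread.
    """
    per_minute = day_rate / (hours_per_day * 60.0)
    return per_minute * minutes


if __name__ == "__main__":
    # $1k/day and a 15-minute review, as in the comment.
    print(review_cost(1000.0, 15.0))
```

With an 8-hour day this comes to $31.25, so "$30" is a fair rounding; a longer assumed day lowers the per-review cost.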
- alfalfasprout4 hours ago
  small downsides like security holes? Those aren't small. Neither is creating a codebase that's an inextensible mess that even LLMs begin to struggle with.
  The reality is making good decisions and thinking about approaches take time. AI can absolutely make us faster at it but it's not magic and these speedups come with effort.
  - onion2k4 hours ago
    > small downsides like security holes? Those aren't small
    I'm British. I've been taught to turn understatement into an art.
xg154 hours ago
> AI speeds up typing, which was never the bottleneck for an expert in a codebase they already know.
For me as a dev, that's not the whole truth. Where I've found actual value in AI (and I think where some of that "perceived speedup" is coming from) is looking up things.
Unless you know the codebase and used libraries extremely well, you will have to do lots of "micro-lookups" during coding, where you have to find the specific APIs or library functions for your problem, then figure out how exactly you have to call them, how to handle the result, etc. That's lots of "research" work interleaved with actually writing the code.
AIs seem to be good enough to have a lot of that knowledge already baked into their weights, at least for popular platforms, so if you prompt it for something, you can skip all that low-level lookup work or at least defer it until code review. Even during review, it's easier, because you don't have to come up with the appropriate library function from scratch; you only have to verify that the ones the AI used make sense and are used correctly.
orphereus4 hours ago
As I read the blog post, I thought that it was released today. Maybe point out that it is almost a year old. It feels like it is manipulating HN users.
And this is coming from an AI sceptic.
sigmoid104 hours ago
If you linked to the actual source of the study [1] instead of a random blog that only talks about the result, you would see the big banner that the authors put there noting that the study is horribly outdated. Current models do make developers faster.
[1] https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
wewewedxfgdf4 hours ago
My two bosses are anti-AI.
Whenever I tell them about how awesome AI is, they come back with stories about how they used AI and it couldn't even do anything basic and what it did do had errors.
People will always create a world narrative that matches what they already believe.
Anti AI people are always quoting these "facts" about how AI reduces productivity even when developers feel it increases productivity - it reinforces their world view.
- Almondsetat4 hours ago
  >feel
  Productivity is not a feeling though. Either you show an increased productivity or it doesn't exist
  - IanCal3 hours ago
    It is however extremely hard to measure accurately with software engineering and every easy measurement immediately draws ire from devs here.
    Cthulhu_2 hours ago
    Yes and no. You can measure things:
    * features delivered
    * time-to-market (idea -> production)
    * code quality metrics
    * bugs found / bugs fixed
    * tickets opened / closed (and with what resolution) / month
    * revenue vs expenses
    * (if you must) commits / month, code changes / month
    But few software developers seem to do that because it's overhead.
  - Shitty-kitty4 hours ago
    Problem is, all the old metrics are now obsolete and nobody knows how to measure productivity anymore.
  - dymk4 hours ago
    Performance review at FAANGs has always been vibes and soft skills.
    hparadiz4 hours ago
    Shipped projects don't lie.
    Almondsetat4 hours ago
    And? That's not the point of discussion
- croes4 hours ago
  But this isn’t just a narrative but a study. Limited but a study
  - jdkoeck4 hours ago
    A study from 2025. Might as well be from five years ago.
    croes5 minutes ago
    So where are all those profits from the higher productivity?
  - mschuetz4 hours ago
    Having participated in many studies, I lost much faith in studies. You could study the same thing, and get opposite results depending on how you build the study, which people participate (friends and colleagues will be reluctant to speak against your results), and the bias with which you set everything up. Also, many tasks require learning the tools, and some tools will start to be more productive with expertise than others.
vachina4 hours ago
It will feel slower because I finish my task in 1/20th the time and spend the remaining time browsing HN.
luckilydiscrete4 hours ago
https://www.faros.ai/blog/ai-software-engineering
The actual study with the data, minus the "I was right all along" commentary
raincole4 hours ago
Yeah, when people who are not familiar with AI use Cursor with Sonnet 3.7, they are only 19% slower. In retrospect that research was very bullish for AI.
make_it_sure4 hours ago
Why is this on the front page? It's an old debunked study lol, not relevant at all today. Read more than just the title.
miika4 hours ago
Well.. feelings have never been a good way to measure quantities.
jdjdjrbrbrb3 hours ago
I honestly don't care if I am faster or slower ... I am sooooooooo much less burnt out... And that metric is infinitely better for me and my company.
bezier-curve4 hours ago
One thing I see missing in a lot of these discussions is whether or not the metric is solely based on speed. I think AI just allows you to look at your code in different ways and provides more chances to catch mistakes. I am definitely slower with AI assistance, but that is because I use it to increase the quality of my work.
larodi4 hours ago
So are you trying to imply that I've somehow accidentally stumbled upon more than 100k lines of new high-perf working code, done in less than 6 months, which is like not 20%, but 200% my actual output, and this code, already generating revenue for me and my employer, is something of the ordinary, and actually I can do 20% better typing it manually?
ROFL sorry
DonHopkins4 hours ago
Luc Barthelet, who I worked with at EA, is a Mathematica whiz (he later worked at Wolfram Research on Wolfram|Alpha), and he would prototype game ideas in Mathematica, which would render out web pages with animations.
He came up with a fun idea for a racing game renderer: it distorted the perspective transformation a bit, grading depth on a curve, so far away things would linger in the distance a bit longer, then speed up and WHOOSH past you, seeming even faster than they would be photorealistically!
https://www.mobygames.com/person/29352/luc-barthelet/
https://community.wolfram.com/web/luc
jiggawatts4 hours ago
"19 August 2025"
This may as well have been written in the stone ages, when we were banging AI rocks together.
I just did a ~6 month project in ~2 weeks using a frontier model.
I wouldn't even have attempted this kind of work a year ago, with or without the AIs available at the time!
- ImprobableTruth4 hours ago
  >I just did a ~6 month project in ~2 weeks using a frontier model.
  Claims like this are hard for me to take seriously because 'good' models have been available since the start of the year. So, if they really 10x one's productivity, then people should have been able to get 5 years' worth of work done since then, but I've never actually seen anybody show any project like this.
  - jiggawatts2 hours ago
    > 'good' models have been available since the start of the year
    today: https://www.anthropic.com/news/redeploying-fable-5
    35 days ago: https://www.anthropic.com/news/claude-opus-4-8
    70 days ago: https://openai.com/index/introducing-gpt-5-5/ <-- first model I've found useful
    77 days ago: https://www.anthropic.com/news/claude-opus-4-7
    119 days ago: https://openai.com/index/introducing-gpt-5-4/
    182 days ago: The start of the year
    ImprobableTruth42 minutes ago
    Opus 4.5/4.6 are what many people consider the first 'good' models, and they're from last year/start of this year.
    But fine, let's say everything before gpt 5.5 was unusable crap. Then there should still be projects that would previously have taken ~2 years done in just two months. Where are they?
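For what it's worth, the numbers in this subthread can be checked with a quick sketch. The 182 and 70 day figures and the 10x multiplier come from the comments above; treating 6 months as 26 weeks and a year as 365 days are my assumptions.

```python
DAYS_PER_YEAR = 365  # assumption: calendar days, matching the day counts above


def implied_speedup(normal_weeks: float, actual_weeks: float) -> float:
    """Speedup implied by finishing `normal_weeks` of work in `actual_weeks`."""
    return normal_weeks / actual_weeks


def equivalent_years(elapsed_days: float, speedup: float) -> float:
    """Years of 'normal' output produced in `elapsed_days` at a given speedup."""
    return elapsed_days * speedup / DAYS_PER_YEAR


# "~6 month project in ~2 weeks", taking 6 months as 26 weeks (assumption)
print(implied_speedup(26, 2))                # 13.0

# 10x since the start of the year (182 days ago)
print(round(equivalent_years(182, 10), 1))   # 5.0

# 10x since the gpt 5.5 release (70 days ago)
print(round(equivalent_years(70, 10), 1))    # 1.9
```

So the "5 years" and "~2 years" figures follow from a 10x claim; whether anyone has actually sustained that rate is the open question here.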
  - IshKebab2 hours ago
    My guess at what's happening is that people are mostly using the tools on low impact or speculative projects. Notice that he said he wouldn't have attempted it without AI.
    That's been my experience too. I had an idea that I didn't need so I hadn't bothered doing it, but AI made it easier to just have a go. I suspect people aren't using AI as much on their main profit-making projects (which also are going to be bigger, more complex and not greenfield - which is all harder for AI).
    Also give it a chance - as you said "good" models have only been available very recently and you wouldn't expect everyone to start using them instantly.
    ImprobableTruth35 minutes ago
    Sure, but "10x faster, but only applies on small greenfield, throwaway projects" is a major caveat. In fact, there's a good chance this doesn't disprove the original blog post: you could be way faster on small projects but slower on 'real' projects.
    >Also give it a chance - as you said "good" models have only been available very recently and you wouldn't expect everyone to start using them instantly.
    But I'm not expecting everyone to have built something like that, but surely among millions of users someone should have, especially the people proclaiming insane productivity gains? There are no super impressive open source projects done using AI and all the companies boasting about how all their code is AI written now don't show much improvement either.
- Shitty-kitty4 hours ago
  That's amazing. What's even more incredible is that somehow you managed to do a real code review and testing in that time-frame.
- koe1234 hours ago
  Personal or work? Used by anyone or for fun?
  - loveparade4 hours ago
    Used by Anthropic to sell tokens
- suddenlybananas4 hours ago
  What was the project? Could you share the code?
tombot4 hours ago
“16 developers across 246 tasks”
- bfjvibybd6cuvu64 hours ago
  With barely any experience or training with AI.
bitwize4 hours ago
This study was shown to be flawed at the time; METR has retracted it. And it doesn't take into account current frontier models.
AI makes you more productive. This is no longer up for debate. The energy you spend arguing last year's talking points is better spent knuckling down and learning the tools.
- fxwin4 hours ago
  Have they retracted it? My understanding was simply that they released results with more recent data, not that this study itself was flawed (and their website doesn't mention a retraction either)
- idle_zealot4 hours ago
  What tools? The ones that will be outmoded in 6 months, or the ones that will be banned in 6 months?
  - bitwize4 hours ago
    Whatever tools come along, step one is getting into the mindset that typing code in is no longer part of your job. You are a designer and director, not a coder. As Steve Yegge says, if you still have an IDE open entering code by hand, you're one of the crappy engineers. You need to be getting into the habit of understanding the strengths and weaknesses of your model and agentic harness and using those to produce the results you want. When those change in six months, you adapt along with them. Adjusting to the new mindset is the biggest hurdle, and there'll be plenty of devs who can't, and won't make the cut or stay in the field for very long. Just like there were plenty of devs who couldn't adjust to anything beyond COBOL on punched cards.
- wisty4 hours ago
  Partial agree ...
  I suspect:
  If you know what you are doing it is a power tool.
  If you don't know what you are doing it's also a power tool - if you measure a lot of devs then the bad ones (or anyone having a bad day, or the wrong fit for a project) can make work for everyone else at an outrageous pace.
- croes4 hours ago
  So where are the profit jumps from all the productivity gains?
  Or do you just produce more code but not more productive value?
arisAlexis4 hours ago
So why are all the top labs using the tools internally? Are they lying, or are they stupid?
Devs wish that was true but it isn't and it will get better.
- suddenlybananas4 hours ago
  Conversely, why do they have so many open bugs?
- dude2507113 hours ago
  AI psychosis is neither a lie nor a stupidity.