Make a commit.
Give Claude a task that's not particularly open-ended. The closer the task is to pure "monkey work" boilerplate nonsense, the better (that is also the sort of code I don't want to deal with myself).
Preferably it should be something that only touches a file or two in the codebase, unless it is a trivial refactor (like changing the same method call all over the place).
Make sure it is set to planning mode and let it come up with a plan.
Review the plan.
Let it implement the plan.
If it works, great, move on to review. I've seen it one-shot some pretty annoying tasks like porting code from one platform to another.
If there are obvious mistakes (program doesn't build, tests don't pass, etc.) then a few more iterations usually fix the issue.
If there are subtle mistakes, make a branch and have it try again. If it fails, then this is beyond what it can do, abort the branch and solve the issue myself.
Review and clean up the code it wrote; it's usually a lot messier than it needs to be. This also allows me to take ownership of the code: I now know what it does and how it works.
I don't bother giving it guidelines or guardrails or anything of the sort; it can't follow them reliably. Even something as simple as "This project uses CMake, build it like this" was repeatedly ignored as it kept trying to invoke the makefile directly and in the wrong folder.
This doesn't save me all that much time since the review and cleanup can take long, but it serves as a great unblocker.
I also use it as a rubber duck that can talk back and as a documentation source. It's pretty good for that.
This idea of having an army of agents all working together on the codebase is hilarious to me. Replace "agents" with "juniors I hired on Fiverr with anterograde amnesia" and it's about how well it goes.
I started out by letting it write a naive C version without intrinsics, and validated it against the PyTorch version.
Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.
Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.
It then tested a few different variants, ran each on the target (RISC-V SBC, OrangePI RV2) to check whether it improved runtime, and then continued from there. It did this 10 times, until it arrived at the final version.
The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) make this work very well.
IIRC, depthwise is memory bound, so the bar might be lower. Perhaps you can try something with higher compute intensity, like a matrix multiply. I have observed that it trips up on the columnar accesses for SIMD.
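A minimal sketch of that kind of guardrail, assuming each variant's output and runtime have already been collected from the target. The names `Trial`, `accept` and `run_loop` are hypothetical and are not taken from the setup described above; the point is only that a variant is kept when its output equals the reference exactly and it beats the best runtime so far.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Trial:
    """One candidate variant as measured on the target (hypothetical fields)."""
    name: str
    output: Sequence[float]  # values produced by the candidate kernel
    runtime_s: float         # measured wall-clock time on the target


def accept(trial: Trial, reference: Sequence[float], best_runtime_s: float) -> bool:
    """Keep a variant only if it matches the reference exactly AND is faster."""
    same_output = len(trial.output) == len(reference) and all(
        a == b for a, b in zip(trial.output, reference)
    )
    return same_output and trial.runtime_s < best_runtime_s


def run_loop(
    trials: Sequence[Trial], reference: Sequence[float], baseline_s: float
) -> Tuple[str, float]:
    """Walk through the variants in order, tracking the best accepted one."""
    best_name, best_s = "baseline", baseline_s
    for trial in trials:
        if accept(trial, reference, best_s):
            best_name, best_s = trial.name, trial.runtime_s
    return best_name, best_s
```

A fast variant with a wrong output is rejected outright, so the loop cannot trade correctness for speed.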
Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong; fine, it happens. But worse, the 30 or so tests they added increased the test run time by 10 minutes, and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests.
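One cheap way to catch tests like that is a mutation-style check: run the suite against a deliberately broken implementation and flag every test that still passes. The following is only a sketch of the idea in Python; `add`, `broken_add` and the two checks are made-up stand-ins, not the code from the story.

```python
from typing import Callable, Iterable, List

# A "check" is any callable that takes the function under test and raises
# AssertionError on failure. All names here are hypothetical.
Check = Callable[[Callable[[int, int], int]], None]


def add(a: int, b: int) -> int:
    return a + b


def broken_add(a: int, b: int) -> int:
    """Deliberately wrong implementation, used as the mutant."""
    return a - b


def vacuous_check(fn: Callable[[int, int], int]) -> None:
    assert True  # never calls fn, so it can never fail


def real_check(fn: Callable[[int, int], int]) -> None:
    assert fn(2, 3) == 5


def passes(check: Check, fn: Callable[[int, int], int]) -> bool:
    """Run one check against fn and report whether it passed."""
    try:
        check(fn)
        return True
    except AssertionError:
        return False


def find_vacuous(checks: Iterable[Check], good, mutant) -> List[str]:
    """Names of checks that pass on both the real code and the mutant."""
    return [c.__name__ for c in checks if passes(c, good) and passes(c, mutant)]
```

Here `find_vacuous([vacuous_check, real_check], add, broken_add)` reports only `vacuous_check`, because `real_check` fails on the mutant as it should.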
Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.
Hopefully someone with a larger context window than myself can recall the article in question.
Purely anecdotally, I've found agents have gotten much better at asking clarifying questions, stating that two requirements are incompatible and asking which one to change, and so on.
But when I use Claude Code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests it better have a damn good reason or it gets the hose again.
The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!
I agree, although I think the problem usually comes in writing the spec in the first place. If you can write detailed enough specs, the agent will usually give you exactly what you asked for. If your spec is vague, it's hard to eyeball whether the tests, or even the implementation of the tests, match what you're looking for.
A very human solution
* I came up with a list of 9 performance improvement ideas for an expensive pipeline. Most of these were really boring and tedious to implement (basically a lot of special cases) and I wasn't sure which would work, so I had Claude try them all. It made prototypes that had bad code quality but tested the core ideas. One approach cut the time down by 50%, I rewrote it with better code and it's saved about $6,000/month for my company.
* My wife and I had a really complicated spreadsheet for tracking how much we owed our babysitter – it was just complex enough to not really fit into a spreadsheet easily. I vibecoded a command line tool that's made it a lot easier.
* When AWS RDS costs spiked one month, I set Claude Code to investigate and it found the reason was a misconfigured backup setting
* I'll use Claude to throw together a bunch of visualizations for some data to help me investigate
* I'll often give Claude the type signature for a function, and ask it to write the function. It generally gets this about 85% right
Agentic programming is a skill-set and a muscle you need to develop just like you did with coding in the past.
Things didn’t just suddenly go downhill after an arbitrary tipping point - what happened is you hit a knowledge gap in the tooling and gave up.
Reflect on what went wrong and use that knowledge next time you work with the agent.
For example, investing the time in building a strong test suite and testing strategy ahead of time which both you and the agent can rely on.
Being able to manage the agent and get quality results on a large, complex codebase is a skill in itself; it won’t happen overnight.
It takes practice and repetition with these tools to level up, just like anything else.
I don’t think agentic programming is some promised land of instant code without bugs.
It’s just a force multiplier for what you can do.
- Ask Claude to look at my current in-progress task (from Github/Jira/whatever) and repro the bug using the Chrome MCP.
- Ask it to fix it
- Review the code manually, usually it's pretty self-contained and easy to ensure it does what I want
- If I'm feeling cautious, ask it to run "manual" tests on related components (this is a huge time-saver!)
- Ask it to help me prepare the PR: This refers to instructions I put in CLAUDE.md so it gives me a branch name, commit message and PR description based on our internal processes.
- I do the commit operations, PR and stuff myself, often tweaking the messages / description.
- Clear context / start a new conversation for the next bug.
On a personal project where I'm less concerned about code quality, I'll often do the plan->implementation approach. Getting pretty in-depth about your requirements obviously leads to a much better plan. For fixing bugs it really helps to tell the model to check its assumptions, because that's often where it gets stuck and creates new bugs while fixing others.
All in all, I think it's working for me. I'll tackle 2-3 day refactors in an afternoon. But obviously there's a learning curve and having the technical skills to know what you want will give you much better results.
You still need to think about how you would solve the problem as an engineer and break down the task into a right-sized chunk of work. i.e. If 4 things need to change, start with the most fundamental change which has no other dependencies.
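To make the "most fundamental change first" ordering concrete, here is a small sketch using Python's standard `graphlib`. The four change names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical example: four changes, each mapped to the changes it depends on.
depends_on = {
    "update_schema": set(),
    "update_model": {"update_schema"},
    "update_api": {"update_model"},
    "update_ui": {"update_api", "update_model"},
}

# static_order() yields every change after all of its dependencies, so the
# change with no dependencies comes out first.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Each item in `order` is then a candidate for one right-sized task handed to the agent, starting with `update_schema`.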
Also, it is important to manage the context window. For a new task, start a new "chat" (new agent). Stay on topic. You'll be limited to about five back-and-forths before performance starts to suffer. (Cursor shows a visual indicator of this in the form of the circle/wheel icon.)
For larger tasks, tap the Plan button first, and guide it to the correct architecture you are looking for. Then hit build. Review what it did. If a section of code isn't high-quality, tell Claude how to change it. If it fails, then reject the change.
It's a tool that can make you 2 - 10x more productive if you learn to use it well.
For sysops stuff I have found it extremely useful. Once it has MCPs into all relevant services, I use it as the first place I go to ask what is happening with something specific on the backend.
Anyone who claims AI is great is not building a large or complex enough app, and when it works for their small project, they extrapolate to all possibilities. So because their example was generated from a prompt, it's incorrectly assumed that any prompt will also work. That doesn't necessarily follow.
The reality is that programming is widely underestimated. The perception is that it's just syntax in a text file, but it's really more like a giant abstract machine with moving parts. If you don't see the giant machine with moving parts, chances are you are not going to build good software. For AI to do this, it would require strong reasoning capabilities that let it derive logical structures, along with long-term planning and simulation of this abstract machine. I predict that if AI can do this, then it will be able to do every single other job, including physical jobs, as it would be able to reason within a robotic body in the physical world.
To summarize, people are underestimating programming, using their simple projects to incorrectly extrapolate to any possible prompt, and missing the hard part of programming which involves building abstract machines that work on first principles and mathematical logic.
I can't speak for everyone, but lots of us fully understand that the AI tooling has limitations and realize there's a LOT of work that can be done within those limitations. Also, those limitations are expanding, so it's good to experiment to find out where they are.
Conversely, it seems like a lot of people are saying that AI is worthless because it can't build arbitrarily large apps.
I've recently used the AI tooling to make a DocuSign-like service and it did a fairly good job of it, requiring about a day's worth of my attention. That's not an amazingly complex app, but it's not nothing either. Ditto for a calorie tracking web app. Not the most complex app, but companies are making legit money off them, if you want a tangible measure of "worth".
That might be true for agentic coding (caveat below), but AI in the hands of expert users can be very useful - "great" - in building large and complex apps. It's just that it has to be guided and reviewed by the human expert.
As for agentic coding, it may depend on the app. For example, Steve Yegge's "beads" system is over a quarter million lines of allegedly vibe-coded Go code. But developing a CLI like that may be a sweet spot for LLMs, it doesn't have all the messiness of typical business system requirements.
Is that really a success? I was just reading an article talking about how sloppy and poorly implemented it is: https://lucumr.pocoo.org/2026/1/18/agent-psychosis/
I guess it depends on what you’re looking to get out of it.
In order to research this better, I built (ironically, mostly vibe coded) a tool to run structured "self-experiments" on my own usage of AI. The idea is that I init a bunch of hypotheses I have around my own productivity/fulfillment/results with AI-assisted coding. The tool lets me establish those, then run "blocks" where I test a particular strategy for a time period (default 2 weeks). So for example, I might have a "no AI" block followed by a "some AI" block followed by a "full agent all-in AI" block.
The tool is there to make doing check-ins easier, basically a tiny CLI wrapper around journaling that stays out of my way. It also does some static analysis on commit frequency, code produced, etc. but I haven't fleshed out that part of it much and have been doing manual analysis at the end of blocks.
For me this kind of self-tracking has been more helpful than hearsay, since I can directly point to periods where it was working well and try to figure out why or what I was working on. It's not fool-proof, obviously, but for me the intentionality has helped me get clearer answers.
Whether those results translate beyond a single engineer isn't a question I'm interested in answering and feels like a variant of developer metrics-black-hole, but maybe we'll get more rigorous experiments in time.
The tool is open source here (there may be bugs, I've only been using it a few weeks): https://github.com/wellwright-labs/devex
Basically my point of view is that if you don't feel comfortable reviewing your coworkers' code, you shouldn't generate code with AI, because you will review it badly and then I will have to catch the bugs and fix them (happened 24 hours ago). If you generate code, you had better understand where it can produce side effects.
2. Part of the plan should be automated tests. AI can make these for you too, but you should spot check for reasonable behavior.
3. Use Claude 4.5 Opus
4. Use Git, get the AI to check in its work in meaningful chunks, on its own git branch.
5. Ask the AI to keep an append-only developer log as a markdown file, and to update it whenever its state significantly changes, or it makes a large discovery, or it is "surprised" by anything.
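For the append-only log in the last item, a minimal sketch of what "append-only" can mean in practice, assuming a plain markdown file. The function name, the entry layout and the `kind` labels are all hypothetical; the only thing that matters is that the file is opened in append mode and earlier entries are never rewritten.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_log_entry(log_path: Path, kind: str, message: str) -> str:
    """Append one timestamped markdown entry; earlier entries are untouched."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {stamp} [{kind}]\n\n{message.strip()}\n"
    # Mode "a" only ever adds to the end of the file (and creates it if missing).
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

A call such as `append_log_entry(Path("DEVLOG.md"), "surprise", "Build needs CMake, not make.")` adds one section to the end of the file and returns the text it wrote.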
In my org we are experimenting with agentic flows, and we've noticed that model choice matters especially for autonomy.
GPT-5.2 performed much better for long-running tasks. It stayed focused, followed instructions, and completed work more reliably.
Opus 4.5 tended to stop earlier and take shortcuts to hand control back sooner.
One-shotting an application that is very bespoke and niche is not going to go well, and same goes for working on an existing codebase without a pile of background work on helping the model understand it piece by piece, and then restricting it to small changes in well-defined areas.
It's like teaching an intern.
With that in mind, a couple of comments - think of the coding agents as personalities with blind spots. A code review by all of them and a synthesis step is a good idea. In fact currently popular is the “rule of 5” which suggests you need the LLM to review five times, and to vary the level of review, e.g. bugs, architecture, structure, etc. Anecdotally, I find this is extremely effective.
Right now, Claude is in my opinion the best coding agent out there. With Claude Code, the best harnesses are starting to automate the review / PR process a bit, but the hand holding around bugs is real.
I also really like Yegge’s beads for LLMs keeping state and track of what they’re doing — upshot, I suggest you install beads, load Claude, run ‘!bd prime’ and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or codex) work through them. But you’ll probably find a fresh eye will save time, e.g. give Claude a try for a day.
Your ‘duplicated code’ complaint is likely an artifact of how codex interacts with your codebase - codex in particular likes to load smaller chunks of code in to do work, and sometimes it can get too little context. You can always just cat the relevant files right into the context, which can be helpful.
Finally, iOS is a tough target — I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so there’s less facility in the coding models.
And front end work in general doesn’t really have good native visual harnesses set up (although Claude has the Claude Chrome extension for web UIs). So there’s going to be more back and forth.
Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.
Write a good AGENTS.md (or CLAUDE.md) and you'll see that code is more idiomatic. Ask it to keep a changelog. Have the LLM write a plan before starting code. Ask it to ask you questions. Write abstraction layers it (along with the fellow humans of course) can use without messing with the low-level detail every time.
In a way you have to develop a framework to guide the LLM behavior. It takes time.
Just start smaller. I'm not sure why people try to jump immediately to creating an entire app when they haven't even gotten any net-positive results at all yet. Just start using it for small time saving activities and then you will naturally figure out how to gradually expand the scope of what you can use it for.
I am writing automation software that interfaces with a legacy Windows CAD program. Depending on the automation, I just need a picture of the part. Sometimes I need part thickness. Sometimes I need to delete parts. Etc... It's very much interacting with the CAD system and checking the CAD file or output for desired results.
I was considering something that would take screenshots and send it back for checks. Not sure what platforms can do this. I am stumped how Visual Studio works with this, there are a bunch of pieces like servers, agents, etc...
Even a how-to link would work for me. I imagine this would be extremely custom.
Still takes much less time for me to review the plan and output than write the code myself.
So typing was a bottleneck for you? I’ve only found this true when I’m a novice in an area. Once I’m experienced, typing is an inconsequential amount of time. Understanding the theory of mind that composes the system is easily the largest time sink in my day to day.
There are projects where throwing a dozen junior developers at the problem can work but they’re very basic CRUD type things.
As in "Please write just this one for me". Even still, I take care to review each line produced. The key is making small changes at a time.
Otherwise, I type out and think about everything being done when in ‘Flow State’. I don't like the feeling of vibe coding for long periods. It completely changes the way work is done, it takes away agency.
On a bit of a tangent, I can't get in Flow State when using agents. At least not as we usually define it.
- Cleaner code
- Easily 5x speed minimum
- Better docs, designs
- Focus more on the product than the mechanics
- More time for family
https://www.linkedin.com/pulse/concrete-vibe-coding-jorge-va...
The bottom line is this:
* The developer stops being a developer, and becomes a product designer with high technical skills.
* This is a different set of skills than a developer or a product owner currently has. It is a mix of both, and the expectations of how agentic development works need to be adjusted.
* Agents will behave like junior developers: they can type very fast and produce something that has a high probability of working. Their priority will be to make it work, not maintainability, scalability, etc. Agents can achieve that if you detail how to produce it.
* Working with an agent feels more like mentoring the AI than ask and receive.
* When I start to work on a product that will be vibe coded, I need to have clear in my head all the user stories, code architecture, the whole system; then I can start to tell the agent what to build, and correct and annotate in the md files the code quality decisions so it remembers them.
* Use TDD: ask the agent to create the tests, and then code to the tests. Don't correct the bugs yourself; make the agent correct them and explain why that is a bug, especially with code design decisions. Store those in an AGENTS.md file at the root of the project.
There are more things that can be done to guide the agent, but I need to have clear in an articulable way the direction of the coding. On the other side, I don't worry about implementation details like how to use libraries and APIs that I am not familiar with, the agent just writes and I test.
Currently I am working on a product and I can tell you, working no more than 10 hours a week (2 hours here, 3 there, leaving the agent working while I am having dinner with family), I am progressing, I would say, 5 to 10 times faster than without it. So, yeah, it works, but I had to adjust how I do my job.
The only positive agentic coding experience I had was using it as a "translator" from some old unmaintained shell + C code to Go.
I gave it the old code, told it to translate to Go. I pre-installed a compiled C binary and told it to validate its work using interop tests.
It took about four hours of what the vibecoding lovers call "prompt engineering" but at the end I have to admit it did give me a pretty decent "translation".
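The interop-test idea (old binary as the oracle, new program as the candidate) can be sketched roughly like this. It is an illustration in Python rather than the actual harness used for that port; `run` and `mismatches` are hypothetical helpers, and the command lines passed in are whatever launches the two programs on your machine.

```python
import subprocess
from typing import List, Sequence, Tuple


def run(cmd: Sequence[str], stdin_data: bytes) -> Tuple[int, bytes]:
    """Run a command, feed it stdin, and return (exit code, stdout)."""
    proc = subprocess.run(cmd, input=stdin_data, capture_output=True, timeout=30)
    return proc.returncode, proc.stdout


def mismatches(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    inputs: Sequence[bytes],
) -> List[bytes]:
    """Return every input on which the two programs disagree."""
    return [
        data
        for data in inputs
        if run(reference_cmd, data) != run(candidate_cmd, data)
    ]
```

An empty result means the translation behaved like the reference on every input tried, which says nothing about inputs that were not tried.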
However for everything else I have tried (and yes, vibecoders, "tried" means very tightly defined tasks) all I have ever got is over-engineered vibecoding slop.
The worst part of it is that because the typical cut-off window is anywhere between 6–18 months prior, you get slop that is full of deprecated code, because there is almost always a newer/more efficient way to do things. Even in languages like Go. The difference between an AI-slop answer for Go 1.20 and a human-coded Go 1.24/1.25 one can be substantial.
Feed it little tasks (30 s-5 min) and if you don't like this or that about the code it gives you, either tell it something like
"Rewrite the selection so it uses const, ? and :"
or edit something yourself and say "I edited what you wrote to make it my own, what do you think about my changes?"
If you want to use it as a junior dev who gets sent off to do tickets and comes back with a patch three days later that will fail code review be my guest, but I greatly enjoy working with a tight feedback loop.
I don’t know what I do differently, but I can get Cursor to do exactly what I want all the time.
Maybe it’s because it takes more time and effort, and I don’t connect to GitHub or actual databases, nor do I allow it to run terminal commands 99% of the time.
I have instructions for it to write up readme files of everything I need to know about what it has done. I’ve provided instructions and created an allow list of commands so it creates local backups of files before it touches them, and I always proceed through a plan process for any task that is slightly more complicated, followed by plan cleanup, and execution. I’m super specific about my tech stack and coding expectations too. Tests can be hard to prompt, I’ll sometimes just write those up by hand.
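As a rough illustration of the two safeguards mentioned here (an allow list of commands, and a local backup before a file is touched), a sketch in Python. This is not how Cursor implements either feature; `ALLOWED_COMMANDS`, `is_allowed` and `backup_then_write` are hypothetical, and the allow-list check is deliberately conservative.

```python
import shlex
import shutil
from pathlib import Path

# Hypothetical allow list: only these executables may be run.
ALLOWED_COMMANDS = {"cp", "ls", "cat", "git"}

# Reject anything that could chain or redirect commands.
SHELL_METACHARACTERS = ";&|`$><"


def is_allowed(command_line: str) -> bool:
    """True if the command is a single call to an allow-listed executable."""
    if any(ch in command_line for ch in SHELL_METACHARACTERS):
        return False
    parts = shlex.split(command_line)
    return bool(parts) and parts[0] in ALLOWED_COMMANDS


def backup_then_write(path: Path, new_text: str) -> Path:
    """Copy the existing file to <name>.bak, then overwrite the original."""
    backup = path.with_name(path.name + ".bak")
    if path.exists():
        shutil.copy2(path, backup)
    path.write_text(new_text, encoding="utf-8")
    return backup
```

With this, `is_allowed("ls -la")` is true while `is_allowed("ls && rm x")` is false, and every overwrite leaves the previous contents next to the file.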
Also, I’ve never had to pay over my $60 a month pro plan price tag. I can’t figure out how others are even doing this.
At any rate, I think the problem appears to be the blind commands of “make this thing, make it good, no bugs” and “this broke. Fix!” I kid you not, I see this all the time with devs. Not at all saying this is what you do, just saying it’s out there.
And “high quality code” doesn’t actually mean anything. You have to define what that means to you. Good code to me may be slop to you, but who knows unless it is defined.
Otherwise, they are bad.
i.e. You are asking a question about whether using agents to write code is net-positive, and then you go on about not reviewing the code agents produce.
I suspect agents are often net-positive AND one has to review their code. Just like most people's code.
If you are continually accumulating technical debt due to an over-enthusiastic junior developer (or agent) churning out a lot of poorly-conceived code, then the recurring costs will sink you in the long run
Huh? Every new PR is new code, which is a new cost?
I think in most cases the speed at which AI can produce code outweighs technical debt, etc.
If agentic coding worked as well as people claimed on large codebases, I would be seeing a massive shift at my job... I'm really not seeing it.
We have access to pretty much all the latest and greatest internally at no cost and it still seems the majority of code is still written and reviewed by people.
AI assisted coding has been a huge help to everyone but straight up agentic coding seems like it does not scale to these very large codebases. You need to keep it on the rails ALL THE TIME.
I still mostly write my own code, and I’ve seen our Claude Code usage: me just asking it questions and generating occasional boilerplate and one-off scripts puts me in the top quartile of users. There are some people who are all in and have it write everything for them, but there doesn’t seem to be any evidence they’re more productive.
Now in terms of using AI, the key is to view yourself as a technical lead, not a people manager. You don't stop coding completely or treat underlying frameworks as a black box, you just do less of it. But at some point fixing a bug yourself is faster than writing a page of text explaining exactly how you want it fixed. Although when you don't know the programming language, giving pseudocode or sample code in another language can be super handy.