Like if I'm not ready to jump on some AI-spiced-up special IDE, am I then going to just be left banging rocks together? It feels like some of these AI agent companies just decided "Ok, we can't adopt this into the old IDEs, so we'll build a new special IDE"? Or did I just use the wrong tools? (I use Rider and VS, and I have only tried Copilot so far, but the "agent mode" of Copilot in those IDEs feels basically useless.)
If you read someone say “I don’t know what’s the big deal with vim, I ran it and pressed some keys and it didn’t write text at all” they’d be mocked for it.
But with these tools there seems to be an attitude of “if I don’t get results straight away it’s bad”. Why the difference?
I get the same change applied multiple times, the agent has some absurd method of applying changes that conflicts with what I tell it, like some git merge from hell, and so on. I can't get it to understand even the simplest of contexts, etc.
It's not really that the code it writes might not work. I just can't get past the actual tool use. In fact, I don't think I'm even at the stage where the AI output is the problem yet.
>I get the same change applied multiple times, the agent has some absurd method of applying changes that conflicts with what I tell it, like some git merge from hell, and so on. I can't get it to understand even the simplest of contexts, etc.
That is weird. Results have a ton of variation, but not that much.
Say you get a Claude subscription, point it to a relatively self-contained file in your project, hand it the command to run the relevant tests, and tell it to find quick-win refactoring opportunities, making sure that the business outcome of the tests is maintained even if mocks need to change.
You should get relevant suggestions for refactoring, you should be able to have the changes applied reasonably, you should have the tests passing after some iterations of running and fixing by itself. At most you might need to check that it doesn't cheat by getting a false positive in a test or something similar.
Is such an exercise not working for you? I'm genuinely curious.
Sure it can, because nobody is reading manuals anymore :).
It's an interesting exercise to try: take your favorite tool you use often (that isn't some recent webshit, devoid of any documentation), find a manual (not a man page), and read it cover to cover. Say, GDB or Emacs or even coreutils. It's surprising just how many powerful features good software tools have, and how much you'll learn in a short time, that most software people don't know is possible (or worse, decry as "too much complexity") just because they couldn't be arsed to read some documentation.
> I just can't get past the actual tool use. In fact, I don't think I'm even at the stage where the AI output is the problem yet.
The tools are a problem because they're new and a moving target. They're both dead simple and somehow complex around the edges. AI, too, is tricky to work with, particularly when people aren't used to communicating clearly. There are a lot of surprising problems (such as "absurd method of applying changes") that come from the fact that AI is solving a very broad class of problems, everywhere at the same time, by virtue of being a general tool. It still needs a bit of hand-holding if your project/conventions stray away from what's obvious or popular in a particular domain. But it's getting easier and easier as months go by.
FWIW, I too haven't developed a proper agentic workflow with CLI tools for myself just yet; depending on the project, I either get stellar results or garbage. But I recognize this is only a matter of time investment: I didn't have much time to set aside and do it properly.
AI is supposed to make our work easier.
I think so. The humans should be writing the spec. The AI can then (try to) make the tests pass.
I feel like that matters more than the tooling at this point.
I can't really understand letting LLMs decide what to test or not; they seem to completely miss the boat when it comes to testing. Half of the tests are useless because they duplicate what they test, and the other half don't test what they should be testing. So many shortcuts, and LLMs require A LOT of hand-holding when writing tests, more so than for other code, I'd wager.
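To make "duplicate what they test" concrete, here's a minimal pytest-style sketch (the function and numbers are made up): the first test re-derives the expected value with the exact same logic as the implementation, so it passes even if that logic is wrong; the second pins the behaviour down independently, edge case included.

```python
# discount.py (hypothetical function under test)
def bulk_discount(quantity: int) -> float:
    return 0.10 if quantity >= 100 else 0.0


# The useless kind of test: it duplicates the implementation's logic,
# so a wrong rule in bulk_discount() would be mirrored here and still pass.
def test_bulk_discount_duplicates_implementation():
    quantity = 150
    expected = 0.10 if quantity >= 100 else 0.0
    assert bulk_discount(quantity) == expected


# The useful kind: concrete, independently chosen expectations.
def test_bulk_discount_behaviour():
    assert bulk_discount(99) == 0.0
    assert bulk_discount(100) == 0.10
    assert bulk_discount(150) == 0.10
```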
LLMs just fail (hallucinate) in lesser-known fields of expertise.
Funny: today I asked Claude to give me the syntax for running Claude Code, and its answer was totally wrong :) So you go to the documentation… and parts of it are obsolete as well.
LLM development is very much in the "move fast and break things" style.
So in a few years there will be so many repos full of gibberish code, because "everybody is a coder now", even basketball players or taxi drivers (no offense, ofc, just an example).
It is like giving an F1 car to me :)
There's obviously a whole heap of hype to cut through here, but there is real value to be had.
For example yesterday I had a bug where my embedded device was hard crashing when I called reset. We narrowed it down to the tool we used to flash the code.
I downloaded the repository, jumped into codex, explained the symptoms and it found and fixed the bug in less than ten minutes.
There is absolutely no way I'd have been able to achieve that speed of resolution myself.
- I downloaded the repository, jumped into codex, explained the symptoms and it found and fixed the bug in less than ten minutes.
Change the second step to: - I downloaded the repository, explained the symptoms, copied the relevant files into Claude Web and 10 minutes later it had provided me with the solution to the bug.
Now I definitely see the ergonomic improvement of Claude running directly in your directory, saving you copy/paste twice. But in my experience the hard parts are explaining the symptoms and deciding what goes into the context.
And let's face it, in both scenarios you fixed a bug in 10-15 minutes which might have taken you a whole hour/day/week before. It's safe to say that LLMs are an incredible technological advancement. But the discussion about tooling feels like vim vs emacs vs IDEs. Maybe you save a few minutes with one tool over the other, but that saving is often blown out of proportion. The speedup I gain from LLMs (on some tasks) is incredible. But it's certainly not due to the interface I use them in.
Also I do believe LLM/agent integrations in your IDE are the obvious future. But the current implementations still add enough friction that I don't use them as daily drivers.
Once I started working this way however, I found myself starting to adapt to it.
It's not unusual now to find myself with at least a couple of simultaneous coding sessions, which I couldn't see myself doing with the friction that using Claude Web/Codex web provides.
I also entirely agree that there's going to be a lot of innovation here.
IDEs imo will change to become increasingly focused on reading/reviewing code rather than writing it, and in fact might look entirely different.
I envy you for that. I'm not there yet. I also notice that actually writing the code helps me think through problems and now I sometimes struggle because you have to formulate problems up front. Still have some brain rewiring to do :)
"I can literally feel competence draining out of my fingers"
What exactly do you mean with "integrating agents" and what did you try?
The simplest approach (and what I do) is not "integrating them" anywhere, but just replacing "copy-paste code + write prompt + copy output back to code" with "write prompt > agent reads code > agent changes code > I review and accept/reject". Not really "integration" so much as a workflow change.
I don't really get how the workflow is supposed to work, but I think it's mostly due to how the tool is made. It has some sort of "change stack", similar to git commits/staging, but which keeps conflicting with anything I manually edit.
Perhaps it's just this particular implementation (the Copilot integration in VS) that is bad, and others are better? I have extreme trouble trying to feed it context and handling suggested AI changes without completely corrupting the code, even for small changes.
The workflow I have right now is something like what I described before, and I do it with Codex and Claude Code; both work the same. Maybe try out one of those, if you're comfortable with the terminal? It basically opens a terminal UI, it can read the current files, you enter a prompt, wait, then review the results with git or whatever VCS you use.
But I'm also never "vibe-coding": I'm reviewing every single line, and I mercilessly ask the agent to refactor whenever the code isn't up to my standards. I also restart the agent after each prompt finishes, as they get really dumb as soon as more than 20% of their "max" context is used.
Try one of the CLIs. That’s the good stuff right now. Claude Code (or similar) in your shell, don’t worry about agentic patterns, skills, MCP, orchestrators, etc etc. Just the CLI is plenty.
You should use Claude Code.
Let me select lines in my code which you are allowed to edit in this prompt and nothing else, for those "add a function that does x" requests, without it starting to run amok.
Now it's "please add one unit test for Foobar()" and it goes away and thinks for 2 minues and does nothing then I point it to where the FooBar() which it didn't find and then adds a test method then I change the name to one I like better but now the AI change wasn't "accepted"(?) so the thing is borked...
I think the UX for agents is important and... this can't be it.
It is like learning to code itself. You need flight hours.
But it's good to hear that it's not me being completely dumb, it's Copilot Agent Mode tooling that is?
Consider, as an example, that "Clean Code" used to be gospel; now it's mostly considered a book of antipatterns, and many developers prefer to follow Ousterhout instead of Uncle Bob. LLMs "read" both Clean Code and A Philosophy of Software Design, but without prompting they won't know which way you prefer things, so they'll synthesize something more or less in between these two near-complete opposites, mostly depending on the language they're writing code in.
The way I think about it is: "You are a staff software engineer with 15 years of experience in <tech stack used in the project>" is doing 80% of the job, by pulling in specific regions in the latent space associated with good software engineering. But the more particular you are about style, or the more your project deviates from what's the most popular practice across any dimension (whether code style or folder naming scheme or whatnot), the more you need to describe those deviations in your prompt - otherwise you'll be fighting the model. And then, it's helpful to describe any project-specific knowledge such as which tools you're using (VCS, testing framework, etc.), where the files are located, etc. so the model doesn't have to waste tokens discovering it on its own.
Prompts are about latent space management. You need to strengthen associations you want, and suppress the ones you don't. It can get wordy at times, for the same reason explaining some complex thought to another person often takes a lot of words. First sentence may do 90% of the job, but the remaining 20 sentences are needed to narrow down on a specific idea.
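As a rough illustration of that "latent space management" idea, here is a minimal sketch using the Anthropic Python SDK; the model id, project conventions, and request are placeholders, and the same principle applies to whatever instructions file you hand to Claude Code.

```python
# Minimal sketch: a role line plus explicit deviations-from-convention in the
# system prompt. Assumes the Anthropic Python SDK (pip install anthropic) and
# ANTHROPIC_API_KEY in the environment; everything project-specific is made up.
import anthropic

MODEL = "claude-sonnet-4-20250514"  # substitute whichever model you have access to

SYSTEM = """You are a staff software engineer with 15 years of experience in Python web services.
Project conventions that deviate from common practice:
- Ousterhout-style deep modules: prefer fewer, larger modules over many tiny ones.
- Tests live in tests/ and run with `pytest -q`; do not add new test frameworks.
- Do not introduce new dependencies without asking first.
"""

client = anthropic.Anthropic()
reply = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    system=SYSTEM,  # strengthens the associations you want before the task arrives
    messages=[{"role": "user", "content": "Propose a module layout for a new payments feature."}],
)
print(reply.content[0].text)
```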
How much more depends on what you're trying to do and in what language (e.g. a "favourite" pet peeve: Claude occasionally likes to use instance_variable_get() in Ruby instead of adding accessors; it's a massive code smell), but there are some generic things, such as giving it instructions on keeping notes, and giving it subagents to farm repetitive tasks out to, so that completing individual tasks doesn't fill up the context for tasks that are truly independent (in which case, for Claude Code at least, you can also tell it to do multiple in parallel).
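(For readers who don't write Ruby, a rough Python analogue of that smell, with a hypothetical class and attribute: reaching into private state via reflection instead of using, or adding, an accessor.)

```python
class Invoice:
    def __init__(self, total_cents: int):
        self._total_cents = total_cents  # private by convention

    @property
    def total_cents(self) -> int:
        return self._total_cents  # the accessor callers are meant to use


invoice = Invoice(1299)

# Smell (analogous to Ruby's instance_variable_get): bypassing the public
# interface and poking at private state by name.
smelly = getattr(invoice, "_total_cents")

# Preferred: use (or add) the accessor.
clean = invoice.total_cents

assert smelly == clean == 1299
```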
But, indeed, just starting Claude Code (or Codex; I prefer Claude but it's a "personality thing" - try tools until you click with one) and telling it to do something is the most important step up from a chat window.
You could still overload it with too many skills, but it helps at least.
That's exactly the point. Agents have their own context.
Thus, you try to leverage them by giving them ad-hoc instructions for repetitive tasks (such as reviewing code or running a test checklist) without polluting your own conversation/context.
I'd rather use more of them that are brief and specialized, than try to over-correct on having a single agent try to "remember" too many rules. Not really because the description itself will eat too much context, but because having the sub-agent work for too long will accumulate too much context and dilute your initial instructions anyway.
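A minimal sketch of that idea, assuming the Anthropic Python SDK (the instruction and tasks are hypothetical): each repetitive subtask runs in a brand-new conversation with a short, specialized instruction, so nothing accumulates in, or dilutes, the main session.

```python
import anthropic

MODEL = "claude-sonnet-4-20250514"  # substitute whichever model you have access to
client = anthropic.Anthropic()

# Brief, specialized instruction for the sub-agent; it never sees the main conversation.
REVIEWER = "You review one diff at a time. Report only concrete defects, at most 5 bullets."


def run_subagent(task: str) -> str:
    # A fresh messages list per call means a fresh context window per subtask.
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=REVIEWER,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text


# Only the short reports come back to the main workflow, not the sub-agents' context.
reports = [run_subagent(f"Review this diff:\n{diff}") for diff in ("<diff 1>", "<diff 2>")]
```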
And then there's Ralph, with cross-LLM consensus in a loop. It's great.
It's a new ecosystem with its own (atrocious!) jargon that you need to learn. The good news is that it's not hard to do. It's not as complex or revolutionary as everyone makes it out to be. Everything boils down to techniques and frameworks for collecting context/prompt before handing it over to the model.
So, my 2 cents: use Claude Code. In YOLO mode. Use it. Learn with it.
Whenever I post something like this I get a lot of downvotes. But well... by the end of 2026 we will not use computers the way we use them now. Claude Code in Feb 2025 was the first step; now, in Jan 2026, CoWork (Claude Code for everyone else) is here. It is just a much, much more powerful way to use computers.
I think it will take much longer than that for most people, but I disagree with the timeline, not where we're headed.
I have a project now where the entirety of the project falls into these categories:
- A small server that is geared towards making it easy to navigate the reports the agents produce. This server is 100% written by Claude Code - I have not even looked at it, nor do I have any interest in looking at it as it's throwaway.
- Agent definitions.
- Scripts written by the agents for the agents, to automate away the parts where we (well, mostly the agents) have found a part of the task is mechanical enough to either take Claude out of the loop entirely, or to produce a script that does the mechanical part interspersed with claude --print for smaller subtasks (and then systematically check whether sonnet or haiku can handle those tasks). Eventually I may get to the point of optimising it to use APIs for smaller, faster models where they can handle the tasks well enough. A sketch of one such script is below.
The goal is for an increasing proportion of the project to migrate from the second part (agent definitions) to the third part, and we do that in "production" workflows (these aren't user facing per se, but third parties do see the outputs).
That is, I started with a totally manual task I was carrying out anyway, defined agents to take over part of the process and produce intermediate reports, had it write the UI that lets me monitor the agents progress, then progressively I'd ask the agent after each step to turn any manual intervention into agents, commands, and skills, and to write tools to handle the mechanical functions we identified.
For each iteration, more stuff first went into the agent definitions, and then as I had less manual work to do, some of that time has gone into talking to the agent about which sub-tasks we can turn into scripts.
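As a sketch of what one of those agent-written helper scripts can look like (the directory layout and prompt are hypothetical, and it assumes the claude CLI accepts the prompt as an argument to --print, as in `claude -p "..."`): the mechanical part is plain code, and the model is only invoked for the narrow judgement call on each file.

```python
import pathlib
import subprocess


def classify_report(path: pathlib.Path) -> str:
    """Ask Claude one narrow question about a single report file."""
    prompt = (
        "Answer with exactly one word, OK or FOLLOWUP: does this report "
        "require a follow-up task?\n\n" + path.read_text()
    )
    result = subprocess.run(
        ["claude", "--print", prompt],  # non-interactive: print the answer and exit
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The mechanical part stays in the script; the model only sees one file at a time.
for report in sorted(pathlib.Path("reports").glob("*.md")):
    print(report.name, classify_report(report))
```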
I see myself doing this more and more, and often "claude" is now the very first command I run when I start a new project whether it is code related or not.
Claude Code is the secret.
Claude Code is the question and the answer.
Claude Code has already revolutionized this industry. Some of you are just too blind to see it yet.
Am I right in assuming that the people who use AI agent software use them in confined environments like VMs with tight version control?
Then it makes sense but the setup is not worth the hassle for me.
About 99% of the blogs [written by humans] that reach HN's front page are fundamentally incorrect. It's mostly hot takes by confident neophytes. If it's AI-written, it actually comes close to factual. The thing you don't like is usually right, the thing you like is usually wrong. And that's fine if you'd rather read fiction. Just know what you're getting yourself into.
I am ceaselessly fascinated by how we can all live in the same world yet seemingly inhabit such vastly different realities.
Scaled GitHub stars to 20,000+
Built engaged communities across platforms (2.8K X, 5.4K LinkedIn, 700+ YouTube)
etc, etc.
No doubt impressive to marketing types, but maybe a pinch of salt is required when it comes to using AI agents in production.
The other day we were discussing a new core architecture for a microservice we were meant to split out of a "larger" microservice so that separate teams could maintain each part.
Instead of discussing it entirely without any basis, I made a quick prototype via explicit prompts, telling the LLM exactly what to create, where, etc.
Finally, I asked it to go through the implementation and create a wiki page, concatenating the code and outlining in 1-4 sentences above each "file" excerpt what the goal of the file is.
In the end, I went through it to double-check that it held up to my intentions, which it did, so I didn't change anything.
Now we could all discuss the pros and cons of that architecture while going through it, and the intro sentence gave enough context to each code excerpt to improve understanding/reduce mental load as necessary context was added to each segment.
I would not have been able to allot that time to do all this without an LLM, especially the summarization to 1-3 sentences, so I'll have to disagree when you state this generally.
Though I definitely agree that a blog article like this isn't worth reading if the author couldn't even be arsed to write it themselves.
One of the better ones was "Unified LLM Interaction Model (ULIM)". You read it here first...
In one week, I fine-tuned https://github.com/kstenerud/bonjson/ for maximum decoding efficiency and:
* Had Claude do a go version (https://github.com/kstenerud/go-bonjson), which outperforms the JSON codec.
* Had Claude do a Rust version (https://github.com/kstenerud/rs-bonjson), which outperforms the JSON codec.
* Had Claude do a Swift version (https://github.com/kstenerud/swift-bonjson), which outperforms the JSON codec (although this one took some time due to the Codable, Encoder, Decoder interfaces).
* Have Claude doing a Python version with Rust underpinnings (making this fast is proving challenging)
* Have Claude doing a Jackson version (in progress, seems to be not too bad)
In ONE week.
This would have taken me a year otherwise: getting the base library going, getting a test runner going for the universal tests, figuring out how good the SIMD support is and what intrinsics I can use, what the best tooling for hot-path analysis is, trying various approaches, etc. etc. x5.
Now all I do is give Claude a prompt, a spec, and some hand-holding for the optimization phase (admittedly, it starts off at 10x slower, so you have to watch the algorithms it uses). But it's head-and-shoulders above what I could do in the last iteration of Claude.
I can experiment super quickly: "Try caching previously encountered keys and show me the performance change." 5 mins, done. Would take me a LOT longer to retool the code just for a quick test. Experiments are dirt cheap now.
The biggest bottleneck right now is that I keep hitting my token limits 1-2 hours before each reset ;-)
The pipe dream of agents handling GitHub Issue -> Pull Request -> Resolved Issue becomes a nightmare of fixing downstream regressions or other chaos unleashed by agents given too much privilege. I think people optimistic about agents are either naive or hype merchants grifting/shilling.
I can understand the grinning panic of the hype merchants because we've collectively shovelled so much capital into AI with very little to show for it so far. Not to say that AI is useless, far from it, but there's far more over-optimism than realistic assessment of the actual accuracy and capabilities.
So with the top performers, I think what's most effective is just stating clearly what you want the end result to be (with maybe some hints for verifying the results, which is really just clarifying the intent further).
Already a "no", the bottleneck is "drowning under your own slop". Ever noticed how fast agents seems to be able to do their work in the beginning of the project, but the larger it grows, it seems to get slower at doing good changes that doesn't break other things?
This is because you're missing the "engineering" part of software engineering, where someone has to think about the domain, the design, the tradeoffs, and how something will be used, which requires good judgement and wisdom about what a suitable design is for what you're trying to do.
Lately (last year or so), more client jobs of mine have basically been "Hey, so we have this project that someone made with LLMs, they basically don't know how it works, but now we have a ton of users, could you redo it properly?", and in all cases, the applications have been built with zero engineering and with zero (human) regards to design and architecture.
I have not yet had any clients come to me and say "Hey, our current vibe-coders are all busy and don't have time, help us with X"; it's always "We've built hairball X, rescue us please?", and that, to me, makes it pretty obvious what the biggest bottleneck with this sort of coding is.
Moving slower is usually faster long-term granted you think about the design, but obviously slower short-term, which makes it kind of counter-intuitive.
Like an old mentor of mine used to say:
“Slow is smooth; smooth is fast”
I've flagged it, that's what we should be doing with AI content.
[1]: https://kerrick.blog/articles/2025/use-ai-to-stand-in-for-a-...
Bullet point lists! Cool infographics! Foreign words in headings! 93 pages of problem statement -> solution! More bullet points as tradeoffs breakdown! UPDATED! NEW!
How you know something is done either by a grifter or a starving student looking for work.
- Generate a stable sequence of steps (a plan), then carry it out. Prevents malicious or unintended tool actions from altering the strategy mid-execution and improves reliability on complex tasks.
- Provide a clear goal and toolset. Let the agent determine the orchestration. Increases flexibility and scalability of autonomous workflows.
- Have the agent generate, self-critique, and refine results until a quality threshold is met (a minimal loop of this kind is sketched after this list).
- Provide mechanisms to interrupt and redirect the agent’s process before wasted effort or errors escalate. Effective systems blend agent autonomy with human oversight. Agents should signal confidence and make reasoning visible; humans should intervene or hand off control fluidly.
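Here's a minimal sketch of the generate/self-critique/refine pattern from the list above, assuming the Anthropic Python SDK; the prompts, the round limit, and the "reply exactly OK" quality threshold are placeholders rather than a recommendation.

```python
import anthropic

MODEL = "claude-sonnet-4-20250514"  # substitute whichever model you have access to
client = anthropic.Anthropic()


def ask(system: str, user: str) -> str:
    resp = client.messages.create(
        model=MODEL, max_tokens=1024, system=system,
        messages=[{"role": "user", "content": user}],
    )
    return resp.content[0].text


def generate_and_refine(task: str, max_rounds: int = 3) -> str:
    draft = ask("You are a careful technical writer.", task)
    for _ in range(max_rounds):
        critique = ask(
            "You are a strict reviewer. List concrete problems, or reply with exactly OK.",
            f"Task:\n{task}\n\nDraft:\n{draft}",
        )
        if critique.strip() == "OK":  # crude quality threshold
            break
        draft = ask(
            "You are a careful technical writer.",
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nReviewer feedback:\n{critique}\n\nRevise the draft.",
        )
    return draft
```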
If you've ever heard of "continuous improvement", now is the time to learn how that works and hook it into your AI agents. But scrap that, it's better to just think about agent patterns from scratch. It's a green field and, unless you consider yourself profoundly uncreative, the process of thinking through agent coordination is going to yield much greater benefit than eating ideas about patterns through a tube.
0: https://arxiv.org/search/?query=agent+architecture&searchtyp...
It literally gets "stuck" and becomes un-scrollable.