Where it does help is when I'm too tired to start figuring out a problem. It's easier to prompt in natural language and get the agent to ask lots of clarifying questions than it is to get stuck in code in the evening after I have worked all day and have lots of other things on my mind.
Every time I actually crack the code open though, it's almost impossible to figure out certain parts of it. Abstractions are all over the place and leakages are the norm, there's no theory of the system because the LLM doesn't theorise, and as soon as the first anti-pattern slips through, subsequent agents pick up on it and amplify it into a set pattern.
Two nights ago I sat down and decided to build a little project that's been on my list for ages: reading images of the 7-segment LED displays on the front of my washer and dryer and turning them into numeric minutes-remaining values I can use in Home Assistant. I have a 10yo raspi with camera pointed at them, and the images are pretty blurry; it's been hooked up to a little web frontend which pulls out the two displays and shows them in a Home Assistant iFrame.
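For anyone wondering what "turning them into numeric minutes-remaining values" looks like once the segments are readable, here is a minimal sketch. It assumes some earlier step has already decided which of the seven segments are lit for each digit, which is the hard part with blurry images and is not shown. The segment order and the names `decode_digit` and `minutes_remaining` are made up for illustration.

```python
# Assumed segment order (a, b, c, d, e, f, g):
# top, top-right, bottom-right, bottom, bottom-left, top-left, middle.
SEGMENTS_TO_DIGIT = {
    (1, 1, 1, 1, 1, 1, 0): 0,
    (0, 1, 1, 0, 0, 0, 0): 1,
    (1, 1, 0, 1, 1, 0, 1): 2,
    (1, 1, 1, 1, 0, 0, 1): 3,
    (0, 1, 1, 0, 0, 1, 1): 4,
    (1, 0, 1, 1, 0, 1, 1): 5,
    (1, 0, 1, 1, 1, 1, 1): 6,
    (1, 1, 1, 0, 0, 0, 0): 7,
    (1, 1, 1, 1, 1, 1, 1): 8,
    (1, 1, 1, 1, 0, 1, 1): 9,
}


def decode_digit(segments):
    """Map seven on/off flags to a digit, or None if the pattern is unknown."""
    key = tuple(int(bool(s)) for s in segments)
    return SEGMENTS_TO_DIGIT.get(key)


def minutes_remaining(display):
    """Combine per-digit segment flags (most significant digit first).

    Returns None if the display is empty or any digit is unreadable,
    so a bad frame never turns into a plausible-looking number.
    """
    if not display:
        return None
    value = 0
    for segments in display:
        digit = decode_digit(segments)
        if digit is None:
            return None
        value = value * 10 + digit
    return value


one = (0, 1, 1, 0, 0, 0, 0)
five = (1, 0, 1, 1, 0, 1, 1)
reading = minutes_remaining([one, five])
```

Returning `None` for an unreadable frame, rather than guessing, keeps garbage out of whatever consumes the value downstream.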
I figured I could ask a model to do the annoying part of figuring out all the frameworks and that sort of crap. So I asked my agent (I'm using some free agents that are pretty decent - Nemotron Ultra from OpenRouter and Big Pickle from OpenCode Zen) to build me an OpenCV classifier to try to read the digits. I asked it to write me a labeling UI, ran some loads of laundry and captured a couple hundred images and labeled them manually. Then I had it try to build a template-based classifier using some basic techniques - I didn't really give it much guidance other than general parameters, and it put together something that looked pretty sophisticated, and it claimed 100% accuracy, which seemed hard to believe. Turns out I forgot to tell it to hold out some sample images...
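The 100% number is what you would expect when a classifier is scored on the same images it was built from. A minimal sketch of the hold-out idea, using toy data and a deliberately dumb "memorizing" classifier rather than the actual OpenCV code (all names here are hypothetical):

```python
import random


def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and set aside a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, samples):
    """Fraction of (features, label) pairs that predict() gets right."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


# Toy stand-in for labeled images: each "image" is a unique id, label is a digit.
samples = [(i, i % 10) for i in range(200)]
train, test = holdout_split(samples)

# A classifier that only memorizes what it has seen during "training".
memory = dict(train)


def predict(features):
    return memory.get(features, -1)  # -1 means "never seen this one"


train_acc = accuracy(predict, train)  # perfect on data it has already seen
test_acc = accuracy(predict, test)    # falls apart on held-out data
```

The toy classifier is an extreme case, but the gap between `train_acc` and `test_acc` is the thing to look at before trusting any reported accuracy.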
After some iteration (which felt very similar to conversations I overheard at my desk! I might have actually learned some stuff by osmosis) I gave up on the old-school approach when it was only about 70% accurate, and asked it to train me a CNN model. First one was too simple (worse than the original approach), but the second one is very good. With my already labeled dataset and the previous work that had been done on the classifier, the free model was able to build me my custom model, and deployment scripts, in about half an hour.
I didn't look at any of the code, but I had it build me a bunch of various visualization and tuning UIs. I was basically acting as a PM/TPM/QA engineer, and what I was able to do in a couple evenings is stuff that entire teams used to spend weeks on.
Yup, that is why we are seeing so many production databases being deleted and endless vulnerabilities.
No engineer with proper common sense will grant an agentic AI API access to the database.
"Ohh but it is read-only API access". It does not matter. You are still using a public service and your data is being stored elsewhere for training.
Unless you are self-hosting an agentic + LLM solution, it shouldn't have even read-only access to a database. None of this deters companies, because they just want AI to replace engineers everywhere they can.
Definitely!!
It is here to stay. It was made public carelessly, so now it is widely used to break into systems, forcing companies to depend on it to fight machine with machine.
However, that doesn't mean granting it full access to your cloud environment, and this is what lots of companies are getting wrong.
There is no proper boundary in place. All it takes is a single mistake: in the best case, there goes your entire environment; in the worst case, your env is now open to the public :)
>Where I am there is zero emphasis on security with agents
This was terrible before AI anyway, agentic AI tools are just exposing what already existed.
Plus, as companies blindly use AI-generated code, there are no measures in place to make sure that code doesn't have vulnerabilities in it either.
It is the perfect storm.
I say small improvement because my experience is that modern Agents are pretty good, so by the time they've handed it back to me to test it, there are usually only one or two remaining issues that I'll discover as we roll it out to Production.
E.g. we give Claude credentials for db - but it's never prod data.
A better workflow would be to let LLMs directly access the same verification tools you use. This allows LLMs to observe failures during the loop and incorporate the info more organically, without giving failures too much attention priority.
The above is based on my own experience. LLMs perform better in a positive context (e.g. constructive thinking, building outward, what to do) than in a negative one (e.g. restrictive thinking, carving context inward, what NOT to do). LLMs themselves are designed to be defensive & negative, but they get easily confused under lots of prohibitive rules. LLMs are good at expansive exploration, but suck at verification and pin-pointing what you want. (I'm not sure whether it's related, but this mantra is also true for image generation using Stable Diffusion)
Often I notice errors trying it out in production. This assumes you trust it with access to the production database. How far are you willing to go?
LLMs are gullible, so you should never give Claude access to anything unless you're okay with it leaking. It might make sense to give it partial access, but that's usually going to be more involved than giving Claude an API key. That key could be exfiltrated.
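One way "partial access" can look in practice: the agent never sees a credential at all, only a tool that runs queries from a fixed allow-list. This is a rough sketch with an in-memory SQLite database. The table, the query name, and `run_named_query` are all invented for illustration, and whatever rows the tool returns still end up in the model's context.

```python
import sqlite3

# Hypothetical allow-list: the agent picks a name, never writes SQL itself.
ALLOWED_QUERIES = {
    "order_count_by_status": (
        "SELECT status, COUNT(*) FROM orders GROUP BY status ORDER BY status"
    ),
}


def run_named_query(conn, name, max_rows=50):
    """Run one pre-approved query and return at most max_rows rows."""
    sql = ALLOWED_QUERIES.get(name)
    if sql is None:
        raise PermissionError(f"query {name!r} is not on the allow-list")
    return conn.execute(sql).fetchmany(max_rows)


def make_demo_db():
    """Build a throwaway database so the sketch is self-contained."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (status, email) VALUES (?, ?)",
        [
            ("open", "a@example.com"),
            ("open", "b@example.com"),
            ("shipped", "c@example.com"),
        ],
    )
    return conn


conn = make_demo_db()
rows = run_named_query(conn, "order_count_by_status")
```

Here only aggregate counts reach the agent, not the `email` column, and there is no key to exfiltrate. It is more work than handing over an API key, which is the trade-off described above.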
I just let the agent run - it'll run better diagnostics than I can (misc. git, permission checks, commands with flags I don't remember).
If the process yields an error - it means it can't solve it and I have to step in.
Being desperate and copy pasting the error back in is just foolish procrastination.
The actual body of the article, which amounts to just passing in your API keys, is insane tho.
Some people are borderline afraid to touch their keyboards these days.
You’re absolutely right and getting out of the way is the future.
But still, between the lines the blog seems to want to picture an imaginary AI agent that has somewhat predictable behavior ("if you do X with your agent, you will achieve outcome Y"), which is definitely the wrong expectation.
Following the output of agent “thinking” simulation lines up pretty good with what I’ve been doing for 20ish years, but of course I may just be a moron who isn’t good at computers.
This is why LLMs do their best work at "leaf nodes", building on existing infrastructure but not designing new patterns on their own.
LLMs can't introspect, reason, or build internal models of the world. You can get very far without that, but there are some subtle ways it will bite you, and it's a fundamental limitation. Hallucinations are one: they are the feature, not a bug.
One difference is that you can (typically) keep on banging the prompt hammer until the problem stops twitching. That might make you want to delegate more.
That in turn might make you refactor the project with more, larger delegated areas. Increased delegation is one recently-added difference between programming and software engineering.
besides, it doesn't even have to be about writing code; finding a bug is more time consuming than fixing it, so you could at least limit yourself to that
With an LLM I must first understand (usually really just infer and guess) its intention, which is much more difficult.
a chainsaw is a coarse tool and I liken it to vibe coding. you maintain at least some level of control, but the edges are rough and you might slice off more (or less) than you meant to. I want to model my usage more like a table saw, a precision instrument that can make the exact cut just as I planned it
Willy nilly giving an agent more (write) access to figure out a bug ... man you're daydreaming.
LLMs work best when they can call a tool and observe the success/failure of a change. If you're HITL then you're the tool, but the result is the same, only slower.
I'm working on a 2D game (pixi.js) with claudecode, and after I moved some logic into a webworker the LLM created a headless simulation exercise of it and would run this to test performance changes against (or in exploration of an issue), which I was surprised by.
I also created some robust graphs & metrics which were easy to screenshot and upload to claude. this was a HITL but it gave claude a lot more insight into what's actually happening instead of guessing when the browser plays the game and has FPS drop.
LLMs do best when they can see what their code is doing. If you can't remove yourself from that cycle of testing you should at least optimize it so you can give rich errors.
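A rough sketch of what "call a tool and observe the success/failure" with rich errors can mean: wrap whatever check you already run so the agent gets the exit code and the tail of the output back as structured data. `run_check` and the field names are hypothetical, not part of any agent framework.

```python
import subprocess
import sys


def run_check(cmd, timeout=60, tail_lines=20):
    """Run a verification command and return a structured result.

    Only the last tail_lines of each stream are kept so a noisy failure
    does not flood the agent's context.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {
            "ok": False,
            "exit_code": None,
            "stdout_tail": "",
            "stderr_tail": f"timed out after {timeout}s",
        }

    def tail(text):
        return "\n".join(text.splitlines()[-tail_lines:])

    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout_tail": tail(proc.stdout),
        "stderr_tail": tail(proc.stderr),
    }


# Stand-in for a real test command: a child process that fails with a message.
failing = run_check(
    [sys.executable, "-c", "import sys; print('boom', file=sys.stderr); sys.exit(3)"]
)
passing = run_check([sys.executable, "-c", "print('all good')"])
```

In a real setup the command would be your test runner or headless simulation instead of the inline snippets used here.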
Someone drank the kool-aid.
> It reminds me of the doctor I saw last week at the medical clinic who spends 10% of his time diagnosing the patient and the other 90% stabbing his keyboard - one key at a time - for 10 minutes, only to write 3 sentences.
Correction: a pompous asshole drank the kool-aid.
Just means their interface and workflow are bad and need to be improved though, not that the doctor needs to be removed from the process altogether.