In a discussion here on HN about why a regulation passed 15 years ago was not as general as it could have been, I speculated [1] that it could be that the technology at the time was not up to handling the general case and so they regulated what was feasible at the time.
A couple hours later I checked the discussion again and a couple people had posted that the technology was up to the general case back then and cheap.
I asked an LLM to see if it could dig up anything on this. It told me it was due to technological limits.
I then checked the sources it cited to get some details. Only one source it cited actually said anything about technology limits. That source was my HN comment.
I mentioned this at work, and a coworker mentioned that he had made a Github comment explaining how he thought something worked on Windows. Later he did a Google search about how that thing worked, and the LLM thingy that Google puts at the top of search results said that the thing worked the way he thought it did. But checking the cites, he found that was based on his Github comment.
I'm half tempted to stop asking LLMs questions of the form "How does X work?" and instead tell them "Give me a list of all the links you would cite if someone asked you how X works?".
Essentially, you're asking the LLM to do research and categorize/evaluate that research instead of just giving you an answer. The "work" of accessing, summarizing, and valuing the research yields a more accurate result.
I love the grounding back to ~“well even a human would be bad at this if they did it the current LLM way.”
Bringing things back to ground truth human processes is something that is surprisingly unnatural for me to do. And I know better, and I preach doing this, and I still have a hard time doing it.
I know far better, but apparently it is still hard for me to internalize that LLMs are not magic.
They have no concept of truth or validity, but the frequency of inputs into their training data provides a kind of pseudo check and natural approximation to truth, as long as frequency and relationships in the training data also have some relationship to truth.
For a lot of textbook coding type stuff that actually holds: frameworks, shell commands, regexes, common queries and patterns. There's lots of it out there, and generally the more common form carries some measure of validity.
My experience though is that on niche topics, sparse areas, topics that humans are likely to be emotionally or politically engaged with (and therefore not approximate truth), or things that are recent and therefore haven't had time to generate sufficient frequency, they can get thrown off. And of course they also have no concept of whether what they are finding or reporting is true or not.
This also explains why they have trouble with genuinely new programming, as opposed to reimplementing frameworks or common applications: they lack the frequency-based or probabilistic grounding to truth, and the new combinations of libraries and code lead to places of relative sparsity in their weights that leave them unable to function.
The literature/marketing has taken to calling this hallucination, but it's just as easy to think of it as errors produced by probabilistic generation and/or sparsity.
Not doing this might actually cause bigger problems... Getting first-hand experience or even reputable knowledge about something is extremely expensive compared to gut-checking random info you come across. So the "cheap knowledge" may be worth it on balance.
Instead it often frames the answer as authoritative
You're not trying very hard then. Here, my first try: https://claude.ai/share/ef7764d3-6c5c-4d1a-ba28-6d5218af16e0
LLMs are useful for providing answers to more complex questions where some reasoning or integration of information is needed.
In these cases I mostly agree with the parent commenter. LLMs often come up with plausibly correct answers, then when you ask to cite sources they seem to just provide articles vaguely related to what they said. If you're lucky it might directly address what the LLM claimed.
I assume this is because what an LLM says is largely just made up; when you ask for sources it has to retroactively try to find sources to justify what it said, and it often fails and just links something which could plausibly be a source to back up its plausibly true claims.
That's just as fast (or faster) than the AI overview
To me this seemed like the relevant detail in the first excerpt.
But after more thought I realize you were probably expecting the date of his election to prime minister which is fair! That’s probably what searchers would be looking for.
I wanted to use NotebookLM as a tool to ask back and forth when I was trying to understand stuff. It got the answer 90% right but also added a random format, sounding highly confident as if I asked the spec authors themselves.
It was easy to check the specs when I became suspicious and now my trust, even in "grounded" LLMs, is completely eroded when it comes to knowledge and facts.
A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect.
Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though and the URL seemed reasonable.
I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all the websites would have moved their internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM.
Fortunately, I could retrieve the old URLs on old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/...
These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I can focus my attention on higher level concerns, like does the new API design make sense? But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
Needs to clearly handle the large diffs they produce - anyone have any ideas?
git diff --color-moved=dimmed-zebra
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. It can also be enabled by default in the git config (diff.colorMoved). At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly; it makes it easier to spot the diffs.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the llm is making sure that I'm still paying attention.
Just before sending I noticed that it had moved the event date by one day. Luckily I caught it but it taught me that you never should blindly trust LLM output even with super simple tasks, no relevant context size, clear and simple one sentence prompt.
LLMs do the most amazing things, but they also sometimes screw up the simplest of tasks in the most unexpected ways.
Or maybe someone from XEROX has a better idea how to catch subtly altered numbers?
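A cheap guardrail, assuming you still have the pre-LLM text around, is to diff the digit sequences between the original and the LLM's output before sending anything out. A minimal sketch (the event text is just a placeholder):

    import re

    # matches dates, times, amounts, IDs: runs of digits joined by . / : -
    NUM = re.compile(r"\d+(?:[./:-]\d+)*")

    def number_diff(original: str, rewritten: str):
        """Return (dropped, invented): digit sequences present in only one of the texts."""
        nums_in = NUM.findall(original)
        nums_out = NUM.findall(rewritten)
        dropped = [n for n in nums_in if n not in nums_out]
        invented = [n for n in nums_out if n not in nums_in]
        return dropped, invented

    dropped, invented = number_diff(
        "The event takes place on 2024-06-14 at 19:00.",
        "The event takes place on 2024-06-15 at 19:00.",  # the silently shifted date
    )
    print("dropped:", dropped)    # ['2024-06-14']
    print("invented:", invented)  # ['2024-06-15']

It won't catch everything, but it flags exactly the class of silent numeric edits described above.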
This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say they had this problem 30-40% of the time. As time has gone on, they have gotten so much better. But they still make this kind of mistake -- let's just say -- 5% of the time.
The thing is, it's almost more dangerous to make the mistake rarely. Because now people aren't constantly looking for it.
You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information.
For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
It's like it has ADHD and forgets or gets distracted in the middle.
And the reason for that is that LLMs don't have memory and just process tokens, so as they keep going over the list the context becomes bigger with more irrelevant information, and they can lose track of the reason they are doing what they are doing.
I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
But obviously sometimes larger refactors aren't easy to implement in bash.
In a recent YouTube interview Karpathy claimed that LLMs have a lot more "working memory" than a human:
https://www.youtube.com/watch?v=hM_h0UA7upI&t=1306s
What I assume he's talking about is internal activations, such as those stored in the KV cache, that have the same lifetime as tokens in the input, but this really isn't the same as "working memory" since these are tied to the input and don't change.
What it seems an LLM would need to do better at these sort of iterative/sequencing tasks would be a real working memory that had more arbitrary task-duration lifetime and could be updated (vs fixed KV cache), and would allow it to track progress or more generally maintain context (english usage - not LLM) over the course of a task.
I'm a bit surprised that this type of working memory hasn't been added to the transformer architecture. It seems it could be as simple as a fixed (non shifting) region of the context that the LLM could learn to read/write during training to assist on these types of task.
An alternative to having embeddings as working memory is to use an external file of text (cf a TODO list, or working notes) for this purpose which is apparently what Claude Code uses to maintain focus over long periods of time, and I recently saw mentioned that the Claude model itself has been trained to use read/write to this sort of text memory file.
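A minimal sketch of that external-file pattern, with a hypothetical call_llm placeholder standing in for whatever model API you use; the notes file is re-read and re-written on every step, so progress survives even as the chat context fills up:

    from pathlib import Path

    NOTES = Path("working_notes.md")  # hypothetical scratchpad the model reads/writes each turn

    def call_llm(prompt: str) -> str:
        """Placeholder for your actual model call."""
        raise NotImplementedError

    def run_step(task: str) -> str:
        notes = NOTES.read_text() if NOTES.exists() else "(no notes yet)"
        prompt = (
            f"Task: {task}\n\n"
            f"Your working notes so far:\n{notes}\n\n"
            "Do the next step only, then output your updated notes "
            "between <notes> and </notes> tags."
        )
        reply = call_llm(prompt)
        if "<notes>" in reply and "</notes>" in reply:
            NOTES.write_text(reply.split("<notes>")[1].split("</notes>")[0].strip())
        return reply

The interesting part is that the "memory" lives outside the context window entirely, so it survives resets and compaction.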
I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well but they can throw a wrench into your system that will cost you more if you don't catch it.
I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code which did some wonky hacky like stuff to achieve something simple. I checked the docs and said feature (URL Rewriting) was obviously not deprecated. When I asked how they knew it was deprecated they said Chat GPT told them. So now they are fixing the fix chat gpt provided.
All the time
// fake data. in production this would be real data
... proceeds to write sometimes hundreds of lines of code to provide fake data
"sure thing, I'll add logic to check if the real data exists and only use the fake data as a fallback in case the real data doesn't exist"
I made some charts/dashboards in HA and was watching it in the background for a few minutes and then realized that none of the data was changing, at all.
So I went and looked at the code and the entire block that was supposed to pull the data from the device was just a stub generating test data based on my exact mock up of what I wanted the data it generated to look like.
Claude was like, “That’s exactly right, it’s a stub so you can replace it with the real data easily, let me know if you need help with that!” And to its credit, it did fix it to use actual data, but re-reading my original prompt, it was somewhat baffling to think it could have been interpreted as wanting fake data, given I explicitly asked it to use real data from the device.
But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
Anyone care to wager if anthropic is red teaming in production on paying users?
This was allowed to go to master without "git diff" after Codex was done?
The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests' so when I looked at the agent console in the IDE everything seemed fine collapsed, i.e. 'Tests ran successfully'.
Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
Luckily I've grown a preference for statically typed, compiled, functional languages over the years, which eliminates an entire class of bugs AND hallucinations by catching them at compile time. Using a language that doesn't support null helps too. The quality of the code produced by agents (claude code and codex) is insanely better than when I need to fix some legacy code written in a dynamic language. You'll sometimes catch the agent hallucinating and continuously banging its head against the wall trying to get its bad code to compile. It seems to get more desperate and may eventually figure out a way to insert some garbage to get it to compile, or just delete a bunch of code and paper over it... but it's generally very obvious when it does this, as long as you're reviewing. Combine this with git branches and a policy of frequent commits for greatest effect.
You can probably get most of the way there with linters and automated tests with less strict dynamic languages, but... I don't see the point for new projects.
I've even found Codex likes to occasionally make subtle improvements to code located in the same files but completely unrelated to the current task. It's like some form of AI OCD. Reviewing diffs is kind of essential, so using a foundation that reduces the size of those diffs and increases readability is IMO super important.
That said, your comment made me realize I could be using “git apply” more effectively to review LLM-generated changes directly in my repo. It’s actually a neat workflow!
In my view these perfectly serve the purpose of encouraging you to keep burning tokens for immediate revenue as well as potentially using you to train their next model at your expense.
It’s very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that.
> why don't we stop pretending that we as users are stupid and don't know how to use them
This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts!
If you're trying to one shot stuff with a few sentences then yes you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things. Were they idiots? If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this.
Let's do a quick overview of recent chats for me:
* Identifying and validating a race condition in some code
* Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code
* Identifying an async bug two good engineers couldn't find in a codebase they knew well
* Finding performance issues that had gone unnoticed
* Digging through synapse documentation and github issues to find a specific performance related issue
* Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed
* Building a bunch of UI stuff for a short term contract I needed, saving me a bunch of hours and the client money
* Going through funding opportunities and matching them against a charity I want to help in my local area
* Building a search integration for my local library to handle my kids reading challenge
* Solving a series of VPN issues I didn't understand
* Writing a lot of astro related python for an art project to cover the loss of some NASA images I used to have access to.
> the folks pushing or better said
If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well.
> Trust me, it sucks
Ok. I'm convinced.
> and under-delivers.
Compared to what promise?
> I am sure we will see those 10x apps rolling in soon, right?
Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done).
> It's only been like 4 years since the revolutionary magic machine was announced.
It's been less than 3 years since chatgpt launched, which if you'd been in the AI sphere as long as I have (my god it's 20 years now) absolutely was revolutionary. Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately (as long as you didn't care about cost) to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea that I can get a summary from input without training is already impressive, and then being able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field.
If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote.
Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question:
Why are you paying for something that solves literally no problems for you?
The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s
Quite frankly, not being able to discuss the pros and the cons of a technology with other engineers absolutely hinders innovation. A lot of discoveries come out of mistakes.
Stop being so small minded.
Perhaps you’ve been sold a lie?
Sadly it seems the best use-case for LLMs at this point is bamboozling humans.
On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours.
Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators.
We have OpenAI describing gpt5 as having PhD-level intelligence and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%.
I say they are being sold as a magical do everything tool.
Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model.
"This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model."
https://jeremyberman.substack.com/p/how-i-got-the-highest-sc...
The things they're saying are technically correct, the best kind of correct. The models beat human PhDs on certain benchmarks of knowledge and reasoning. They may write 70% of the easiest code in some specific scenario. It doesn't matter. They're useful tools that can make you slightly more productive. That's it.
When you see on tv that 9 out of 10 dentists recommend a toothpaste what do you do? Do you claim that brushing your teeth is a useless hype that's being pushed by big-tooth because they're exaggerating or misrepresenting what that means?
Only after schizophrenic dentists go around telling people that brushing their teeth is going to lead to a post-scarcity Star Trek world.
Did the LLM have this?
You have to be able to see what this thing can actually do, as opposed to what it can’t.
Edit: I think I'm just regurgitating the article here.
> You can never trust the LLM to generate a url
This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs.
This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter.
> I'm sick of hearing "you're doing it wrong"
That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence.
> when the real answer is "this tool can't do that."
That is what they said.
Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing.
> Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed".
But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't.
"You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
Unless of course the management says "from now on you will be running with scissors and your performance will increase as a result".
If you expect one shot you will get a lot of bad surprises.
If it was a project written by humans I'd say they were crazy for going so hard on testing.
The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
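For the broken-link case specifically, a check like this in CI is cheap. A sketch using requests (the URL list is a placeholder; in practice you'd extract the URLs from the code or content under review):

    import requests

    URLS = [
        "https://example.com/this-article-is-about-foobar-123456/",  # placeholder URLs
        "https://example.com/another-page/",
    ]

    def check_links(urls):
        """Return (url, reason) pairs for every link that doesn't resolve cleanly."""
        broken = []
        for url in urls:
            try:
                # HEAD is usually enough; some servers may need a GET fallback
                resp = requests.head(url, allow_redirects=True, timeout=10)
                if resp.status_code >= 400:
                    broken.append((url, resp.status_code))
            except requests.RequestException as exc:
                broken.append((url, str(exc)))
        return broken

    if __name__ == "__main__":
        for url, reason in check_links(URLS):
            print(f"BROKEN: {url} ({reason})")

Wire that into the pipeline and the hallucinated-URL story upthread gets caught before deployment instead of after.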
I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year.
When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves.
Sure the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy paste error like this, then so should they.
The failure you're describing is that your coworkers are not doing their job.
And if you accept "the LLM did that, not me" as an excuse then the failure is on you and it will keep happening.
Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
Very few humans fill in their task with made up crap then lie about it - I haven't met any in person. And if I did, I wouldn't want to work with them, even if they work 24/7.
Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is I don't believe we should normalise this standard of quality for production work.
Can you spot the next problem introduced by this?
The most important task for the human using the agent is to provide the right context. "Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context.
(BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents need to run commands like 'npm test' in a sub-directory, they almost never get it right the first time)
But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't.
If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years.
Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now.
But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one.
I was running into this constantly on one project with a repo split between a Vite/React front end and .NET backend (with well documented structure). It would sometimes go into panic mode after some npm command didn’t work repeatedly and do all sorts of pointless troubleshooting over and over, sometimes veering into destructive attempts to rebuild whatever it thought was missing/broken.
I kept trying to rewrite the section in CLAUDE.md to effectively instruct it to always first check the current directory to verify it was in the correct $CLIENT or $SERVER directory. But it would still sometimes forget randomly which was aggravating.
I ended up creating some aliases like “run-dev server restart” “run-dev client npm install” for common operations on both server/client that worked in any directory. Then added the base dotnet/npm/etc commands to the deny list which forced its thinking to go “Hmm it looks like I’m not allowed to run npm, so I’ll review the project instructions. I see, I can use the ‘run-dev’ helper to do $NPM_COMMAND…”
It’s been working pretty reliably now but definitely wasted a lot of time with a lot of aggravation getting to that solution.
Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase"
Of course, now I have to check if someone has done this already.
I'd bet that most of the improvement in Copilot-style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try.
Just like with humans it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt.
It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it’s pretty rough compared to when I’m using rails on my side projects. Dotnet requires a lot more prompting and hand holding, both due to its complexity but also due to how much more verbose it is.
We started with building the best code retrieval and built an agent around it.
Also, the agents.md website seems to mostly list README.md-style 'how do I run this instructions' in its example, not stylistic guidelines.
Furthermore, it would be nice if these agents add it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?)
I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.
Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.
This expression tree parser (typescript to sql query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time.
I did not have to baby sit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes.
How does the API look completely different for pg and sqlite? Can you share an example?
It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful.
Add: What you've mentioned is largely incorrect. But in any case, it is a query builder; meaning an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.
Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.
Maybe I should use the same example repeated for clarity. Let me do that.
Edit: Fixed. Thank you.
Question - this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e. g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?
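For context, the regex approach the question has in mind would look roughly like this for the simplest case (a hypothetical Python sketch just to illustrate, not tinqerjs code):

    import re

    # handles only the trivial projection form: "x => x.foo"
    ARROW_PROJECTION = re.compile(r"^\s*(\w+)\s*=>\s*\1\.(\w+)\s*$")

    def parse_projection(source: str) -> dict:
        """Turn 'x => x.foo' into {'op': 'project', 'field': 'foo', 'target': 'x'}."""
        m = ARROW_PROJECTION.match(source)
        if not m:
            raise ValueError(f"unsupported expression: {source!r}")
        return {"op": "project", "field": m.group(2), "target": m.group(1)}

    print(parse_projection("x => x.foo"))  # {'op': 'project', 'field': 'foo', 'target': 'x'}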
I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack, and attempting to find the calling JS file.
The reason better turn to "It can do stuff faster than I ever could if I give it step by step high level instructions" instead.
I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects".
It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work.
We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.
We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...
> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)
But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.
So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.
But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.
You still cannot trust LLMs. And that is a problem.
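One way to check such a claim yourself is to ask sympy directly whether the two expressions match. A minimal sketch, with placeholder expressions standing in for the real ones:

    import sympy as sp

    x, y = sp.symbols("x y")

    original = x**2 + 2*x*y + y**2   # the expression you asked about (placeholder)
    claimed = (x + y)**2             # the factorization the LLM claimed (placeholder)

    # The two are algebraically identical iff their difference simplifies to zero
    difference = sp.simplify(original - claimed)
    print("claim holds" if difference == 0 else f"claim is wrong, difference: {difference}")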
Obviously the generated code drifts a little from the deleted code.
I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.
However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.
AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But, it can probably replace bad and mediocre devs. Even today.
In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and it started outperforming these guys. We had to let two go. And third one quit on his own.
We still hire devs. But have become very reluctant to hire junior devs. And will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason.
Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their heads firmly stuck in sand.
In the early US history approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now. But we have a lot more food and a larger variety available. Technology made that possible.
It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast do the tools improve.
Sure, but the food is less nutritious and more toxic.
Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point.
However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/github issues to Github Copilot. Copilot generates a PR in a few minutes that other devs can review before merging. These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
Where do your senior devs come from?
That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
It always asks me questions, and I've always benefited from it. It will subtly point out things I hadn't thought about, etc.
But most of the time, I find that the outputs are nowhere near the effect of just doing it myself. I tried Codex Code the other day to write some unit tests. I had a few set up and wanted to use it (because mocking the data is a pain).
It took about 8 attempts, I had to manually fix code, it couldn't understand that some entities were obsolete (despite being marked and the original service not using them). Overall, was extremely disappointed.
I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and help guide you to a solution, like Stack Overflow used to do (without the snark).
And the human prompting, of course. It takes good sw engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc) with codebase instructions, best practices, etc etc.
So it's not an "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise "it's not what your LLM can do for you, but what can you do for your LLM"
In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career.
So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least.
Ok, this example is probably too extreme; replace the knife with an industrial machine that cuts bread vs a human with a knife. Nobody would buy that machine either if it worked like that.
Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there.
But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens).
I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky.
If a knife slices bread like a normal human at p50, it's not a very good knife.
If a knife slices bread like a professional chef at p50, it's probably a very decent knife.
I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs.
The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12 year old copies-and-pastes better than top coding agents.
The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state of the art LLM compare to a p50 developer in this regard?
There is not that much copy/paste that happens as part of refactoring, so it leans on just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful; at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it.
> And it’s not just how they handle code movement -- their whole approach to problem-solving feels alien too.
It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes.
To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right.
There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation.
So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good.
> To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case and there things like codemod/fastmod are very effective if you tell an agent to use it. They just don't reach there.
- Get rid of these warnings "...": captures and silences warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything, often this means using the agent actually is slower than just coding...
“Fix the issues causing these warnings”
Retrospectively fixing a test so that it passes given the current code is a complex task; instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it.
“The argument passed is now wrong” - you’re asking the LLM to infer that there’s a problem somewhere else, and to find and fix it.
When you’re asking an LLM to do something, you have to be very explicit about what you want it to do.
buffer_copy: Copy specific line ranges from files to agent's private buffer
buffer_paste: Insert/append/replace those exact bytes in target files
buffer_list: See what's currently buffered
So the agent can say "copying lines 50-75 from auth.py" and the MCP server handles the actual file I/O. No token generation, no hallucination, byte-for-byte accurate. Doesn't touch your system clipboard either.
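The buffer logic itself is tiny. A minimal sketch of how tools like these could be implemented (hypothetical, not clippy's actual code, with the MCP plumbing left out):

    from pathlib import Path

    # name -> exact lines, never round-tripped through the model
    _buffers: dict[str, list[str]] = {}

    def buffer_copy(name: str, path: str, start: int, end: int) -> str:
        """Copy 1-indexed lines start..end from path into a named buffer."""
        lines = Path(path).read_text().splitlines(keepends=True)
        _buffers[name] = lines[start - 1:end]
        return f"buffered {len(_buffers[name])} lines from {path}"

    def buffer_paste(name: str, path: str, at: int) -> str:
        """Insert the buffered lines into path before 1-indexed line `at`."""
        lines = Path(path).read_text().splitlines(keepends=True)
        lines[at - 1:at - 1] = _buffers[name]
        Path(path).write_text("".join(lines))
        return f"pasted {len(_buffers[name])} lines into {path}"

    def buffer_list() -> dict[str, int]:
        """Show buffered entries and their line counts."""
        return {name: len(lines) for name, lines in _buffers.items()}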
The MCP server already included tools to copy AI-generated content to your system clipboard - useful for "write a Python script and copy it" workflows.
(Clippy's main / original purpose is improving on macOS pbcopy - it copies file references instead of just file contents, so you can paste actual files into Slack/email/etc from the terminal.)
If you're on macOS and use Claude or other MCP-compatible agents: https://github.com/neilberkman/clippy
brew install neilberkman/clippy/clippy
Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized.
Whenever I've attempted to actually do the whole "agentic coding" thing by giving it a complex task, breaking it down into sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc., it hasn't a single fucking time done the thing it was supposed to do to completion. It requires so much manual reviewing, backtracking, and nudging that it becomes more exhausting than just doing most of the work myself and pushing the LLM to do the tedious parts.
It does work sometimes to use for analysis, and asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart.
There's a fine line to walk, and I only see comments on the extremes online, it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on which require more context than I could ever fit properly beforehand to these robots.
One rule of thumb I use, is if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested.
Usually those two groups correlate very well with liking LLMs: some people will ask Claude to create a UI with React and see the mess it generated (even if it mostly works) and the edge cases it left out and comment in forums that LLMs don't work. The other group of people will see the UI working and call it a day without even noticing the subtleties.
Both are right.
How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff?
It might be a meme project, but it's still impressive as hell we're here.
I learned about this from a yt content creator that took that repo, asked cc to "make it so that variables can be emojis", and cc did that, $5 later. Pretty cool.
Impressive nonetheless.
There's a yt channel where the sessions were livestreamed. It's in their FAQ. I haven't felt the need to check them, but there are 10-12h sessions in there if you're that invested in proving that this is "so far outside of any capabilities"...
A brief look at the commit history should show you that it's 99.9% guaranteed to be written by an LLM :)
When's the last time you used one of these SotA coding agents? They've been getting better and better for a while now. I am not surprised at all that this worked.
This morning :)
>"so far outside of any capabilities"
Anthropic was just bragging last week about being able to code without intervention for 30 hours before completely losing focus. They hailed it as a new benchmark. It completed a project that was 11k lines of code.
The max unsupervised run that GPT-5-Codex has been able to pull off is 7 hours.
That's what I mean by the current SOTA demonstrated capabilities.
https://x.com/rohanpaul_ai/status/1972754113491513481
And yet here you have a rando who is saying that he was able to get an agent to run unsupervised for 100x longer than what the model companies themselves have been able to do and produce 10x the amount of code--months ago.
I'm 100% confident this is fake.
>There's a yt channel where the sessions were livestreamed.
There are a few videos that long, not 3 months worth of videos. Also I spot checked the videos and the framerate is so low that it would be trivial to cut out the human intervention.
>guaranteed to be written by an LLM
I don't doubt that it was 99.9% written by an LLM, the question is whether he was able to run unsupervised for 3 months or whether he spent 3 months guiding an LLM to write it.
That would mean that every few hours the agent starts fresh, does the inspect repo thing, does the plan for that session, and so on. That would explain why it took it ~3 months to do what a human + ai could probably do in a few weeks. That's why it doesn't sound too ludicrous for me. If you look at the repo there are a lot of things that are not strictly needed for the initial prompt (make a programming language like go but with genz stuff, nocap).
Oh, and if you look at their discord + repo, lots of things don't actually work. Some examples do, some segfault. That's exactly what you'd expect from "running an agent in a loop". I still think it's impressive nonetheless.
The fact that you are so incredulous (and I get why that is, scepticism is warranted in this space) is actually funny. We are on the right track.
If Anthropic thought they could produce anything remotely useful by wiping the context and reprompting every few hours, they would be doing it. And they’d be saying “look at this we implemented hard context reset and we can now run our agent for 30 days and produce an entire language implementation!”
In 3 months or 300 years of operating like this a current agent being freshly reprompted every few hours would never produce anything that even remotely looked like a language implementation.
As soon as its context was poisoned with slightly off topic todo comments it would spin out into writing a game of life implementation or whatever. You’d have millions of lines of nonsense code with nothing useful after 3 months of that.
The only way I see anything like this doing anything approaching “useful” is if the outer loop wipes the repo on every reset as well, and collects the results somewhere the agent can’t access. Then you essentially have 100 chances to one shot the thing.
But at that point you just have a needlessly expensive and slow agent.
Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched.
Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive.
Absolutely. I do not underestimate this.
What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision.
> the following prompt was issued into a coding agent:
> Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang?
> and then the coding agent was left running AFK for months in a bash loop
Running for 3 months and generating a working project this large with no human intervention is so far outside of the capabilities of any agent/LLM system demonstrated by anyone else that the most likely explanation is that the promoter is lying about it running on its own for 3 months.
I looked through the videos listed as “facts” to support the claims and I don’t see anything longer than a few hours.
I've worked with both and finished my Vim syntax highlighters down to the keywords.
And getting them to find the 'stmt', 'expr_stmt', and 'primary_stmt_expr' semantic production rules (one is a Bison-generated .y file, the other is hand-rolled)? Both make too many assumptions despite explicitly instructing them to do "verification & validation" of a pathway given a sample statement.
Only Google Gemini barely cut the mustard.
Another case is making assumptions (upon grilling it about its assumptions, I've since learned that it was looking at old websites with archaic info). Asking it to stick with the latest nftables v1.1.4 (or even the v1.1.5 head) does not help, because old webpages gave obsoleted nftables syntax.
Don't expect an LLM to navigate S-expressions, recreate an abstract syntax tree 4 layers or deeper, transition a state machine beyond 8 states, or interpret Bison parsers reliably any time soon.
My only regret is that none of them will take the LLM learning from me, the expert so that others may benefit.
On the other hand, teaching the model to be unsure and ask questions, requires the training loop to break and bring a human input in, which appears more difficult to scale.
The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify. They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they already have decided they think are right!
What is really needed is a tree of problems which appear identical at first glance, but the issue and the solution is something that is one of many possibilities which can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis then, if the hypothesis is shown to be correct, then finally implementing the solution.
That's a much more difficult training set to construct.
The editing issue, I feel, needs something more radical. Instead of the current methods of text manipulation, I think there is scope to have a kind of output position encoding for a model to emit data in a non-sequential order. Again this presents another training data problem: there are limited natural sources to work from showing programming in the order a programmer types it. On the other hand I think it should be possible to do synthetic training examples by translating existing model outputs that emit patches, search/replaces, regex mods etc. and translate those to a format that directly encodes the final position of the desired text.
At some stage I'd like to see if it's possible to construct the model's current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information, given the order of emission and the embeddings themselves, to reconstruct a piecemeal-generated program.
I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change.
I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
In such cases, I specifically instruct LLMs to "only show the lines you would change" and they are very good at doing just that and eliding the rest. However, I usually do this after going through a couple of rounds of what you just described :-)
I partly do this to save time and partly to avoid using up more tokens. But I wonder if it is actually saving tokens given that hidden "thinking tokens" are a thing these days. That is, even if they do elide the unchanged code, I'm pretty sure they are "reasoning" about it before identifying only the relevant tokens to spit out.
As such, that does seem different from copy-and-paste tool use, which I believe is also solved. LLMs can already identify when code changes can be made programmatically... and then do so! I have actually seen ChatGPT write Python code to refactor other Python code: https://www.linkedin.com/posts/kunalkandekar_metaprogramming...
I had to fix a minor bug in its Python script to make it work, but it worked and was a bit of a <head-explode> moment for me. I still wonder if this is part of its system prompt or an emergent tool-use behavior. In either case, copy-and-paste seems like a much simpler problem that could be solved with specific prompting.
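Not the script from that post, just a minimal sketch of the general idea using only the standard library: have code rewrite code programmatically instead of retyping it. The variable names and sample source are made up.

    import ast

    class RenameVariable(ast.NodeTransformer):
        """Rewrite every reference to `old` as `new` in a parsed module."""
        def __init__(self, old: str, new: str):
            self.old, self.new = old, new

        def visit_Name(self, node: ast.Name) -> ast.AST:
            if node.id == self.old:
                node.id = self.new
            return node

    source = "total = 0\nfor x in items:\n    total += x\nprint(total)"
    tree = RenameVariable("total", "running_sum").visit(ast.parse(source))
    print(ast.unparse(tree))  # Python 3.9+; emits the refactored source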
AI labs have already shipped changes related to this problem - most notably speculative decoding, which lets you provide the text you expect to see come out again and speeds it up: https://simonwillison.net/2024/Nov/4/predicted-outputs/
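As I understand the feature from that post, the request looks roughly like the sketch below: you pass the code you expect to come back mostly unchanged as a "prediction", and matching spans are accepted quickly instead of being regenerated token by token. The file name, prompt, and model choice are placeholders.

    from openai import OpenAI

    client = OpenAI()
    original_code = open("utils.py").read()  # placeholder file

    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption; see the linked post
        messages=[{
            "role": "user",
            "content": "Rename the function `fetch` to `fetch_with_retry` "
                       "in this file and change nothing else:\n\n" + original_code,
        }],
        # The prediction is the text we expect to see echoed back mostly unchanged.
        prediction={"type": "content", "content": original_code},
    )
    print(response.choices[0].message.content)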
They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents.
Hopefully they'll figure out a copy/paste mechanism as part of that work.
I decided to pull the source code and fix this myself. It's written in Swift which I've used very little before, but this wasn't gonna be too complex of a change. So I got some LLMs to walk me through the process of building CLI apps in Xcode, code changes that need to be made, and where the build artifact is put in my filesystem so I could try it out.
I was able to get it to compile, navigate to my compiled binary, and run it, only to find my changes didn't seem to work. I tried everything: asking different LLMs to see if they could fix the code, dumping the binary's metadata to confirm the creation date was being updated when I compiled, etc. Generally, when I'd paste the code to an LLM and ask why it didn't work, it would assert the old code was indeed flawed and my change needed to be done in X manner instead. Even the print statements I added wouldn't show up, and the LLM would explain that some complex multithreading runtime gotcha kept execution from reaching them.
After way too much time trouble-shooting, skipping dinner and staying up 90 minutes past when I'm usually in bed, I finally solved it - when I was trying to run my build from the build output directory, I forgot to put the ./ before the binary name, so I was running my global install from the developer and not the binary in the directory I was in.
Sure, rookie mistake, but the thing that drives me crazy with an LLM is that if you give it some code and ask why it doesn't work, it seems to NEVER suggest the code should actually be working; instead it will always say the old code is bad and here's the perfect fixed version. It'll even make up reasons why the old code shouldn't work when it should, like it did with my print statements.
Especially when surrounded by people who swear LLMs can really be gamechanging on certain tasks, it's really hard to just keep doing things by hand (especially if you have the gut feeling that an LLM can probably do rote pretty well, based on past experience).
What kind of works for me now is what a colleague of mine calls "letting it write the leaf nodes in the code tree". So long as you take on the architecture, high-level planning, schemas, and all the important bits that require thinking, chances are it can execute writing the code successfully by following your idiot-proof blueprint. It's still a lot of toil and tedium, but perhaps still beats mechanical labor.
Why do this to yourself? Do you get paid more if you work faster?
The second issue is that LLMs do not learn much of the high-level contextual relationships between pieces of knowledge. This can be improved by introducing more such patterns into the training data, and current LLM training is already doing a lot of this. I don't think it will be a problem in the next few years.
You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential, like simple childlike curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85, but either answer won't alter anything I do today.
I suspect some people (who need to read the full report) dump thousand page long reports into LLM, read the first ten words of the response and pretend they know what the report says and that is scary.
Fortunately, as devs, this is our main loop. Write code, test, debug. And it's why people who fear AI-generated code making its way into production and causing errors make me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs happening? Guess what? It's the exact same process with AI-generated code.
For those curious, the answer is 97.
ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested.
It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame because it'll happily and effectively write the docs that don't exist but you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
First it gets an error because bash doesn’t understand \
Then it gets an error because /b doesn’t work
And as LLMs don’t learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b )) before it figures out how to list files
If it was an actual coworker, we’d send it off to HR
I am guessing this is because:
1. Most of the training material online references Unix commands.
2. Most Windows devs are used to GUIs for development (Visual Studio etc.), and GUIs are not as easy to train on.
Side note: Interesting thing I have noticed in my own org is that devs with Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line.
I'd like to see what happens with better refactoring tools; I'd make a bunch more mistakes myself if I were copying, retyping, or using awk. If they want to rename something, they should be able to use the same tooling the rest of us get.
Asking questions is a good point, but that's partly a matter of prompting, and I think the move to more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and money to build things, so the economics favours getting it right first time. As the time comes down and the cost drops to near zero, the balance changes.
There are also other approaches: clarify more of what you want and how to do it up front, break that down into tasks, then let it run with those (spec kit). This is an interesting area.
> LLMs don’t copy-paste (or cut and paste) code.
The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with the LLM itself; it's in the layer on top.
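For example, a copy/paste tool at that layer could be as dumb as the sketch below; the name and argument shape are invented here, and a real agent harness would expose it through its own tool schema.

    # Hypothetical agent tool: move lines verbatim instead of having the model retype them.
    def copy_lines(src_path: str, start: int, end: int, dst_path: str, before_line: int) -> str:
        with open(src_path) as f:
            src = f.readlines()
        snippet = src[start - 1:end]                      # 1-indexed, inclusive range
        with open(dst_path) as f:
            dst = f.readlines()
        dst[before_line - 1:before_line - 1] = snippet    # paste before this line
        with open(dst_path, "w") as f:
            f.writelines(dst)
        return f"copied {len(snippet)} lines from {src_path} to {dst_path}"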
> Good human developers always pause to ask before making big changes or when they're unsure. [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it.
Agreed - LLMs don't know how to back track. The recent (past year) improvements in thinking/reasoning do improve in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
I think we can't trivialize adding good cut/copy/paste tools though. It's not like we can just slap those tools on the topmost layer (ex, on Claude Code, Codex, or Roo) and it'll just work.
I think that a lot of reinforcement learning that LLM providers do on their coding models barely (if at all) steer towards that kind of tool use, so even if we implemented those tools on top of coding LLMs they probably would just splash and do nothing.
Adding cut/copy/paste probably requires a ton of very specific (and/or specialized) fine tuning with not a ton of data to train on -- think recordings of how humans use IDEs, keystrokes, commands issued, etc etc.
I'm guessing Cursor's Autocomplete model is the closest thing that can do something like this if they chose to, based on how they're training it.
Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions or ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does.
And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
Many agents break down not because the code is too complex, but because invisible, “boring” infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you're fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change.
The canonical products were from JetBrains. I haven't used JetBrains in years. But I would be really surprised if the combination of LLMs + a complete understanding of the codebase through static analysis (like JetBrains was doing well over a decade ago) + the ability to call a "refactor tool" didn't produce better results.
[1] before I get “well actuallied” yes I know if you use reflection all bets are off.
Dev tools were not bad at all back then. In a few ways they were better than today, like WYSIWYG GUI design which we have wholly abandoned. Old school Visual Basic was a crummy programming language but the GUI builder was better than anything I’m familiar with for a desktop OS today.
Also, if you want it to pause and ask questions, you need to offer that through tools (Manus does that, for example). I have an MCP server that does this, and surprisingly I get a lot of questions; if you prompt for it, it will do it. But the current push is for full automation, and that's why it's not there. We are far better off in a supervised, step-by-step mode. MCP already has elicitation, but having a tool ask questions requires a UI that lets you send the input back.
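A framework-agnostic sketch of such a tool, with the registration mechanics left to whatever agent stack you use; only the blocking-on-a-human part is shown, and the function name is invented.

    # Hypothetical "ask the human" tool: the agent calls this at a fork in the road
    # and the loop blocks until a person answers, instead of guessing and pressing on.
    def ask_user(question: str, options: list[str] | None = None) -> str:
        print(f"\nAgent question: {question}")
        if options:
            for i, opt in enumerate(options, 1):
                print(f"  {i}. {opt}")
        return input("Your answer: ").strip()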
The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
I often wish that, instead of automatically starting to work on the code (even if you hit enter/send by accident), the models would ask for clarification first. The models assume a lot and will just spit out code.
I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think.
Others have mentioned that you can fix all this by providing a guide to the model: how it should interact with you and what the answers should look like. But, still, it'd be nice to have it a bit more human-like on this aspect.
Even something as simple as renaming a variable is often safer and easier when done through the editor’s language server integration.
I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform.
[1] https://martinfowler.com/books/refactoring.html [2] https://martinfowler.com/bliki/OpportunisticRefactoring.html [3] https://refactoring.com/catalog/
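To make the scale of those atomic steps concrete, here is roughly what Replace Magic Literal looks like; the function and constant names are made up for the example.

    # Before: the threshold 0.95 is a bare "magic" number.
    def should_alert(cpu_load: float) -> bool:
        return cpu_load > 0.95

    # After Replace Magic Literal: same behaviour, one named constant.
    CPU_ALERT_THRESHOLD = 0.95

    def should_alert(cpu_load: float) -> bool:
        return cpu_load > CPU_ALERT_THRESHOLD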
A rewrite usually implies starting from scratch, whether small or large: replacing existing implementations (of functions/methods/modules/whatever) with newly created ones.
Indeed, one can refactor a large codebase without actually rewriting much of substance, if anything at all.
Maybe one could claim that this is actually lots of micro-refactors, but that doesn't flow particularly well in communication. And if the sum total is not specifically a "rewrite", what collective noun should be used for all of these smaller refactorings? If one spent time making lots of smaller changes but not actually re-implementing anything, to me that's not a rewrite: the code has been refactored, even if it is a large piece of code with a lot of structural changes throughout.
Perhaps part of the issue in this context is that LLMs don't particularly refactor code anyhow; they generally rewrite (regenerate) it. Which is where many of the subtle issues described in other comments creep in: the kinds of issues a human wouldn't necessarily introduce when refactoring (e.g. a changed regex, changed dates, other changes to functionality, etc.).
When left to its own devices on tasks with little existing reference material to draw from, however, the quality and consistency suffers significantly and brittle, convoluted structures begin to emerge.
This is just my limited experience though, and I almost never attempt to, for example, vibe-code an entire greenfield mvp.
"Hey it wasn't what you asked me to do but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic because I have no comprehension of how users use the tool". "Um, ok but did you change the way notifications work like I asked". "Yes." "Notifications don't work anymore". "I'll get right on it".
I can run this experiment using ToolKami[0] framework if there is enough interest or if someone can give some insights.
A good intern is really valuable. An army of good interns is even more valuable. But interns are still interns, and you have to check their work. Carefully.
Ask the average high school or college student and I doubt they would fare better.
—
Just because this new contributor is forced to effectively "SSH" into your codebase and edit not even with vim but with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact that it is able to work within such constraints goes to show how much potential there is. It is already much better than a human at erasing the text and re-typing it from memory, and while it is a valid criticism that it needs to be taught how to move files, imagine what it is capable of once it starts to use tools effectively.
—
Recently, I observed an LLM flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal, trying to kill a process or check whether the port was in use in another terminal.
However, once I prompted the LLM to create a script for running all three processes concurrently, it was able to create that script, leverage it, and autonomously debug the tests, now way faster than I am able to. It has also saved any new human contributor from similar hours of flailing around. It's something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting an existing problem in our codebase that some of us got too used to.
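The helper script itself is trivial; something like this sketch, where the commands are placeholders for whatever the three processes actually are:

    import subprocess

    # Placeholder commands for the three processes the e2e run needs.
    COMMANDS = [
        ["npm", "run", "api"],
        ["npm", "run", "web"],
        ["npm", "run", "e2e"],
    ]

    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
    try:
        procs[-1].wait()          # block until the e2e runner finishes
    finally:
        for p in procs:
            if p.poll() is None:  # tear down anything still running
                p.terminate()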
So yes, LLMs make stupid mistakes, but so do humans; the thing is that LLMs can identify and fix them faster (and better, with proper steering).
Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What’s left behind is a codebase no one wants to work on, and you can’t hire for it either.
But maybe I’ve just worked with the wrong teams.
EDIT: Maybe this is just about trust. If you can’t bring yourself to trust code written by other human beings, whether it’s a package, a library, or even your own teammates, then of course you’re not going to trust code from an LLM. But that’s not really about quality, it’s about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
LLMs provide little of that; they make people lazy, juniors stay juniors forever, even degrading mentally in some respects. People need struggle to grow: someone who has had their hand held their whole life ends up disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy a life destroys humans and animals alike (many experiments have been done on that, with damning results).
There is much more, like hallucinations and the questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs, but the above should be enough for a start.
I suggest actually reading those conversations, not just skimming through them, this has been stated countless times.
Often I find myself cursing at the LLM for not understanding what I mean - which is expensive in lost time / cost of tokens.
It is easy to say: then just don't use LLMs. But in reality, it is not easy to break out of these loops of explaining, and it is extremely hard to judge when to stop trusting that the LLM will be able to finish the task.
I also find that LLMs consistently don't follow guidelines. E.g. never use coercions in TypeScript (it always sneaks a rogue `as` in somewhere), so I cannot trust the output and need to be extra vigilant when reviewing.
I use LLMs for what they are good at. Sketching up a page in React/Tailwind, sketching up a small test suite - everything that can be deemed a translation task.
I don't use LLMs for tasks that are reasoning heavy: Data modelling, architecture, large complex refactors - things that require deep domain knowledge and reasoning.
Me too. But in all these cases, sooner or later, I realized I made a mistake not giving enough context and not building up the discussion carefully enough. And I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here?
I still remember training a junior hire who started off with:
“Sorry, I spent five days on this ticket. I thought it would only take two. Also, who’s going to do the QA?”
After 6 months or so, the same person was saying:
“I finished the project in three weeks. I estimated four. QA is done. Ready to go live.”
At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It’s still running fine today, and I’ve spent maybe a single day maintaining it in the last two years.
Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs.
Personally, whenever I generate code with an LLM, I check every line before committing. I still don’t trust it as much as the people I trained.
That is not really relevant, is it? The LLM is not a human.
The question is whether it is still as efficient to use LLMs after spending huge amounts of time giving the context - or if it is just as efficient to write the code yourself.
> I still remember training a junior hire who started off with
Working with LLMs is not training junior developers - treating it as such is yet another resource sink.
Meanwhile, I think I will stick to getting tangible performance benefits out of its usage.
If an llm can't do sys admin stuff reliably, why do we think it can write quality code?
If the code change is something you would reasonably prefer to implement with a codemod (i.e. dozens-to-hundreds of small changes fitting a semantic pattern), Claude Code is not going to be able to make that change effectively.
However (!), CC is pretty good at writing the codemod.
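E.g. asking CC for a throwaway script along these lines and then reviewing the diff; the deprecated call and its replacement are invented for the example.

    # Hypothetical codemod: swap every call to a deprecated helper across the repo.
    import re
    from pathlib import Path

    PATTERN = re.compile(r"fetch_json\((?P<arg>[^)]*)\)")
    REPLACEMENT = r"http_get(\g<arg>).json()"

    for path in Path("src").rglob("*.py"):
        text = path.read_text()
        new_text = PATTERN.sub(REPLACEMENT, text)
        if new_text != text:
            path.write_text(new_text)
            print(f"rewrote {path}")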
For copy-paste, you made it sound like low-hanging fruit: why don't AI agents have copy/paste tools?
why overengineer? it's super simple
I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"
Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try and master an internal system I rarely use; I should instead ask someone who maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.
I do not believe that AI will magically overcome the Chesterton Fence problem in a 100% autonomous way.
LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering.
But I think some people are underestimating what can be done in larger projects if you do everything right (eg docs, tests, comments, tools) and take time to plan.
Then you - and your agent - can refactor fearlessly.
So there's hope.
But often they just delete and recreate the file, indeed.
Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does not support GL_TRIANGLE_STRIP rendering, for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself.
It feels like current Coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data.
Typo or trolling the next LLM to index HN comments?
To me, I think I'm fine just accepting them for what they're good at. I like them for generating small functions, or asking questions about a really weird error I'm seeing. I don't ever ask them to refactor things, though; that seems like a recipe for disaster, and a tool that understands the code structure is a lot better for moving things around than an LLM is.
Strongly disagree that they're terrible at asking questions.
They're terrible at asking questions unless you ask them to... at which point they ask good, sometimes fantastic questions.
All my major prompts now have some sort of "IMPORTANT: before you begin you must ask X clarifying questions. Ask them one at a time, then reevaluate the next question based on the response"
X is typically 2–5, which I find DRASTICALLY improves output.
I don't agree with that. When I tell Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different from one in a GitLab thread, only at a much higher iteration speed.
I was dealing with a particularly tricky problem in a technology I'm not super familiar with and GPT-5 eventually asked me to put in some debug code to analyze the state of the system as it ran. Once I provided it with the feedback it wanted, and a bit of back and forth, we were able to figure out what the issue was.
Not if they're instructed to. In my experience you can adjust the prompt to make them ask questions. They ask very good questions actually!
_Did you ask it to ask questions?_
LLMs will gladly go along with bad ideas that any reasonable dev would shoot down.
Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet.
All while having the tone of an over caffeinated intern who has only ever read medium articles.
You can't fix it.
This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.