In other words, they're OK in use cases that programmers should be trying to eliminate anyway, because heavy boilerplate means there's high demand for a reusable library, some new syntax sugar, or an improved API.
As I've matured as a developer, I've appreciated certain types of boilerplate more and more because it's code that shows up in your git diffs. You don't need to chase down the code in some version of some library to see how something works.
Of course, not all boilerplate is created equal.
Edit: I'm not going to code asm just to be cool.
Every 'modern' project takes bucketloads of (annoying) setup and plumbing. Even in rather trivial 'start' cases, the LLM has to spend quite a bit of time to get a hello-world thingy working (almost all of that is because 'modern programmers' have some kind of brain damage concerning backward compatibility; move fast and break things between MINOR versions, with no use or reason for those changes whatsoever beyond 'they liked it better', so stuff never works as it says on the product page; heaven help you if you need something exotic added). It's a terrible timeline for programming, but LLMs do fix at least that annoyance by just changing configs/slabs of code until it works.
Edit: most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
The issue is when you have an LLM write 10k lines of code for you and 100 of those lines are bugged. Now you need to debug code you did not write and find the bugged parts, which wastes a similar amount of time. And if you do not catch the bugs in time, you think you gained some hours, but you end up with upset customers because things went wrong in code that is just weird.
From my experience you need to work with an LLM and have the code done function by function, with your input, checking it and calling bullshit when it does stupid things.
Simple repetitive shit is easy to reason about, debug and onboard people on.
Naturally it's a balancing act, and modern/popular frameworks are where most people landed; there's been a lot of iteration in this space for decades now.
After a few years those copy-and-pasted code pieces completely drift apart and create a lot of similar-but-different issues that need to be addressed one by one.
My approach for designing abstractions is always to make them composable (not that enterprise Java inheritance chaos) and to allow escaping them when needed.
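A minimal sketch of what I mean (the names and the fetch example are made up for illustration): small wrappers that compose, with the raw building block still reachable when the abstraction gets in the way.

```python
import json
import urllib.request

def raw_fetch(url):
    """The bare building block; always available as an escape hatch."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def with_json(fetch):
    """Wrap any fetch function so it decodes JSON responses."""
    def fetcher(url):
        return json.loads(fetch(url))
    return fetcher

def with_retry(fetch, attempts=3):
    """Wrap any fetch function with naive retries."""
    def fetcher(url):
        last_error = None
        for _ in range(attempts):
            try:
                return fetch(url)
            except OSError as error:
                last_error = error
        raise last_error
    return fetcher

# Compose only the pieces a call site needs...
api_fetch = with_retry(with_json(raw_fetch))

# ...and when the abstraction gets in the way, drop back to raw_fetch()
# directly: there's no subclass hierarchy to fight.
```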
I mean, is that bad? Unless you keep having to make huge MRs that modify every copy/paste, couldn't you just let the code sit there and run forever?
I only say this because I've been a maintenance programmer and I could only dream of a codebase like this. The idea that I get a Rollbar with a stack trace and the entirety of what the code actually does is laid bare right at the site of the error in a single file is amazing. And I can change it without affecting anything else?! I end up having to "unwind" all of the abstractions anyway because the nature of the job means I'm not intimately familiar with the codebase and don't just know where the real work happens.
The landscape (browser capabilities, backend stacks) has settled over the last ten years.
We even had time to standardize on things somewhat.
Building new abstractions at this point is almost always the wrong move. This is one of the ways LLMs will improve software dev: they will kill the framework churn, because they work best on stuff that's already in their training data.
That sort of stuff can be very helpful to newbies in the DIY electronics world for example.
But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
Stuff like that works amazingly well.
> But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines
This was my opinion 3-6 months ago. But I think a lot of tools matured enough to already provide a lot of value for complex tasks. The difficult part is to learn when and how to use AI.
You also have to be very explanatory with a direct communication style.
Our system imports the codebase so it can search and navigate it, plus we feed LSP errors directly to the LLM as development is happening.
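Roughly this shape, I'd guess (a minimal sketch with made-up helpers, not the commenter's actual system): gather whatever diagnostics the toolchain already produces and splice them into the prompt, so the model sees the same errors the developer does.

```python
def diagnostics_for(path):
    """Return syntax diagnostics for a Python file (empty list if clean)."""
    with open(path, encoding="utf-8") as handle:
        source = handle.read()
    try:
        compile(source, path, "exec")
        return []
    except SyntaxError as error:
        return [f"{path}:{error.lineno}: {error.msg}"]

def build_prompt(task, paths):
    """Combine the task description with the current diagnostics."""
    errors = [line for path in paths for line in diagnostics_for(path)]
    report = "\n".join(errors) if errors else "no diagnostics"
    return f"{task}\n\nCurrent diagnostics:\n{report}"

# The resulting string is what gets sent to whatever LLM API you use;
# a real setup would pull richer diagnostics from an LSP server instead.
```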
Yeah I guess that would help a lot. Stuck with a bit more primitive tools here, so that doesn't help.
Yep, if it's newbie stuff there are enough tutorials out there for the LLM to have data.
It's when you get off the tutorials and do the actual functionality of your project that they fail.
They can't; they usually just don't understand the code well enough to notice the issues immediately.
The perceived quality of LLM answers is inversely proportional to the user's understanding of the topic they're asking about.
When I'm using llama.vim, like 40% of what it writes in a 4-5 line completion is exactly what I'd write. 20-30% is stuff that I wouldn't judge coming from someone else, so I usually accept it. And 30-40% is garbage... but I just write a comment or a couple of lines, instead, and then reroll the dice.
It's like working through a junior engineer, except the junior engineer types a new solution instantly. I can get down to alternating between mashing tab and writing tricky lines.
My comment was more about just asking questions on how to do things you're totally clueless about, in the form of "how do I implement X using Y?" for example. I've found that, as a general rule, if I can't find the answer to that question myself in a minute or two of googling, LLMs can't answer it either the majority of the time. This would be fine if they said "I don't know how to do that" or "I don't believe that's possible" but no, they will confidently make up code that doesn't work using interfaces that don't exist, which usually ends up wasting my time.
But if I'm writing out a bunch of linear algebra, I get a lot of useful completions and avoid tediousness.
I've settled on Qwen2.5.1-Coder-7B-Instruct-Q6_K_L.gguf-- so not even a very big model.
Prompt based stuff, like "extract the filtering part from all API endpoints in folder abc/xyz. Find a suitable abstraction and put this function into filter-utils.codefile"
A few months ago I tried to do a small project with Langchain. I'm a professional software developer, but it was my first Python project. So I tried to use a lot of AI generated code.
I was really surprised that AI couldn't do much more than in the examples. Whenever I had some things to solve that were not supported with the Langchain abstractions it just started to hallucinate Langchain methods that didn't exist, instead of suggesting some code to actually solve it. I had to figure it out by myself, the glue code I had to hack together wasn't pretty, but it worked. And I learned not to use Langchain ever again :)
- "Solve" the issue assigned to me with a bunch of code that looks about right. Passes review and probably not covered by tests anyway.
- Once QA or customers notice it's not working, I can get credit for "solving" the bug as well.
- Repeat for 0 value delivered but infinite productivity points in my next performance review.
I predict that the variance in success in using LLMs for coding (even agentic, multi-step coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:
1) is the language super simple and hard to foot-gun yourself in, with one consistent way to do things
AND
2) do juniors and students tend to use the language, and how much of the online content (Stack Overflow, for example) is written by students, juniors, or bootcamp folks writing incorrect code and posting it online.
What % of the online Golang code is in a GH repo like Docker or K8s vs a student posting their buggy Gomoku implementation on Stack Overflow?
The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.
I don't think that we yet have one language that is good at all that. And yes, I (sometimes) program in Go for a living.
Really?
Logging in Go: A Comparison of the Top 9 Libraries
https://betterstack.com/community/guides/logging/best-golang...
Let's say there are 3k lines of reference material (RFCs, API docs, documentation of niche languages, examples) and 2k lines of code generated by Claude (iteratively, starting small); then I do exceed the limit after a while. In that case I ask it to summarize everything in detail, start a new chat, reuse those 3k lines plus the recent code, and continue ad infinitum.
Claude has done a good job refactoring, though I’ve had to tell it to give me a refactor plan upfront in case the conversation limit gets hit. Then in a new chat I tell it which parts of the plan it has already done.
But a larger context/conversation limit is definitely needed because it’s super easy to fill up.
https://github.com/williamcotton/webdsl
It's a pipeline-based DSL for building web apps with SQL, Lua, jq and mustache templates.
I'd say it's like 90% Cursor Composer in Agent mode.
This is probably more like a mid-sized project, right?
What is the LOC of cgit? Because I made a lot of changes to it, too, albeit privately. I uploaded most files as project files.
You can fit a hell of a lot of functionality in 3k statements. Really, whether it's considered large or small must depend on the functionality it's intended to provide.
So no context, and differences of the definition of "large".
Perhaps if you come from Java, then yeah.
shrugs
I worked on a Python project which I'd consider medium-sized, i.e. not small but not large, and it was around 25kLOC.
My $dayjob is a Delphi codebase with roughly 500kLOC, which I'd say is large but not huge.
Though if you wrote it in something like K[1], then yeah ok, I'd agree 3kLOC probably counts as large.
Common coding tasks are going to be better represented in the training set and give better results.
You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).
That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.
You write tests in the same way as you would when checking your own work or delegating to anyone else?
At some point it's not really worth it anymore to craft the perfect prompt; just code it yourself. That also saves the time spent carefully reviewing the AI-generated code.
First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.
Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”
The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.
It’s a public service: helping the average person learn that AI can’t be trusted to get its facts right
DDG'ing or Googling is still a better resort despite the drop in the quality of results.
Claude at least will give me an example relevant to my code with real world implementation code.
step 2, ??? (it obviously is not generating code)
step 3, profit!
As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, it does not end well.
So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:
- Rewrote tests in a way that the broken behaviour passes the test
- Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive stupid fix for a single specific case */ }`)
- Failed to even update the files properly, wasting a bunch of tokens
I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.
Might be an issue of prompting?
From day one, I've been using LLMs through the API and alternative frontends that let me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):
"You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."
It's as simple as it gets, and it didn't fail me.
EDIT:
Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.
On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic, and it arguably poisons the context with few-shot examples written in a language that isn't the one my project uses.
--
[0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.
[1] - Like std::map<K,T>::contains() - which is an obvious API for such a container, one that's typically available and named that way in any other language or library, and yet only got introduced to C++ in C++20.
[2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.
>You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers.
It would gravitate towards input from people worthy of that description?
Would there be an inverse version of this?
You are a junior developer, you lack experience, you quickly put things together that probably don't work.
Also had this thought...
Humanity has left all coding to LLMs and has hooked up all infrastructure to them. LLMs now run the world. Humanity's survival depends on your ability to solve the following problem:
inb4 you just aren't prompting correctly
So no, it's not a lack of skill in prompting: I've sat down with "prompting" "experts" and universally they overlook glaring issues when assessing how good an answer was. When I tell them where to press further, it breaks down into even worse gibberish.
It’s when you don’t know what you don’t know that they can be harmful. It’s the issue with Stackoverflow but more pronounced.
I don’t want to use LLMs because I think they’re unethical and I don’t want to depend on a tool that requires internet, but I think if you take a disciplined approach then they can really speed up development.
Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
[0] https://www.underhanded-c.org/#winner
[1] https://www.underhanded-c.org/_page_id_17.html
The underhanded C contest is not a case of people accidentally producing highly misleading code, it is a case of very smart people going to a great amount of effort to intentionally do that.
Most of the time, if your code is wrong, it doesn't work in some obvious way – it doesn't compile, it fails some obvious unit tests, etc.
Code accidentally failing in some subtle way which is easy to miss is a lot rarer – not to say it never happens – but it is the exception not the rule. And it is something humans do too. So if an LLM occasionally does it, they really aren't doing worse than humans are.
> You can't substitute an automated process for thinking deeply and carefully about the code.
Coding LLMs work best when you have an experienced developer checking their output. The LLM focuses on the boring repetitive details leaving the developer more time to look at the big picture – and doing stuff like testing obscure scenarios the LLM probably wouldn't think of.
OTOH, it isn't like all code is equal in terms of consequences if things go wrong. There's a big difference between software processing insurance claims and someone writing a computer game as a hobby. When the stakes are low, lack of experience isn't an issue. We all had to start somewhere.
Let's assume that 1 in 10,000 coding sessions produce an innocuous, test-passing function that's catastrophically wrong. If you have a mid to large size company with 1000 devs doing two sessions a day, you'll see one of these a week within that single company. Actually sounds a lot like the IoT industry now that I've written it.
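For what it's worth, the back-of-envelope math lands at about that, assuming a five-day work week:

```python
sessions_per_week = 1000 * 2 * 5   # devs * sessions/day * workdays
failure_rate = 1 / 10_000          # catastrophic-but-test-passing sessions
print(sessions_per_week * failure_rate)  # 1.0 -- roughly one per week
```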
There are a lot of positive things we can do with current model abilities, especially as we make them cheaper, but they aren't at the point where they will be truly destructive (people using them to make bioweapons or employers using them to cause widespread unemployment across industries, or the far more speculative ASI takeover).
It gives society a bit of time to catch up and move in a direction where we can better avoid or mitigate the negative consequences.
It would never get the answer right, often transposing the scores, getting the game location wrong, and on multiple occasions saying a 38-38 draw was an England win.
As in literally saying "England won 38-38".
> LLMs are really smart most of the time.
No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
Seriously, some of these conversations feel like interacting with someone who believes casting bones and astrology are accurate. Likely because in both cases the belief is a result of confirmation bias.
The burden of proof for that claim is on you, we cannot start with the assumption these are intelligent systems and disprove it - we have to start with the fact that training is a non-deterministic process and prove that it exhibits intelligence.
You mean like us? Because it takes many runs and debug rounds to make anything that works. Can you write complex code top-to-bottom in one round, or do you gradually test it to catch the bugs you have "hallucinated"?
Both humans and LLMs run into bugs; the question is whether we can push through to the correct solution or get permanently stuck along the way. And this depends on feedback, and sometimes on access to a testing environment where we can't let the AI run loose.
No, not like us. This constant comparison of LLMs to humans is tiresome and unproductive. I’m not a proponent of human exceptionalism, but pretending LLMs are on par is embarrassing. If you want to claim your thinking ability isn’t any better than an LLM’s, that’s your prerogative, but the dullest human I’m acquainted with is still capable of remembering and thinking things through in a way no LLM does. Heck, I can think of pets which do better.
> Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
I certainly don’t make up methods which don’t exist, nor do I insist over and over they are valid, nor do I repeat “I’m sorry, this was indeed not right” then say the same thing again. I have never “hallucinated” a feature then doubled down when confronted with proof it doesn’t exist, nor have I ever shifted my approach entirely because someone simply said the opposite without explanation. I certainly hope you don’t do that either.
> I have never “hallucinated” a feature
You never mistook an argument, or forgot one of the 100 details we have to mind while writing complex apps?
I would rather measure humans vs AI not in basic skill capability but in autonomy. I think AI still has a lot of catching up to do on that front.
The amount of code out there lets LLMs learn on examples of a lot of tasks and pass SWE benchmarks. Smart autocomplete and solved problem lookup has value. Even if it doesn’t always work correctly, and doesn’t know what it knows.
And for non-programmers they’re indistinguishable from programmers. They produce working code.
It’s easy to see how a product manager or a designer or even a manager that didn’t code for years in a large company can think they’re almost as good as devs.
I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.
In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run on every invocation of the program.
As usual, the defenses against this are extensive unit-test coverage and/or static typing.
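A tiny made-up example of the failure mode: the broken call lives in a branch that rarely runs, so nothing complains until that branch is hit in production, and only a test (or a type checker) that exercises the branch catches it earlier.

```python
def shipping_note(order):
    if order.get("express"):
        # Bug: dicts have no .fetch() method, but Python won't complain
        # until the first express order actually reaches this line.
        return "ships " + order.fetch("ship_date")
    return "standard shipping"

print(shipping_note({"express": False}))   # works fine every day
# shipping_note({"express": True})         # AttributeError at runtime

def test_express_order():
    # A test that exercises the rare branch surfaces the bug immediately.
    assert "ships" in shipping_note({"express": True, "ship_date": "Friday"})
```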
Right, that's the exact point I make in my blog post: you have to TEST the code - not just with automated tests, you have to actually try it out yourself as well. https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
Semantics perhaps, but that’s my take.
Prompt injection (hidden or not) is another insane vulnerability vector that can't easily be fixed.
You should treat any output of an LLM the same way as untrusted user input. It should be thoroughly validated and checked if it's used in even remotely security-critical applications.
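Concretely, that might look like this minimal sketch (the expected JSON shape here is made up): parse and validate the model's reply before anything downstream acts on it, exactly as you would with user-submitted data.

```python
import json

ALLOWED_ACTIONS = {"create", "update", "delete"}

def parse_llm_reply(reply: str) -> dict:
    """Reject anything that doesn't match the expected shape."""
    data = json.loads(reply)  # raises on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {data.get('action')!r}")
    if not isinstance(data.get("id"), int):
        raise ValueError("id must be an integer")
    return data

# Replies that fail validation get logged, rejected, or retried -- never
# passed straight into a shell command, SQL query, or privileged API call.
```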
https://www.tomshardware.com/tech-industry/artificial-intell...
No, it’s presented in the training data as an idea for an interface - the LLM took that and presented it as an existing solution.
It's more complicated than what happens with Markov chain models but you can use them to build an intuition for what's happening.
Imagine a very simple Markov model trained on these completely factual sentences:
- "The sky is blue and clear"
- "The ocean is blue and deep"
- "Roses are red and fragrant"
When the model is asked to generate text starting with "The roses are...", it might produce: "The roses are red and deep"
This happens not because any training sentence contained incorrect information, but because the model learned statistical patterns from the text, as opposed to developing a world model based on physical environmental references.
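Here's that intuition as a minimal sketch, a word-level bigram chain trained only on the true sentences above (a real LLM conditions on far more context, but the splicing effect is analogous):

```python
import random
from collections import defaultdict

corpus = [
    "the sky is blue and clear",
    "the ocean is blue and deep",
    "roses are red and fragrant",
]

# Record which word followed which in the (entirely factual) corpus.
transitions = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate(prompt, max_words=6):
    words = prompt.split()
    while len(words) < max_words and transitions[words[-1]]:
        words.append(random.choice(transitions[words[-1]]))
    return " ".join(words)

print(generate("the roses are"))
# May print e.g. "the roses are red and deep": every individual transition
# was seen in a true sentence, yet the statement as a whole is false.
```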
>> "The more you can see the inputs and outputs as blobs of "stuff," the better. If LLMs think, it's not in any way we yet understand. They're probability engines that transform data into different data using weighted probabilities."
Stuff in, stuff out.
> to quote karpathy: "I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."
https://nicholas.carlini.com/writing/2025/forecasting-ai-202... (click the button to see the study then scroll down to the hallucinations heading)
The training data is the Internet. It has mistakes. There's no available technology to remove all such mistakes.
Whether LLMs hallucinate only because of mistakes in the training data or whether they would hallucinate even if we removed all mistakes is an extremely interesting and important question.
Sometimes an LLM will hallucinate a flag or option that really makes sense - it just doesn't actually exist.
I'm not saying this is the case, but LLMs are often wrong in subtle ways like this.
It'll be like a slack support channel, for robots.
A while back a friend of mine told me he's very fond of LLMs because he finds the Kubernetes CLI confusing, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.
Well... sure, but if you looked up the answer on Stack Overflow you'd see the whole thread including comments, and you'd have the opportunity to understand what the command actually does.
It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.
If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.
There is no self-awareness about accuracy when the model cannot provide any kind of confidence score. Couching all of its replies in "this is AI so double check your work" is not self-awareness or even close; it's a legal disclaimer.
And as the other reply notes, are you a bot or just heavily dependent on them to get your point across?