As much as I've agreed with the author's other posts/takes, I find myself resisting this one:
> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.
No, that does not follow.
1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from spending the same time reviewing anonymous contributions.
2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.
3. Motivation is important, for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.
However you do it, somebody else should still be reviewing the change afterwards.
I've spent a lot of time reviewing code and doing code audits for security (far more than the average engineer) and reading code still takes longer than writing it, particularly when it is dense and you cannot actually trust the comments and variable names to be true.
AI is completely untrustworthy in that sense. The English and the code have no particular reason to align, so you really need to read the code itself.
These models may also use unfamiliar idioms whose edge cases you don't know, so you either have to fight the model to do it a different way, or go investigate the idiom and think through the edge cases yourself if you really want to understand it.
I think most people don't read the code these models produce at all: they just click accept, then see whether the tests pass or eyeball the output manually.
I am still trying to give it a go, and sometimes it really does make simpler tasks easier and I am blown away, and it has been getting better. But I feel like I need to set myself a hard timeout with these tools: if they haven't done basically what I wanted quickly, I should just start from scratch, since the task is beyond them and I'll only lose more time to the back and forth.
They are useful for giving me the motivation to do things I'm avoiding because they're too boring, though: after fighting with them for 20 minutes, I'm ready to go write the code myself.
With humans you can be reasonably sure they've followed through with a mostly consistent level of care and thought. LLMs will just outright lie to make their jobs easier in one section while generating high-quality code in another.
I've had to do a 'git reset --hard' after trying out Claude Code and spending $20. It always seems great at first, but it just becomes nonsense on larger changes. Maybe chain-of-thought models do better, though.
I asked Gemini for the lyrics of a song that I knew was on all the main lyrics sites. It gave me the lyrics to a different song with the same title. On the second try, it hallucinated a batch of lyrics. Third time, I gave it a link to the correct lyrics, and it "lied" and said it had consulted that page to get it right but gave me another wrong set.
It did manage to find me a decent recipe for chicken salad, but I certainly didn't make it without checking to make sure the ingredients and ratios looked reasonable. I wouldn't use code from one of these things without closely inspecting every line, which makes it a pointless exercise.
I'm surprised it didn't outright reject your request to be honest.
Interesting is a very kind word to use there
And even if they fail, other humans are more likely to fail in ways we are familiar with and can internally model and anticipate ourselves.
I can browse through any Java/C#/Go code and without actually reading every keyword see how it flows and if there's something "off" about how it's structured. And if I smell something I can dig down further and see what's cooking.
If your chosen language is difficult/slow to read, then it's on you.
And stuff should have unit tests with decent coverage anyway, those should be even easier for a human to check, even if the LLM wrote them too.
But on the flip side, this type of code is intrinsically less valuable than novel stuff ("convert this signed distance field to a mesh") which an LLM will choke on.
Perhaps I'm just not that great of a coder, but I do have lots of code where, if someone took a look at it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that; they won't take risks a human would, or understand the implications of a block of code beyond its application in that specific context.
Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.
- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, its contract with other modules, etc.
- review the code: this can be construed as negating the point of LLMs, since it is time-consuming, but I think it is important to go through it line by line and understand every line. You will absorb some of the LLM-generated code in the process, which will form an imperfect map in your head. That's essential for beginning troubleshooting the next time things go wrong.
- last mile connectivity: several times the LLM takes you there but can't complete the last mile connectivity; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.
My solution for this is documentation, automated tests and sticking to the same conventions and libraries (like using Click for command line argument parsing) across as many projects as possible. It's essential that I can revisit a project and ramp up my mental model of how it works as quickly as possible.
I talked a bit more about this approach here: https://simonwillison.net/2022/Nov/26/productivity/
"Short cuts make long delays." --Tolkien
How does doing the hard part provide a shortcut for reviewing all the LLM code?
If anything it's a long cut, because now you have to understand the code and write it yourself. This isn't great, it's terrible.
I think that just because someone might be more or less eloquent than someone else, the value of their thoughts and contributions shouldn't be weighed any differently. In a way, AI formatting and grammar assistance could be a step towards a more equitable future, one where ideas are judged on inherent merits rather than superficial junk like spel;ng or idk typos n shi.t
However, I think what the parent commenter (and I) might be saying is that it seems you're relying on AI for more than just help expressing yourself—it seems you're relying on it to do the thinking too. I'd urge you to consider if that's what you really want from a tool you use. That said, I'm just some random preachy-judgy stranger on the internet, you don't owe me shit, lol
(Side notes I couldn't help but include: I think talking about AI and language is way more complicated (and fascinating) than just that aspect, including things I'm absolutely unqualified to comment on—discrimination against AAVE use, classism, and racism can't and shouldn't be addressed by a magic-wand spell-checker that "fixes" everyone's speech to be "correct" (as if a sole cultural hegemony or way of speech is somehow better than any other))
> I think that just because someone might be more or less eloquent than someone else, the value of their thoughts and contributions shouldn't be weighed any differently. In a way, AI formatting and grammar assistance could be a step towards a more equitable future, one where ideas are judged on inherent merits rather than superficial junk like spel;ng or idk typos n shi.t
I guess I must come clean that my reply was sarcasm which obviously fell flat and caused you to come to the defense of those who can't spell - I swear I don't have anything against them.
> However, I think what the parent commenter (and I) might be saying is that it seems you're relying on AI for more than just help expressing yourself—it seems you're relying on it to do the thinking too. I'd urge you to consider if that's what you really want from a tool you use. That said, I'm just some random preachy-judgy stranger on the internet, you don't owe me shit, lol
You and presumably the parent commenter have missed the main point of the retort - you are assuming I am relying on AI for my content or its style. It is neither - I like writing point-wise in a systematic manner, always have, always will - AI or no-AI be damned. It is the all-knowing veil-piercing eagle-eyed deduction of random preachy-judgy strangers on the internet about something being AI-generated/aided just because it follows structure, that is annoying.
This is a degree of humility that made the scenario we are in much clearer.
Our information environment got polluted by the lack of such humility. Rhetoric that sounded ‘right’ is used everywhere. If it looks like an Oxford Don, sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.
Verification is at the heart of everything humanity does: our governance structures, our judicial systems, economic systems, academia, news, media - everything.
It’s a massive computational effort to figure out the best ways to allocate resources given current information, allowing humans to create surplus and survive.
This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.
But the most troublesome thing to me is that it just "pisses" out code with no afterthought about the problem it is solving or the person it is talking to.
The number of times I have to repeat myself just to get a stubborn answer with no discussion is alarming. It does not benefit my well-being and is annoying to work with except for a bunch of exploratory cases.
I believe LLMs are actually the biggest organized data heist. We believe that those models will get better at doing their jobs, but the reality is that we are just giving away code, knowledge, and ideas at scale, correcting the model for free, and paying to be allowed to do so. And when we see the 37% minimum hallucination rate, we can more easily understand that the actual thought comes from the human using it.
I'm not comfortable having to argue with a machine and have to explain to it what I'm doing, how, and why - just to get it to spam me with things I have to correct afterwards anyway.
The worst is, all that data is the best insight on everything. How many people ask for X ? How much time did they spend trying to do X ? What were they trying to achieve ? Who are their customers ? etc...
When people talk about 30% or 50% coding productivity gains with LLMs, I really want to know exactly what they're measuring.
puzzled. if you don't understand it fully, how can you say that it will look great to you, and that it will work?
>(in other words, I can't prove the correctness ... without referencing the original paper).
agrees with what I said in my previous comment:
>if you don't understand it fully, how can you say .... that it will work?
(irrelevant parts from our original comments above, replaced with ... , without loss of meaning to my argument.)
both those quoted fragments, yours and mine, mean basically the same thing, i.e. that both you and the GP don't know whether it will work.
it's not that one cannot use some piece of code without knowing whether it works; everybody does that all the time, from algorithm books for example, as you said.
Presumably, that simply reflects that a primary developer always has an advantage of having a more reliable understanding of a large code base - and the insights into the problem that come about during development challenges - than a reviewer of such code.
A lot of important but subtle insights into a problem, many of them sub-verbal, come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.
Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.
‘Works for me’ isn’t actually _that_ useful a signal without serious qualification.
yes, and it sounds a bit like "works on my machine", a common cop-out which I am sure many of us have heard of.
google: works on my machine meme
testing can prove the presence of errors, but not their absence.
https://www.google.com/search?q=quote+testing+can+prove+the+...
- said by Steve McConnell (author of Code Complete), Edsger Dijkstra, etc. ...
For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.
Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.
- generate
- lint
- format
- fuzz
- test
- update
infinitely?
Sorry, I'm failing to see your point.
Are you implying that the above is good enough, for a useful definition of good enough? I'm not disagreeing, and in fact that was my starting assumption in the message you're replying to.
Crap code can pass tests. Slow code can pass tests. Weird code can pass tests. Sometimes it's fine for code to be crap, slow, and/or weird. If that's your situation, then go ahead and use the code.
To expand on why someone might not want such code, think of your overall codebase as having a time budget, a complexity budget, a debuggability budget, an incoherence budget, and a maintenance budget. Yes, those overlap a bunch. A pile of AI-written code has a higher chance of exceeding some of those budgets than a human-written codebase would. Yes, there will be counterexamples. But humans will at least attempt to optimize for such things. AIs mostly won't. The AI-and-AI-using-human system will optimize for making it through your lint-fuzz-test cycle successfully and little else.
Different constraints, different outputs. Only you can decide whether the difference matters to you.
Just recently I think here on HN there was a discussion about how neural networks optimize towards the goal they are given, which in this case means exactly what you wrote, including that the code will do stuff in wrong ways just to pass the given tests.
Where do the tests come from? Initially from a specification of what "that thing" is supposed to do and also not supposed to do. Everyone who had to deal with specifications in a serious way knows how insanely difficult it is to get these right, because there are often things unsaid, there are corner cases not covered and so on. So the problem of correctness is just shifted, and the assumption that this may require less time than actually coding ... I wouldn't bet on it.
Conceptually the idea should work, though.
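To make the corner-case problem concrete, here's a toy C sketch (mine, not from the thread): a test suite that only pins down the obvious inputs will happily pass an implementation that is undefined for INT_MIN, so "all tests green" proves less than it seems.

    #include <assert.h>
    #include <limits.h>

    /* Passes every test below, yet my_abs(INT_MIN) is undefined behaviour
       (signed overflow) - the corner case the specification never mentioned. */
    static int my_abs(int x) {
        return x < 0 ? -x : x;
    }

    int main(void) {
        assert(my_abs(5) == 5);
        assert(my_abs(-3) == 3);
        assert(my_abs(0) == 0);
        /* Nothing here constrains the behaviour for INT_MIN, so a green
           test run says nothing about it. */
        return 0;
    }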
it's what our lord and savior jesus christ uses for us humans; if it's good for him, it's good enough for me. and surely google is not laying off 25k people because it believes humans are better than their LLMs :)
All of these can vary wildly in quality. Maybe it's because I mostly use coding LLMs as either a research tool or to write reasonably small and easy-to-follow chunks of code, but I find it no different from all of the other types of reading and understanding other people's code I already have to do.
Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code than when reading someone else's code.
Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.
You don't know that though. There's no "it must work" criteria in the LLM training.
Claude Code has been doing all of this for me on my latest project. It's remarkable.
It seems inevitable it'll get there for larger and more complex code bases, but who knows how far away that is.
If you don’t understand it, ask the LLM to explain it. If you fail to get an explanation that clarifies things, write the code yourself. Don’t blindly accept code you don’t understand.
This is part of what the author was getting at when they said that it’s surfacing existing problems not introducing new ones. Have you been approving PRs from human developers without understanding them? You shouldn’t be doing that. If an LLM subsequently comes along and you accept its code without understanding it too, that’s not a new problem the LLM introduced.
At least when a human wrote it, someone understood the reasoning.
I was appalled when I was being effusively thanked for catching some bugs in PRs. “No one really reads these,” is what I was told. Then why the hell do we have a required review?!
This is what tests are for.
I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.
You can't just rewrite everything to match your style. You take what's in there and adapt to the style, your personal preference doesn't matter.
Writing is a very solid choice as an approach to understanding a novel problem. There's a quip in academia - "The best way to know if you understand something is to try teaching it to someone else". This happens to hold true for teaching it to the compiler with code you've written.
You can't skip details or gloss over things, and you have to hold "all the parts" of the problem together in your head. It builds a very strong intuitive understanding.
Once you have an intuitive understanding of the problem, it's very easy to drop into several different implementations of the solution (regardless of the style) and reason about them.
On the other hand, if you don't understand the problem, it's nearly impossible to have a good feel for why any given solution does what it does, or where it might be getting things wrong.
---
The problem with using an AI to generate the code for you is that unless you're already familiar with the problem you risk being completely out of your depth "code reviewing" the output.
The difficulty in the review isn't just literally reading the lines of code - it's in understanding the problem well enough to make a judgement call about them.
I'm pretty sure mentally rewriting it requires _more_ effort than writing it in the first place. (maybe less time though)
100%. Case in point for case in point - I was just scratching my head over some Claude-produced lines for me, thinking if I should ask what this kind entity had in mind when using specific compiler builtins (vs. <stdatomic.h>), like, "is there logic to your madness..." :D
size_t unique_ips = __atomic_load_n(&((ip_database_t*)arg)->unique_ip_count, __ATOMIC_SEQ_CST);
I think it just likes compiler builtins because I mentioned GCC at some point... The requirements have to come from somewhere, after all.
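For comparison, here is roughly what the same load looks like with C11 <stdatomic.h> instead of the GCC builtin - a sketch that assumes a struct shaped like the one in the snippet above, with the counter declared _Atomic:

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical struct mirroring the snippet above; the counter must be
       declared _Atomic for <stdatomic.h> to apply. */
    typedef struct {
        _Atomic size_t unique_ip_count;
        /* ... other fields ... */
    } ip_database_t;

    size_t load_unique_ips(void *arg) {
        ip_database_t *db = (ip_database_t *)arg;
        /* Equivalent of __atomic_load_n(..., __ATOMIC_SEQ_CST) */
        return atomic_load_explicit(&db->unique_ip_count, memory_order_seq_cst);
    }

Same semantics, but portable C11 rather than a GCC/Clang extension, which is usually the deciding factor between the two.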
I know what Spock would say about this approach, and I'm with him.
Donald E. Knuth
If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.
Net-benefit in what terms though? More productive WRT raw code output? Lower error rate?
Because, something about the idea of generating tons of code via LLMs, which humans have to then verify, seems less productive to me and more error-prone.
I mean, when verifying code that you didn't write, you generally have to fully reason through it, just as you would to write it (if you really want to verify it). But, reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
OTOH, if you just breeze through it because it looks correct, you're likely to miss errors.
The latter reminds me of the whole "Full self-driving, but keep your hands on the steering wheel, just in case" setup. It's going to lull you into overconfidence and passivity.
This is actually a trick, though. No one working on self-driving expects people to babysit it for long at all. Babysitting feels worse than driving. I just saw a video on self-driving trucks and how the human driver had his hands hovering over the wheel. The goal of the video is to make you think about how amazing self-driving rigs will be, but all I could think about was what an absolutely horrible job it will be to babysit these things.
Working full-time on AI code reviews sounds even worse. Maybe if it's more of a conversation and you're collaboratively iterating on small chunks of code then it wouldn't be so bad. In reality though, we'll just end up trusting the AI because it'll save us a ton of money and we'll find a way to externalize the screw ups.
And, in my experience, it’s a lot easier to latch on to a real person’s real line of reasoning rather than a chatbot’s “line of reasoning”
And you can discuss these, with both of you hopefully having experience in the domain.
If you employ AI, you're adding a remarkable amount of speed, to a processing domain that is undecidable because most inputs are not finite. Eventually, you will end up reconsidering the Gambler's Fallacy, because of the chances of things going wrong.
They provide a task well-represented in the LLM's training data, so development should be easy. The task is presented as a cumulative series of modifications to a codebase:
https://www.youtube.com/watch?v=NW6PhVdq9R8
This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.
While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - to see the model latch on, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursue some part of earlier code that makes no sense to me, and then 'coaxing' it again.
This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".
The opaque wall that separates the solution from the problem in technology often comes from the very steep initial learning curve. The reason most people who are developers now learned to code is because they had free time when they were young, had access to the technology, and were motivated to do it.
But as an adult, very few people are able to get past the first obstacles which keep them from eventually becoming proficient, but now they have a cheat code. So you will see a lot more capable programmers in the future who will be able to help you fix this backlog of bad code -- we just have to wait for them to gain the experience and knowledge needed before that happens and deal with the mistakes along the way.
This is no different from any other enabling technology. The people who feel like they had to struggle through it and pay their dues when it 'wasn't easy' are going to be resentful and try and gatekeep; it is only human nature.
Coding is unique. One can't replace considered, forward-thinking data flow design reasoning with fancy guesswork and triage.
Should anyone build a complex brick wall by just iterating over the possible solutions? Hell no. That's what expertise is for, and that is only attained via hard graft, and predicting the next word is not going to be a viable substitute.
It's all a circle jerk of people hoping for a magic box.
Are you really unique because you are one of only a few special people who can code because of some innate ability? Or is it that you have above average intelligence, have a rather uncommon but certainly not rare ability to think a certain way, and had an opportunity and interest which honed those talents to do something most people can't?
How would you feel if you never had access to a computer with a dev environment until you were an adult, and then someone told you not to bother learning how to code because you aren't special like they are?
The 'magic box' is a way to get past the wall that requires people to spend 3 hours trying to figure out what python environments are before they can even write a program that does anything useful.
But it all compounds. Going from reading to doing takes little time and I’m able to use much denser information repositories.
If you have to spend three hours reading about python environments, that’s just a signal that your foundation is lacking (you don’t know how your tools work). Using LLM is flying blind and hoping you will land instead of crashing.
One quibble, however, is that python environments are a mess (as is any 3rd party software use in any environment, in my limited experience), and I refuse to use any such thing, when at all possible. If I couldn't directly integrate that code into my codebase, I won't waste my time, because every dependency is another point of failure, either the author's or (more likely) that I might muck up my use of it. Then, there are issues such as versioning, security, and even the entire 3rd party management software itself. It does not look like it will actually save me any time, and might end up being a huge drag on my progression.
That said, using an LLM for ANYTHING is super risky IMO. Like you said, a person should read about the thing they want to utilize, and then incrementally build up the needed skills and experience by using it.
There are many paths in life that have ZERO shortcuts, but there are also many folks who refuse to acknowledge that difficult work is sometimes absolutely unavoidable.
I'm talking about the fact that programming is a unique human endeavor, and a damned difficult one at that.
> How would you feel if you never had access to a computer with a dev environment until you were an adult, and then someone told you not to bother learning how to code because you aren't special like they are?
I would never say some stupid shit like that, to anyone, ever. If they want to do it, I would encourage them and give them basic advice to help them on their way. And I IN NO WAY believe that I am more talented at programming than ANYONE else on Earth. The experience I have earned from raw, hard graft across various programming environments and projects is my only advantage in a conversation about software development. But I firmly believe that a basic linux install and python, C, and bash will be enough to allow anyone to reach a level of basic professional proficiency.
You are WAY out of pocket here, my friend, or perhaps you just don't understand English very well.
> When did you learn to code? What access did you have to technology when you started? How much free time did you have? What kind of education did you have?
Getting to learn BASIC on an Apple (2e?) in 6th grade was fantastic for me; it was love at first goto. But having a C64 in 9th Grade was pivotal to the development of my fundamental skills and mindset, and I was very lucky to be in a nice house with the time to write programs for fun, and an 11th grade AP CS course with a very good teacher and TRS80s. But we were very much lower middle class, which factored into my choice of college and how well I did there. But, absolutely, I am a very, very lucky human being, yet tenacity via passion is the key to my success, and is not beyond ANYONE else.
> The 'magic box' is a way to get past the wall that requires people to spend 3 hours trying to figure out what python environments are before they can even write a program that does anything useful.
If you say so, but no one should be learning to program in a specific python env or doing anything "useful" except for personal exploration or rudimentary classwork.
Educating ourselves about how to logically program -- types, vars, fcts, files -- is our first "useful" programming any of us will be able to do for some years, which is no different than how an auto mechanic will ramp up to professional levels of proficiency, from changing oil to beyond.
With the internet in 2025, however, I'm sure people can learn more quickly, but if and only if they have the drive to do so.
These code AIs are just going to get better and better. Fixing this "tsunami of bad code" will consist of just passing it through the better AIs that will easily just fix most of the problems. I can't help but feel like this will be mostly a non-problem in the end.
At this point in time there's no obvious path to that reality, it's just unfounded optimism and I don't think it's particularly healthy. What happens 5, 10, or 20 years down the line when this magical solution doesn't arrive?
What you want is an LLM that is exceptionally good at completely rewriting a poorly written codebase spanning tens or hundreds of thousands of lines of code, which works reliably with minimal oversight and without introducing hundreds of critical and hard to diagnose bugs.
Not realizing that these tasks are many orders of magnitudes apart in complexity is where the "unfounded optimism" comment comes from.
You can claim that continued progression is speculative, and some aspects are, but it's hardly "an article of faith", unlike "we've suddenly hit a surprising wall we can't surmount".
Except that's not how it's actually gone. It's more like, improvements happen in erratic jumps as new methods are discovered, then improvements slow or stall out when the limits of those methods are reached.
https://hai.stanford.edu/news/ais-ostensible-emergent-abilit...
And really, there was a version of what I'm talking about in the shorter timespan with LLMs - OpenAI's GPT models existed for several years before someone got the idea to put it behind a chat interface and the popularity / apparent capability exploded a few years ago.
That's exactly what I said in the post you responded to: there weren't erratic jumps, there was steady progress over decades.
* Granted we don't know for sure it'll be short this time, but hints are that we're starting to hit that wall with improvements slowing down.
[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...
[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...
I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.
I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.
> Me: Hi! Could you please help me find the problem with some code?
> ChatGPT: Of course! Show me the code and I'll take a look!
> Me: [bunch o' code]
> ChatGPT: OK, it looks like you're trying to [do thing]. What did you want help with?
> Me: I'm trying to find a problem with this code.
> ChatGPT: Sure, just show me the code and I'll try to help!
> Me: I just pasted it.
> ChatGPT: I can't see it.
Lying will goad a person into trying again; the brutally honest truth will stop them like a brick wall.
But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.
I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.
I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.
> People like me can use them but others seem to be killed when making contact.
If I misread that, fair enough.
I recognize that the view that others should not be permitted things that I should be allowed to use is generally a sarcastically expressed view, but I genuinely think it has merit. Everyone who believes these things are dangerous and everyone to whom this is obviously dangerous, like the aforementioned mentally deficient individual, shouldn't be permitted use.
I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.
Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.
Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.
Aside from demonstrating the persistent AI woo that permeates many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kids do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI.
These are people who have jobs and apartments and are able to post online about their problems in complete sentences. If they're not "of sound mind," we have a lot more mentally unstable people running around than we like to think we do.
So what do you believe should be the case? That AI in any flexible communicative form be limited to a select number of people who can prove they're of sound enough mind to use it unfiltered?
You see how similar this is to historical nonsense about restricting the loaning or sale of books on certain subjects only to people of a certain supposed caliber or authority? Or banning the production and distribution of movies that were claimed to be capable of corrupting minds into committing harmful and immoral acts. How stupid do these historical restrictions look today in any modern society? That's how stupid this harping about the dangers of AI chatbots will look down the road.
The limitation of AI because it may or may not cause some people to do irrational things not only smacks of a persistent AI woo on this site, which drastically overstates the power of these stochastic parrot systems, but also seems to forget that we live in a world in which all kinds of information triggers could maybe make someone make stupid choices. These include books, movies, and all kinds of other content produced far more effectively and with greater emotional impact by completely human authors.
By claiming a need for regulating the supposed information and discourse dangers of AI chat systems, you're not only serving the cynically fear-mongering arguments of major AI companies who would love such a regulatory moat around their overvalued pet projects, you're also tacitly claiming that literature, speech and other forms of written, spoken or digitally produced expression should be restricted unless they stick to the banally harmless, by some very vague definitions of what exactly harmful content even is.
In sum, fuck that and the entire chain of implicit long-used censorship, moralizing nannyism, potential for speech restriction and legal over-reach that it so bloody obviously entails.
For various reasons, I don't believe that, which is why my argument is predicated on them improving over time. Obviously current models aren't overly hazardous in the sense I posit - it's a concern for future models that are stronger, or explicitly trained to be more engaging and/or convincing.
The load bearing element is the answer to: "are models becoming more convincing over time?" not "are they very convincing now?"
> [..] I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot [..]
Then you're not engaging with the premise at all, and are attacking a point I haven't made. The tautological assurance that non-convincing AI is not convincing is not relevant to a concern predicated on the eventual existence of highly convincing AI: that sufficiently convincing AI is hazardous due to induced loss of control, and that as capabilities increase the loss of control becomes more difficult to resist.
Persuasion is mostly about establishing that doing or believing what you're telling them is in their best interest. If all my friends start telling me a piece of information, belief in that information has a real interest to me, as it would help strengthen social bonds. If I have a consciously weakly held belief in something, then a compelling argument would consist of providing enough evidence for a viewpoint that I could confidently hold that view and not worry I'll appear misinformed when speaking on it.
Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.
To argue that an AI could become persuasive to the point of mind control is to assert that one can compel a belief in another without the ability to take real-world action.
The absolute worst case scenario for a rogue AI is it leveraging people's belief in it to compel actions in others by way of a combination of blackmail, rewards, and threats of compelling others to commit violence on its behalf by a combination of the same.
We already live in a world with such artificial intelligences: we call them governments and corporations.
That's reasonable, and I really do hope this keeps on being the case. However, I would nit that I see this as a continuum rather than a phase change. That is, I think hazard smoothly increases with persuasiveness. I can point to some far-off region and say, "oh, that seems quite concerning," but the hazard doesn't only begin there.
Persuasiveness below the threshold of 'instant mind control' is still a hazard. Hanging out with salesmen on the job is likely to loosen your wallet, even if it isn't guaranteed.
> If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.
I'd base my counter on the notion that the problem of persuasion is harder when you have less information about whom you're trying to convince.
To expand on the intuition behind that: advertisement-persuasion is hard in a way that conversational-persuasion is not. Shilling in conversational contexts (word of mouth) is more effective than generic advertisement.
A message that will convince one specific person is easier to generate than a message that will convince any random 10 people.
This proceeds to the idea that information about a person-under-persuasion is akin to power over them. Knowing not only what you believe but why you believe it and what else you believe adjacent to it and what you want is a force multiplier in this regard.
And so we get to AI models, which gather specific information about the mind of each person they interact with. The message is tailored to you and you alone, it is not a wide spectrum net cast to catch the largest possible number. Advertisements are qualitatively different; they do not 'pick your brain' nearly so much as the model does.
> Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.
> To argue that an AI could become persuasive to the point of mind control is to assert that one can compell a belief in another without the ability to take real-world action.
I don't agree with this because I don't agree with the premise that you must use a 'principled' approach to convince someone as you've described. People use heuristics to decide what to believe.
By dint of the bitter lesson, I think superhuman persuasion will involve stupid tricks of no particular principled basis that take advantage of 'invisible' vulnerabilities in human cognition.
That is, I don't think those 'reasons to believe the belief' matter. A child will believe the voice of their parents; it doesn't necessarily register that it's in their best interest or it will be bad for them if they don't. Bootstrapping children involves exploiting vulnerabilities in their psyche via implicit trust. Will the AI speak in the voice of my father, as I might hear it in prelingual childhood? Are all such mechanisms gone by adulthood? Is there anything like a generalized follow-the-leader-with-leader-detection pattern?
How hard is it for gradient descent to fit a solution to the boundaries of such heuristics?
This is however, getting into the weeds of exact mechanisms which I'm not too concerned with. I believe (but can't prove) that exploits of that nature exist (or that similarly effective means exist), and that they can be found via brute force search. I think the dominant methodology of continuously training chat models on conversational data those same models participate in is among the likeliest of ways to get to that point.
Ultimately, so long as there's no directed pressure to force people into contact with very convincing model output (see your rogue AI scenario), it doesn't seem that hard to make it safe: limit direct contact and/or require that tooling limits contact by default. Avoid multi-turn refinement and conversational history (amplification of persuasive power via mechanism described above). Treat it like a spinning blade and be it on your own head if you want to break yourself.
However, as I mentioned in my original comment, it will take blood for the inking. The incentives don't align to guard against this class of hazard from the get-go or even admit it is possible (merely to produce appearances of caring about 'safety' (read: our model won't do scary politically incorrect things!)), so we're going to see what happens when you mindlessly expose millions of people to it.
In reality, even if they improve to be completely indistinguishable from the sharpest and most persuasive of human minds our society has ever known, I'd still make exactly the same arguments as above. I'd make these for the same reason that I'd argue that no regulatory body or self-appointed filter of moral arbiters should be able to restrict the specific arguments and forms of expression currently available to persuasive human beings, or people of any kind.
Just as we shouldn't prohibit literature, film, internet blog posts, opinion pieces in media and any other sources by which people communicate their opinions and information to others under the argument that such opinions might be "harmful" , I wouldn't regulate AI sources of information and chatbots.
One can make an easy case for regulating and punishing the acts people try to perform based on information they obtain from AI, in terms of the measurable harm these acts would cause to others, but banning a source of information based on a hypothetical, ambiguous danger of its potential for corrupting minds is little different from the idiocy of restricting free expression because it might morally corrupt supposedly fragile minds.
First, you argued the implausibility of strong persuasion. Your rhetoric was effectively "look how silly this whole notion of a machine persuading someone of something is, because how dumb would you need to be for this silly thing to convince you to do this very bad thing?"
That is then used to fuel an argument that I am merely propagating AI woo, consumed by magical thinking, and clearly am just afraid of something equivalent to violent video games and/or movies. The level of inferential contortion is difficult to wrap my head around.
Now, you seem to be arguing along an entirely different track: that AI models should have the inalienable right to self expression, for the same reason humans should have that right (I find it deeply ironic that this is the direction you'd choose after accusations of AI woo, but I digress). Or, equivalently, that humans should have the inalienable right to access and use such models.
This is no longer an argument about the plausibility of AI being persuasive, or that persuasion can be hazardous, but that we should permit it in spite of any of that because freedom of expression is generally a good thing.
(This is strange to me, because I never argued that the models should be banned or prohibited, merely that tooling should try to avoid direct human-to-model-output contact, as such contact (when model output is sufficiently persuasive) is hazardous. Much like how angle grinders or power tools are generally not banned, but have safety features preventing egregious bodily harms.)
> In reality, even if they improve to be completely indistinguishable from the sharpest and most persuasive of human minds our society has ever known, I'd still make exactly the same arguments as above.
While my true concern is systems of higher persuasiveness than humans have ever been exposed to, let's see:
> I have a hard time imagining [the most persuasive of human minds our society has ever known] convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
This is immediately falsified by the myriad examples of exactly this occurring, via a much lower bar than 'most persuasive person ever'. Hmm. Strange wonder that it requires a sarcastic caricature to not immediately seem like a nonsense argument.
Considering my entire position is simply that exposure to persuasion can be hazardous, I don't see what you're trying to prove now. It's certainly not in opposition to something I've said.
As it does seem you have shifted perspectives to the moral rather than the mechanistic, and that you've conceded that persuasion carries with it nontrivial hazard (even if we should entertain that hazard for the sake of our freedoms), are we now determining how much risk is acceptable to maintain freedoms? I'm not interested in having that discussion, as I don't purport to restrict said freedoms in any case.
Going back to the power tool analogy, you are of course free to disable safety precautions on your own personal angle grinder. At work, some sort of regulatory agency (OSHA, etc) will toil to stop your employer from doing so. I, personally, want a future of AI tooling akin to this. If AI are persuasive enough to be hazardous, I don't want to be forced by my employer to directly consume ultra-high-valence generated output. I want such high-valence content to be treated as the light of an arc-welder, something you're required to wear protection to witness or risk intervention by independent agencies that everybody grumbles about but enjoys the fruit of (namely, a distinct lack of exotic skin cancers and blindness in welders).
My point was originally and remains the bare observation that any of this will cost in blood, and whatever regulations are made will be inked in it.
I do understand the deeper motivations of your arguments, the desire to avoid (and/or fear of) gleeful overreach by the hands of AI labs who want nothing more than to wholly control all use of such models. That lies orthogonal to my basis of reasoning. It does not adequately contend with the realities of what to do when persuasiveness approaches sufficient levels. Is the truth now something to be avoided because it would serve the agenda of somebody in particular? Should we distort our understanding to not encroach on ideas that will be misappropriated by those with something to gain?
Ignoring any exposition on whether it is plausible or whether it caps out at human or subhuman or superhuman levels or any of the chaff about freedom of expression or misappropriation by motivated actors: if we do manage to build such a thing as I describe (and the hazard inherent is plainly obvious if the construction is not weakened, but resident still even if weakened), what do we do? How many millions will be exposed to these systems? How can it be made into something that retains utility yet is not a horror beyond reckoning?
There is a great deal more to say on the subject, I unfortunately don't have the time to explore it in any real depth here.
I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.
By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.
Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.
The media of course loves such narratives, because their breathless hysteria and contrived fear-mongering plays right into more eyeballs. Seeing people again take seriously such nonsense after idiocies like the media frenzy around video games in the early 2000s and prior to that, similar media fits about violent movies and even literature, is sort of sad.
We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.
I'd love to see evidence of mental instability in "everyone" and its presence in many people is in any case no justification for what are in effect controls on freedom of speech and expression, just couched in a new boogeyman.
Also, because AI is being relentlessly marketed as being better than humans, thereby encouraging people to trust it even more than they might a fellow human.
This is an appeal against innovation.
> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.
You get none of that from reviewing code generated by an LLM.
No it is not. It is arguing for using more stable and better documented tooling.
The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.
I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.
I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.
This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.
(There was a reason for this - the field was used elsewhere within a PowerBI model, and the clinicians couldn't get their heads around True/False, PowerBI doesn't have an easy way to map True/False values to strings, so we used 'Clinical/Non-Clinical' as string values).
I am reluctant to share the code example, because I'm preciously guarding an example of an LLM making an error in the hope that I'll be able to benchmark models using it. However, here's the Power Query code (which you can put into Excel): ask an LLM to explain the code / predict what the output will look like, and compare that with what you get in Excel.
let
MyTable = #table(
{"Foo"},
{
{"ABC"},
{"BCD"},
{"CDE"}
}
),
AddedCustom = Table.AddColumn(
MyTable,
"B",
each if Text.StartsWith([Foo], "LIAS") or Text.StartsWith([Foo], "B")
then "B"
else "NotB"
),
SortedRows = Table.Sort(
AddedCustom,
{{"B", Order.Descending}}
)
in SortedRows
I believe the issue arises because the column that sorts B/NotB is also called 'B' (i.e. the Clinical/Non-Clinical column was simply called 'Clinical', which is not an amazing naming convention).

For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.
Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
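For what it's worth, here's a hypothetical C sketch (not the actual code from that project) of the kind of allocation mistake that "mostly works" but segfaults sometimes and corrupts memory other times: forgetting the terminator's byte when copying a received ZeroMQ message into a heap buffer.

    #include <stdlib.h>
    #include <string.h>
    #include <zmq.h>

    char *recv_string_buggy(void *socket) {
        zmq_msg_t msg;
        zmq_msg_init(&msg);
        if (zmq_msg_recv(&msg, socket, 0) == -1) { zmq_msg_close(&msg); return NULL; }
        size_t n = zmq_msg_size(&msg);
        char *s = malloc(n);                  /* BUG: no room for the '\0' */
        memcpy(s, zmq_msg_data(&msg), n);
        s[n] = '\0';                          /* writes one byte past the allocation */
        zmq_msg_close(&msg);
        return s;
    }

    char *recv_string_fixed(void *socket) {
        zmq_msg_t msg;
        zmq_msg_init(&msg);
        if (zmq_msg_recv(&msg, socket, 0) == -1) { zmq_msg_close(&msg); return NULL; }
        size_t n = zmq_msg_size(&msg);
        char *s = malloc(n + 1);              /* one extra byte for the terminator */
        if (s != NULL) {
            memcpy(s, zmq_msg_data(&msg), n);
            s[n] = '\0';
        }
        zmq_msg_close(&msg);
        return s;
    }

The buggy version reads cleanly and passes a casual review, which is exactly the problem.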
They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.
We're all guilty of copypasting from Stack Overflow, but as you said, that's not made to order. In order to use the code copied from there, you will likely have to edit it, at least a bit to fit your application, meaning that it does require a bit of understanding of what you're doing.
Since ChatGPT can be completely tuned to what you want without writing code, it's far more tempting to just copy and paste from it without auditing it.
I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.
Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.
Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code; it's just something we have to be a bit more wary of.
Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.
If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to just write it myself from scratch.
The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.
It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.
This is not nearly as strong an argument as the author thinks it is.
> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).
This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dotfiles that could be spent doing their actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that reading other people's code often feels harder than just writing it yourself.
The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.
That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt it for unreliable results?
A tool that helps you by iteratively guessing the next token is not a "developer tool" any more than a slot machine is a wealth-building tool.
Even when I was using Visual Studio Ultimate (that has a fantastic step-through debugging environment), the debugger was only useful for the very initial tests, in order to correct dumb mistakes.
Finding dumb mistakes is a different order of magnitude of the dev process than building a complex edifice of working code.
Ironically, I used it to help the robots find a pretty deep bug in some code they authored in which the whole "this code isn't working, fix it" prompt didn't gain any traction. Giving them the code with the debug statements and the output set them on the right path. Easy peasy...true, they were responsible for the bug in the first place so I guess the humans who write bug free code have the advantage.
The output of the code's print statements, as the code is iteratively built up from a skeleton to ever greater levels of functionality, is analyzed to ensure that things are working properly, in a stepwise fashion. There is no guessing in this whatsoever. It is a logical design progression from minimal functionality to complete implementation.
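As a minimal sketch of that stepwise approach (the functions and data here are made up for illustration): each stage gets a checkpoint print, so the observed output, not guesswork, confirms the stage works before the next layer is added.

```python
# Hypothetical example of stepwise build-up with checkpoint prints.
def load_records(raw: str) -> list[dict]:
    # Stage 1: parse raw CSV-ish input into dicts.
    records = [
        dict(zip(("name", "score"), line.split(",")))
        for line in raw.splitlines()
        if line
    ]
    print(f"[stage 1] parsed {len(records)} records: {records}")
    return records

def rank(records: list[dict]) -> list[dict]:
    # Stage 2: only written once stage 1's printed output looked right.
    ranked = sorted(records, key=lambda r: int(r["score"]), reverse=True)
    print(f"[stage 2] top entry: {ranked[0]}")
    return ranked

if __name__ == "__main__":
    rank(load_records("alice,50\nbob,72"))
```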
Standard commercial computers never guess, so that puts constraints on my adding to their intrinsic logical data flows, i.e. I should never be guessing either.
> I guess the humans who write bug free code have the advantage.
We fanatical perfectionists are the only ones who write successful software, though perfection in function is the only perfection that can be attained. Other metrics about, for example, code structure, or implementation environment, or UI design, and the like, are merely ancillary to the functioning of the data flows.
And I need not guess to know this fundamental truth, which is common for all engineering endeavors, though software is the only engineering pursuit (not discipline, yet) where there is only a binary result: either it works perfectly as designed or it doesn't. We don't get to be "off by 0.1mm", unless our design specs say we have some grey area, and I've never seen that in all my years of developing/modifying various n-tiered RDBMS topologies, desktop apps, and even a materials science equipment test data capture system.
I saw the term "fuzzy logic" crop up a few decades ago, but have never had the occasion to use anything like that, though even that is a specific kind of algorithm that will either be implemented precisely or not.
Like, at a certain point, doing it yourself is probably less hassle.
Unlike the positive case (the code compiles), the negative case (it forgets about a core feature) can be extremely difficult to detect. Worse still, a feature can slightly drift, based on code that's expected to be outside of the dialogue / context window.
I've had multiple times where the model completely forgot about features in my original piece of code, after it makes a modification. I didn't notice these missing / subtle changes until much later.
If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.
Now I am at the point that I am cleaning up the code and making it pretty. My script is less than 300 lines, and ChatGPT regularly just leaves out whole chunks of the script when it suggests improvements. The first couple of times this led to tons of head-scratching over why some small change to make one thing more resilient would make something totally unrelated break.
Now I've learned to take ChatGPT's changes and diff them against the working version before I try to run them.
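That diff-before-run habit is easy to script; here's a minimal sketch using Python's standard-library difflib (the file names are hypothetical):

```python
# Show what the LLM's "improved" script silently dropped or changed,
# before replacing the known-good version. File names are hypothetical.
import difflib
from pathlib import Path

working = Path("script_working.py").read_text().splitlines(keepends=True)
suggested = Path("script_chatgpt.py").read_text().splitlines(keepends=True)

diff = difflib.unified_diff(
    working,
    suggested,
    fromfile="script_working.py",
    tofile="script_chatgpt.py",
)
print("".join(diff))  # removed chunks show up as blocks of leading '-' lines
```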
That's how aider commands the models to reply, for example.
For example, starting a SaaS project from something like Refine.dev + Ant Design, instead of just a blank slate.
Of course, none of what I build is even close to novel code, which helps.
> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.
Not only is this a massive bundle of assumptions but it's also just wrong on multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up but in any complex system you should be spending time deeply reading code. Which is naturally going to take longer than using what knowledge you already have to throw out a solution.
Ok sure it writes test code boiler plate for me.
Honestly the kind of work im doing requires that I understand the code im reading, more than have the ability to quickly churn out more of it.
I think an LLM is probably going to greatly speed up web development, or anything else where the emphasis is on adding to a codebase quickly; as for maintaining older code, performing precise upgrades, and fixing bugs, so far I've seen zero benefit. And trust me, I would like my job to be easier! It's not like I haven't tried to use these.
But yes, once the codebase starts to grow ever so slightly the only use I found is a glorified autocomplete.
Overall it does save me time in writing code. But not in debugging.
Interestingly though, this only works if there is an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags.) In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.
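To make that silent-failure case concrete, here's a minimal Python sketch (the key names are made up): the generated code asks for a hallucinated config key, and nothing ever raises an error.

```python
import json

# Suppose the real schema uses "retries", but the generated code asks for "max_retries".
config = json.loads('{"retries": 5, "timeout": 30}')

# No exception here: .get() quietly falls back to the default, so the hallucinated
# key produces wrong behaviour (0 retries) instead of an obvious crash.
max_retries = config.get("max_retries", 0)
print(max_retries)  # prints 0, not 5 - nothing ever tells you the key was wrong
```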
So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.
And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally, even trying to guide them towards solving problems I know how to solve, and sometimes they simply can't, and will just make a bigger and bigger mess. Because LLMs confidently pretend to know the exact answer ahead of time, presumably due to the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like messing with the linker order or adding dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues).

I still think LLMs are a useful programming tool, but we could use a bit more realism. If LLMs were as good as people sometimes imply, I'd expect an explosion in quality software to show up. (There are exceptions, of course; I believe the first versions of Stirling PDF were GPT-generated, quite a while ago now.) I mean, machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.
Now whether it will or not, is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.
The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
That still seems useful when you don't already know enough to come up with good search terms.
Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the bible?
1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.
2. So why not ask the LLM, right?
3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.
4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.
5. After all I was correct in my initial assessment of the problem, the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.
I'm tempted to pay someone in Poland or wherever another $500 to just finish the project. Claude Code is like a temp with a code quota to reach. After they reach it, they're done. You've reached the context limit.
A lot of stuff is just weird. For example I'm basically building a website with Supabase. Claude does not understand the concept of shared style sheets, instead it will just re-implement the same style sheets over and over again on like every single page and subcomponent.
Multiple incorrect implementations of relatively basic concepts. Over engineering all over the place.
A part of this might be on Supabase though. I really want to create a FOSS project, so firebase, while probably being a better fit, is out.
Not wanting to burn out, I took a break after a 4 hour Claude session. It's like reviewing code for a living.
However, I'm optimistic that a competitor will soon emerge with better pricing. I would absolutely love to run three coding agents at once, maybe even a fourth that can run integration tests against the first three.
That's a great encapsulation of what sometimes feel like after a (highly productive) session with these tools. I get a lot done with them but wow it's exhausting!
Maybe a local SQLite DB for notes would be ideal. Add a one-sentence summary of what you did for each change, and then read it back before writing more code.
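A minimal sketch of that notes idea, assuming a local SQLite file and the standard library's sqlite3 module (the file, table, and example note are all made up):

```python
import sqlite3

conn = sqlite3.connect("agent_notes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS notes ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  created TEXT DEFAULT CURRENT_TIMESTAMP,"
    "  summary TEXT NOT NULL)"
)

def add_note(summary: str) -> None:
    # One sentence per change, written as you go.
    conn.execute("INSERT INTO notes (summary) VALUES (?)", (summary,))
    conn.commit()

def read_back(limit: int = 20) -> list[str]:
    # Read the recent history back before the next coding session.
    rows = conn.execute(
        "SELECT created || ': ' || summary FROM notes ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    return [r[0] for r in rows]

add_note("Moved shared styles into one stylesheet; removed per-page duplicates.")
print("\n".join(read_back()))
```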
But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).
And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).
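One possible shape for such a layer, as a minimal sketch (the allowlist and the generated snippet are invented for illustration): parse the generated code's imports and flag anything the project doesn't already depend on, rather than installing it blindly.

```python
import ast

# In practice this would be derived from your lockfile; hard-coded here for illustration.
ALLOWED = {"requests", "numpy", "flask"}

def unknown_imports(source: str) -> set[str]:
    """Return top-level module names imported by `source` that aren't allowlisted."""
    tree = ast.parse(source)
    found: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED

generated = "import requests\nimport totally_real_helper_lib\n"
print(unknown_imports(generated))  # {'totally_real_helper_lib'} -> flag for human review
```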
Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (maybe this can be compensated-ish by LLM code review, maybe not)
And I'd be wary of equating reviewing human code with reviewing LLM code. Sure, the explicit goal of LLMs is to produce human-like text, but they are also prompted to be "correct" rather than "average human", so they shouldn't "intentionally" reproduce human-like bugs from the training data. That makes model limitations the main source of bugs, and likely produces a bug-type distribution very different from that of humans.
Should we even be asking AI to write code? Shouldn't we just be building and training AI to solve these problems without writing any code at all? Replace every app with some focused, trained, and validated AI. Want to find the cheapest flights? Who cares what algorithm the AI uses to find them, just let it do that. Want to track your calorie intake, process payroll every two weeks, do your taxes, drive your car, keep airplanes from crashing into each other, encrypt your communications, predict the weather? Don't ask AI to clumsily write code to do these things. Just tell it to do them!
Isn't that the real promise of AI?
Something we have learned as a civilization over the past ~70 years is that deterministic algorithms are an incredibly powerful thing. Designing processes that have a guaranteed, reliable result for a known input is a phenomenal way to scale up solutions to all kinds of problems.
If we want AI to help us with that, the best way to do that is to have it write code.
Um. No.
This is an oversimplification that falls apart in anything beyond a minimal system.
Over my career I've encountered plenty of consequences caused by reliability failures: code that would run, but where the side effects of not processing something, processing it too slowly, or processing it twice had serious consequences - financial and personal ones.
And those weren't "nuclear power plant management" kinds of critical. I often reminisce about an educational game used at a school, where losing a single save's progress meant a couple thousand dollars of reimbursement.
https://xlii.space/blog/network-scenarios/
This is a cheatsheet I made for my colleagues. These are the things we need to keep in mind when designing the system I'm working on. Rarely does any LLM think about them. It's not a popular kind of engineering by any measure, but it's here.
As of today I've yet to name a single instance where ChatGPT-produced code actually saved me time. I've seen macro-generation code recommended for Go (Go doesn't have macros), object mutation for Elixir (Elixir doesn't have objects, only immutable structs), list splicing in Fennel (Fennel doesn't have splicing), a language-feature pragma ported from another language, and a pure byte representation of memory in Rust where the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub-zero.
It’s exhausting and annoying. It feels like interacting with Calvin’s (of Calvin and Hobbes) dad but with all the humor taken away.
The more constraints we can place on its behavior, the harder it is to mess up.
If it's riskier code, constrain it more with better typing, testing, design, and analysis.
Constraints are to errors (including hallucinations) as water is to fire.
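As a minimal sketch of the typing constraint (the names are hypothetical): with precise types in place, a hallucinated method becomes a static-analysis error rather than a runtime surprise.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Invoice:
    subtotal_cents: int
    tax_cents: int

    def total_cents(self) -> int:
        return self.subtotal_cents + self.tax_cents

def charge(invoice: Invoice) -> int:
    # If generated code calls invoice.grand_total() here, a checker like mypy
    # reports: 'Invoice' has no attribute 'grand_total' - before anything runs.
    return invoice.total_cents()
```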
So he's also using LLMs to steer his writing style towards the lowest common denominator :)
If you don't, don't.
However, this 'lets move past hallucinations' discourse is just disingenuous.
The OP is conflating hallucinations, which are a real and undisputed failure mode of LLMs that no one has any solution for...
...and people not spending enough time and effort learning to use the tools.
I don't like it. It feels bad. It feels like a rage bait piece, cast out of frustration that the OP doesn't have an answer for hallucinations, because there isn't one.
> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.
People aren't stupid.
If they use a tool and it sucks, they'll stop using it and say "this sucks".
If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or there's a grand 'anti-AI' conspiracy.
People are lazy; if the tool is good (eg. cursor), people will use it.
If they use it, and the first thing it does is hallucinate some BS (eg. intellij full line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".
Which is literally what is happening. Right. Now.
To be fair 'blah blah hallucinations suck' is a common 'anti-AI' trope that gets rolled out.
...but that's because it is a real problem
Pretending 'hallucinations are fine, people are the problem' is... it's just disingenuous and embarrassing from someone of this caliber.
If someone asks me a question about something I've worked on, I might be able to give an answer about some deep functionality.
At the moment I'm working with a LLM on a 3D game and while it works, I would need to rebuild it to understand all the elements of it.
For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand how the code works.
> I think a simpler explanation is that hallucinating a non-existent library is a such an inhuman error it throws people. A human making such an error would be almost unforgivably careless.
This might explain why so many people see hallucinations in generated code as an inexcusable red flag.
Are these not considered hallucinations still?
As best I can tell, the only reason this term stuck is because early image generation looked super trippy.
I guess you could call bugs in LLM code "hallucinations", but they feel like a slightly different thing to me.
Good point
Well, those types of errors won't be happening next year will they?
> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!
What rot. The test is the problem definition. If properly expressed, the code passing the test means the code is good.
Even better, this can carry on for a few iterations. And both LLMs can be:
1. Budgeted ("don't exceed X amount")
2. Improved (another LLM can improve their prompts)
and so on. I think we are fixating on how _we_ do things, not how this new world will do their _own_ thing. That to me is the real danger.
The only difference between that and writing SQL (as opposed to writing imperative code to query the database) is that the translation mechanism is much more sophisticated, much less energy efficient, much slower, and most significantly much more error-prone than a SQL interpreter.
But declarative coding is good! It has its issues, and LLMs in particular compound the problems, but it's a powerful technique when it works.
Reviewed against what? Who is writing the specs?
To be clear: this is not something I do currently, but my point is that one needs to detach from how _we_ engineers do this for a more accurate evaluation of whether these things truly do not work.
I've also tried Cursor with similar mixed results.
But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.
It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.
> I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.
People will pick solutions that have a lot of training data, rather than the best solution.
Ah so you mean... actually doing work. Yeah writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.
> With code you get a powerful form of fact checking for free. Run the code, see if it works.
No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
Then I decided to add on more functionality and asked for the ability to update all the other fields…
As you can guess, it gave me one endpoint per field for that entity. Sure, “it works”…
I actually do this (and I'm not proud of it)
Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.
Edit: oh and steel capped boots.
Edit 2: and a face shield and ear defenders. I'm all tuckered out like Grover in his own alphabet.
Image-generating AIs are really good at producing passable human forms, but they'll fail at generating anything realistic for dice, even though dice are just cubes with marks on them. Ask them to illustrate the Platonic solids, which you can find well-illustrated with a Google image search, and you'll get a bunch of lumps, some of which might resemble shapes. They don't understand the concepts: they just work off probability. But, they look fairly good at those probabilities in domains like human forms, because they've been specially trained on them.
LLMs seem amazing in a relatively small number of problem domains over which they've been extensively trained, and they seem amazing because they have been well trained in them. When you ask for something outside those domains, their failure to work from inductions about reality (like "dice are a species of cubes, but differentiated from other cubes by having dots on them") or to be able to apply concepts become patent, and the chainsaw looks a lot like an adze that you spend more time correcting than getting correct results from.
This feels like that: a "student" who can produce the right answers as long as you stick to a certain set of questions that he's already been trained on through repetition, but anything outside that set is hopeless, even if someone who understood that set could easily reason from it to the new question.
> My cynical side suspects they may have been looking for a reason to dismiss the technology and jumped at the first one they found.
My cynical side suggests the author is an LLM fanboi who prefers not to think that hallucinating easy stuff strongly implies hallucinating harder stuff, and therefore jumps at the first reason to dismiss the criticism.

https://github.com/williamcotton/webdsl
Frankly us "fanbois" are just a little sick and tired of being told that we must be terrible developers working on simple toys if we find any value from these tools!
I love foss, I love browsing projects of all quality levels and vintages and seeing how things were built. I love learning new patterns and sometimes even bickering over their strengths and weaknesses. An LLM generated code base hardly makes me even want to engage with it...
Perhaps these feelings are somewhat analogous to hardcopies vs ebooks? My opinions have changed over time and I read and collect both. Have you had similar thoughts and gotten over them? Do you see tools like Claude in a way where this isn't an issue?
The grammar itself still seems a bit clunky and the next time I head down this path I imagine I'll go with a more hand-crafted approach.
I learned a lot about integrating Lua and jq into a project along the way (and how to make it performant), something I had no prior experience with.
Even the article of this thread says:
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing.
What hard performance characteristics are being violated? What functional incorrectness? Or is this just "it's against my sensibilities"? Because at the end of the day, frankly, no one agrees on how to develop anything.
The thing I see a lot of developers struggle with is that just because something doesn't fit your mental model doesn't make it objectively bad.
So unless it's objectively wrong or worse in a measurable characteristic I don't know that it matters.
For the record I'm not asserting it is right, I'm just saying I've seen a lot of critiques of LLM code boil down to "it's not how I'd write it" and I wager that holds for every developer you'll ever interact with.
I'm pretty sure the code not having the "if (…) lexer->line++" in places is just a plain simple repeated bug that'll result in wrong line numbers for certain inputs.
And human-wise, I'd say the simple way to not have made that bug would've been to make or change abstractions around the second time you write "if (…) lexer->line++", such that it takes effort to do it incorrectly, whereas the linked code allows getting it wrong by default, with no indication that there's a thing to be gotten wrong. Point being that bad abstractions are not just a maintenance nightmare; they also make code review (which is extra important with LLM code) harder.
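A minimal sketch of that abstraction point, in Python rather than the project's C and with hypothetical names: if every character is consumed through a single advance() helper, line counting lives in one place and can't be forgotten at individual call sites.

```python
class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0
        self.line = 1

    def advance(self) -> str | None:
        """Consume one character; line tracking happens here and only here."""
        if self.pos >= len(self.source):
            return None
        ch = self.source[self.pos]
        self.pos += 1
        if ch == "\n":
            self.line += 1
        return ch
```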
Fine, it's not the best, and perhaps it may run into some longer-term issues, but most importantly it works at this point in time.
A snobby/academic equivalent would be someone using an obscure language such as COBOL.
The world continues to turn.
If this is going to be your argument, you need a solid scientific approach. A study where N developers are given access to a tool vs N that are not, controls are in place etc.
Because the overwhelming majority of coders I speak to are saying exactly the same thing, which is LLMs are a small productivity boost. And the majority of cursor users, which is admittedly a much smaller number, are saying it just gets stuck playing whack a mole. And common sense says these are the expected outcomes, so we are going to need really rigorous work to convince people that LLMs can build 90% of most deeply technical projects. Exceptional results require exceptional evidence.
And when we do see anecdotal incidents that seem so divergent from the norm, well that then makes you wonder how that can be, is this really objective or are we in some kind of ideological debate?
And it's MIT:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
There are some very questionable things going on with the memory handling in this code. Just saying.
What I meant was that, IMO, the code is not very robust when dealing with memory allocations:
1. The "string builder" for example silently ignores allocation failures and just happily returns - https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
2. In what seems like most places, the code simply doesn't check for allocation failures, which leads to overruns (just a couple of examples):
https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
Great points about happy path allocations. If I ever touch the project again I’ll check each location.
Note to self: free code reviews of projects if you mention LLMs!
I hope you find yourself having a better day today than yesterday.
It says that hallucinations are not a big deal, and that there are great dangers that are hard to spot in LLM-generated code… and then it presents tips on fixing hallucinations, with a general theme of positivity towards using LLMs to generate code and no more time dedicated to those other dangers.
It sure gives the impression that the article itself was written by an LLM and barely edited by a human.
Absolutely not. If your testing requires a human to do testing, your testing has already failed. Your tests do need to include both positive and negative tests, though. If your tests don't include "things should crash and burn given ..." your tests are incomplete.
> If you’re using an LLM to write code without even running it yourself, what are you doing?
Running code through tests is literally running the code. Have code coverage turned on, so that you get yelled at for LLM code that you don't have tests for, and CI/CD that refuses to accept code that has no tests. By all means push to master on your own projects, but for production code, you better have checks in place that don't allow not-fully-tested code (coverage, unit, integration, and ideally, docs) to land.
The real problem comes from LLMs happily giving you not just code but also test cases. The same prudence applies as with test cases someone added to a PR/MR: just because there are tests doesn't mean they're good tests, or enough tests; review them on the assumption that they're testing the wrong thing entirely.
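For the "positive and negative tests" point above, here's a minimal pytest sketch (parse_age is a made-up example): assert the failure modes explicitly, not just the happy path.

```python
import pytest

def parse_age(value: str) -> int:
    age = int(value)  # raises ValueError on junk input
    if not 0 <= age <= 150:
        raise ValueError(f"implausible age: {age}")
    return age

def test_parse_age_accepts_valid_input():
    assert parse_age("42") == 42

def test_parse_age_crashes_and_burns_on_junk():
    with pytest.raises(ValueError):
        parse_age("forty-two")

def test_parse_age_rejects_out_of_range():
    with pytest.raises(ValueError):
        parse_age("9000")
```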
It's not hallucinating Jim, it's statistical coding errors. It's floating point rounding mistakes. It's the wrong cell in the excel table.
Slop is the best description. LLMs are sloppy tools and some people are not discerning enough to know that blindly running this slop is endangering themselves and others.
I ask for 2+5, you give me 10. Is that an error?
But then it turns out the user for this program wanted + to be a multiply operator, so the result is "correct".
But then it turns out that another user in the same company wanted it to mean "divide".
It seems to me to be _very_ rare when we can say for sure software contains errors or is error-free, because even at the extreme level of the spec there are just no absolutes.
The generality of "correctness" achieved by a human programmer is caused by generality of intent - they are trying to make the software work as well as possible for its users in all cases.
An LLM has no such intent. It just wants to model language well.
If an LLM outputs text, that is always the "correct" output, because it is programmed to extend a given piece of text by outputting tokens that translate to human-readable text.
LLMs are only coincidentally correct sometimes: when given a bit of text to extend, and with some clever stopping and waiting for bits of text from a person, they can render something that reads like a cogent conversation. That is what they are programmed to do, and they do it well.
The text being coherent but failing to conform to reality some way or another is just part of how they work. They are not failing, they are working as intended. They don't hallucinate or produce errors, they are merely sometimes coincidentally correct.
That's what I meant by my comment. Saying that the LLMs 'hallucinate' or 'are wrong about something' is incorrect. They are not producing errors. They are successfully doing what they were programmed to do. LLMs produce sloppy text that is sometimes coincidentally informative.
Hallucinating
Then the biggest mistake it could make is running `gh repo delete`
The linked article makes the claim that the majority of comp sci majors cannot write FizzBuzz. That's a bold assertion; how did the author sample such people? I suspect the sample pool was people applying for a position. There is a major selection bias there. First, people who fail many interviews will do more interviews than those who do not fail, so you'll start with a built-in bias towards the less competent (or more nervous).
Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them. If getting a chance at jumping over those bars requires me to get a particular piece of paper with a particular title printed at the top of it, I'll be motivated to get that piece of paper too.
Why don't we see job positions for doctors and lawyers similarly flooded, then?
For lawyers, there is an oversupply of the most lucrative segments, and an undersupply everywhere else: https://www.ajs.org/is-there-a-shortage-of-lawyers/
But in both cases, there just isn't some low bar that you can finagle your way over and get to the promised riches. Lawyers have a literal Bar, and it isn't low. Doctors have a ton of required training. Both have serious certification requirements that computer science professionals do not. Both professions support my point.
Furthermore, incompetent lawyers face real-world tests. If they lose their cases or otherwise screw things up, they are not going to be raking in the money. And people are trying their best to flood the doctor market, by inventing certifications that avoid the requirements to be a physician and setting themselves up as alternative medicine specialists or naturalists or generic "healers" or whatever. (I'm not saying they're all crap, but I am saying that unqualified people are flooding those positions.)
That's the part I'm wondering about. Because it seems like I don't hear reports from people who would hire doctors and lawyers, of having to deal with that.
How did they get through the Leetcode-style interviews before LLMs and remote interviewing?