I describe it as more of a “mashup”, like an interpolation of statistically related output that was in the training data.
The thinking was in the minds of the people that created the tons of content used for training, and from the view of information theory there is enough redundancy in the content to recover much of the intent statistically. But some intent is harder to extract just from example content.
So when generating statistically similar output, the statistical model can miss the hidden rules that were part of the thinking that went into the content used for training.
Makes sense. Hidden rules such as, "recommending a package works only if I know the package actually exists and I’m at least somewhat familiar with it."
Now that I think about it, this is pretty similar to cargo-culting.
LLMs are just so happy to generate enough tokens that look right-ish. They need so many examples driven into them during training.
The map is not the territory, and we’re training them on the map of our codified outputs. They don’t actually have to survive. They’re pretty amazing, but of course they’re absolutely not doing what we do, because success for us and for them looks so different. We need to survive.
(Please can we not have one that really wants to survive.)
I'm still a fan of the standard term "lying." Intent, or a lack thereof, doesn't matter. It's still a lie.
If someone told you it's Thursday when it's really Wednesday, we would not necessarily say they lied. We would say they were mistaken, if the intent was to tell you the correct day of the week. If they intended to mislead you, then we would say they lied.
So intent does matter. AI isn't lying, it intends to provide you with accurate information.
Maybe we should call the output "synthetic lies" to distinguish it from the natural lies produced by humans?
Summary from Wikipedia: https://en.m.wikipedia.org/wiki/Bullshit
> statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth
It's a perfect fit for how LLMs treat "truth": they don't know, so they can't care.
Jake: You lied to me.
Elwood: Wasn't lies, it was just... bullshit.
Why are we making excuses for machines?
Because the OP's name seems way more descriptive and easier to generalize.
Personally, I'd be scared if LLMs were proven to be deliberately deceptive, but I think they currently fall in the two latter camps, if we're doing human analogies.
Did the answers strike you as deceptive?
Oh yeah that's exactly what I want from a machine intelligence, a "best friend who knows everything about me," is that they just make shit up that they think I'd like to hear. I'd really love a personal assistant that gets me and my date a reservation at a restaurant that doesn't exist. That'll really spice up the evening.
The mental gymnastics involved in the AI community are truly pushing the boundaries of parody at this point. If your machines mainly generate bullshit, they cannot be serious products. If on the other hand they're intelligent, why do they make up so much shit? You just can't have this both ways and expect to be taken seriously.
Once you figure out how to do that they're absurdly useful.
Maybe a good analogy here is working with animals? Guide dogs, sniffer dogs, falconry... all cases where you can get great results but you have to learn how to work with a very unpredictable partner.
Name literally any other technology that works this way.
> Guide dogs, sniffer dogs, falconry...
Guide dogs are an imperfect solution to an actual problem: some people's inability to see. And dogs respond to training far more reliably than LLMs respond to prompts.
Sniffer dogs are at least in part bullshit: many studies have shown they respond to the subtle cues of their handlers far more reliably than to anything they actually smell. And the best part is that they also (completely outside their own control, mind you) ruin lives when they falsely detect drugs in cars that look the way the officer handling them thinks a car with drugs inside looks.
And falconry is a hobby.
Since you don't like my animal examples, how about power tools? Chainsaws, table saws, lathes... all examples of tools where you have to learn how to use them before they'll be useful to you.
(My inability to come up with an analogy you find convincing shouldn't invalidate my claim that "LLMs are unreliable technology that is still useful if you learn how to work with it" - maybe this is the first time that's ever been true for an unreliable technology, though I find that doubtful.)
The internet for one.
Not the internet itself (although it certainly can be unreliable), but rather the information on it.
Which I think is more relevant to the argument anyway, as LLMs do in fact reliably function exactly the way they were built to.
Information on the internet is inherently unreliable. It’s only when you consider externalities (like reputation of source) that its information can then be made “reliable”.
Information that comes out of LLMs is inherently unreliable. It’s only through externalities (such as online research) that its information can be made reliable.
Unless you can invent a truth machine that somehow can tell truth from fiction, I don’t see either of these things becoming reliable, stand-alone sources of information.
How about people? They make mistakes all the time, disobey instructions, don’t show up to work, occasionally attempt to embezzle or sabotage their employers. Yet we manage to build huge successful companies out of them.
Probabilistic prime number tests.
I'm being slightly facetious. Such tests differ from LLMs in the crucial respect that we can quantify their probability of failure. And personally I'm quite skeptical of LLMs myself. Nevertheless, there are techniques that can help us use unreliable tools in reliable ways.
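For what it's worth, here is a minimal Miller-Rabin sketch in Python (my own illustration, not something from the thread): each random witness either proves the number composite or lets an odd composite slip through with probability at most 1/4, so after k independent rounds the failure probability is bounded by 4^-k. That is the kind of quantified-unreliability guarantee LLM output doesn't come with.

    import random

    def is_probable_prime(n: int, rounds: int = 20) -> bool:
        # Miller-Rabin: False means definitely composite;
        # True means prime, except with probability <= 4**-rounds.
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:          # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False       # a witnesses that n is composite
        return True

    print(is_probable_prime(2**127 - 1))  # a known Mersenne prime -> True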
I have read some posts of yours advancing that, but I never found the ones with the details: do you mean more "prompt engineering", or "application selection", or "system integration"...?
[1] I built https://tools.simonwillison.net/hacker-news-thread-export this morning from my phone using that trick: https://claude.ai/share/7d0de887-5ff8-4b8c-90b1-b5d4d4ca9b84
[2] Examples of that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#b...
[3] https://simonwillison.net/2024/Sep/25/o1-preview-llm/ is an early example of using a "reasoning" model for that
Or if you meant "what do you have to figure out to use them effectively despite their flaws?", that's a huge topic. It's mostly about building a deep intuition for what they can and cannot help with, then figuring out how to prompt them (including managing their context of inputs) to get good results. The most I've written about that is probably this piece: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
For documentation answering the newer long context models are wildly effective in my experience. You can dump a million tokens (easily a full codebase or two for most projects) into Gemini 2.5 Pro and get great answers to almost anything.
There are some new anonymous preview models with 1m token limits floating around right now which I suspect may be upcoming OpenAI models. https://openrouter.ai/openrouter/optimus-alpha
I actually use LLMs for command line arguments for tools like ffmpeg all the time, I built a plugin for that: https://simonwillison.net/2024/Mar/26/llm-cmd/
But the use of randomness inside the system should not, in theory, prevent as-if-full reliability - which suggests the architecture could be unfinished, as I expressed with the example of RAG. (E.g.: well-trained natural minds run check systems over provisional output, however obtained.)
> newer long context models
Practical question: if the query-contextual documentation needs to be part of the input (I am not aware of a more efficient way), does not that massively impact the processing time? Suppose you have to examine interactively the content of a Standard Hefty Document of 1MB of text... If so, that would make local LLM use prohibitive.
What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder.
We can compensate for bad software architecture when we understand the code details deeply and make indirect couplings in the code. When we don't understand the code deeply, we need to compensate with good architecture.
That means thinking about code in terms of interfaces, stores, procedures, behaviours, actors, permissions and competences (what the actors should do, how they should behave and the scope of action they should be limited to).
Then these details should reflect directly in the prompts. See how hard it is to make this process agentic, because you need user input in the agent's inner workings.
And after running these prompts and, with luck, successfully extracting functioning components, you are the one who has to put those components together to make a working system.
Except that ladder is built on hallucinated rungs. Coding can be delegated to humans. Coding cannot be delegated to AI, LLM or ML because they are not real nor are they reliable.
An LLM is like a developer without internet or docs access, who needs to write code on paper. Every developer would hallucinate in that environment. It's a miracle that an LLM does so much with such a limited environment.
That's way more advanced than just solving coding interview questions whose solutions could simply be added to the dataset.
You need first to believe there is value in adding AI to your workflow. Then you need to search and find ways to have it add value to you. But you are ultimately the one that understands what value really is and who has to put effort into making AI valuable.
Vim won't make you a better developer just as much as LLMs won't code for you. But they can both be invaluable if you know how to wield them.
[0] https://developers.google.com/workspace/gmail/api/quickstart...
I’m sure you’re finding some use for it.
I can’t wait for when the LLM providers start including ads in the answers to help pay back all that VC money currently being burned.
Both Facebook and Google won by being patient before including ads. MySpace and Yahoo were both riddled with ads early and lost. It will be interesting to see who blinks first. My money is on Microsoft, who added ads to Solitaire of all things.
You can use AI to assist you with lower level coding, maybe coming up with multiple prototypes for a given component, maybe quickly refactoring some interfaces and see if they fit your mental model better.
But if you want AI to make your life easier I think you will have a hard time. AI should be just another tool in your toolbelt to make you more productive when implementing stuff.
So my question is, why do you expect LLMs to be 100% accurate to have any value? Shouldn't developers do their work and integrate LLMs to speed up some steps in coding process, but still taking ownership of the process?
Remember, there is no free lunch.
You’re not abstracting if you are generating code that you have to verify/fret about. You’re at exactly the same level as before.
Garbage collection is an abstraction. AI-generated C code that uses manual memory management isn’t.
100%. I like to say that we went from building a Millennium Falcon out of individual LEGO pieces, to instead building an entire LEGO planet made of Falcon-like objects. We’re still building, the pieces are just larger :)
I really wonder what the solution is.
Has there been any work on limiting the permissions of modules? E.g. by default a third-party module can't access disk or network or various system calls or shell functions or use tools like Python's "inspect" to access data outside what is passed to them? Unless you explicitly pass permissions in your import statement or something?
Components can't do any IO or interfere with any other components in an application except through interfaces explicitly given to them. So you could, e.g., have a semi-untrusted image compression component composed with the rest of your app, and not have to worry that it's going to exfiltrate user data.
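A rough sketch of that idea in Python (hypothetical names; Python alone can't enforce it, since the component could still import socket on its own, which is why object-capability systems exist): the application hands the component only the narrow interfaces it needs, and nothing else.

    from typing import Protocol

    class ByteSource(Protocol):
        def read(self) -> bytes: ...

    class ByteSink(Protocol):
        def write(self, data: bytes) -> None: ...

    def compress_images(source: ByteSource, sink: ByteSink) -> None:
        # Semi-untrusted component: it only sees the bytes handed to it.
        # No ambient filesystem, network, or user-data access is passed in.
        data = source.read()
        compressed = data[:]  # real compression omitted; illustration only
        sink.write(compressed)

    class FileSource:
        def __init__(self, path: str) -> None:
            self.path = path
        def read(self) -> bytes:
            with open(self.path, "rb") as f:
                return f.read()

    class FileSink:
        def __init__(self, path: str) -> None:
            self.path = path
        def write(self, data: bytes) -> None:
            with open(self.path, "wb") as f:
                f.write(data)

    # The caller decides exactly which capabilities the component receives:
    # compress_images(FileSource("photo.raw"), FileSink("photo.out"))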
1. splitting functionality in such a way is not always possible or effective/performant, not to mention operators in practice tend to find fine-grained access control super annoying
2. and more importantly, even if the architecture is working, hostile garbage in your pipeline WILL cause problems with the rest of your app.
An LLM might hallucinate the wrong permissions, but they're going to be plausible guesses.
It's extremely unlikely to hallucinate full network access for a module that has nothing to do with networking.
The LLM will happily write code that permits network access, because it read online an example that did that. And, unless you know better, you won't know to manually turn that off.
Sandboxed WebComponents does not solve anything if your LLM thinks it is helping when it lets the drawbridge down for the orcs.
And the article here is specifically about hallucinations, when it tries to plausibly fill something in according to a pattern.
Wrong information on the internet is as old as the internet...
But, I think we agree, anyway.
Even C allows library initializers running arbitrary code. It was used to implement that attack against ssh via malicious xz library.
Disallowing globals that are not compile-time constants, or at least never initializing them unless the application explicitly asks for it, would nicely address that issue. But language designers think that running arbitrary code before main is a must.
One more point to consider Rust over C++.
I agree. And the problem has intensified due to the explosion of dependencies.
> Has there been any work on limiting the permissions of modules?
With respect to PyPI, npm, and the like, and as far as I know: no. But regarding C and generally things you can control relatively easily yourself, see for instance:
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Also, it's hard to argue against hard process isolation. Spectre et al. are much easier to defend against at process boundaries. It's probably higher value to make it easier to put submodules into their own sandboxed processes.
Sure: the idea could be improved a lot. And then there is the maintenance burden. Here, perhaps a step forward would be if every package author would provide a "pledge" (or whatever you want to call the idea) instead of others trying to figure out what capabilities are needed. Then you could also audit whether a "pledge" holds in reality.
You can do SLSA, SBOM and package attestation with confirmed provenance.
But as mentioned, it is still some work, though more tools keep popping up.
The downside is that you can end up with a signed, attested package that still turns out to be malicious, just as malware creators got their stuff signed with Microsoft's help.
E.g., "NullPointerException" could be a single token, the way a kanji is. Current LLMs process it like "N", "ull", "P", "oint", "er", "Excep", "tion". This lets them make up "PullDrawerException", which is only useful outside code.
That kind of creativity is not useful in code, in which identifiers are just friendly names for pointer addresses.
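If you want to see the split for yourself, here's a quick sketch assuming the tiktoken library and its cl100k_base encoding (the exact pieces vary by model and tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for ident in ("NullPointerException", "PullDrawerException"):
        token_ids = enc.encode(ident)
        # decode each token id individually to see how the identifier was split
        pieces = [enc.decode([t]) for t in token_ids]
        print(ident, "->", pieces)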
I guess the real question is how much business sense such a solution would make. "The S in $buzzword stands for security" kind of thing.
You could have two different packages in a build doing similar things -- one uses less memory but is slower to compute than the other -- so they're used selectively by scenario, based on previous experience in production
If someone unfamiliar with the build makes a change and the assistant swaps the package used in the change -- which goes unnoticed as the package itself is already visible and the naming is only slightly different, it's easy to see how surprises can happen
(I've seen o3 do this every time the prompt was re-run in this situation)
then it gives me more hallucinations
correcting the latest hallucination results in it telling me the first hallucination
Therefore I tend to work on a one-shot prompt, and restart the session entirely each time, making tweaks to the prompt based on each output hoping to get a better result (I've found it helpful to point out the AI's past errors as "common mistakes to be avoided").
Doing the prompting in this way also vastly reduces the context size sent with individual requests (asking it to fix something it just made in conversation tends to resubmit a huge chunk of context and use up allowance quotas). Then, if there are bits the AI never quite got correct, I'll go in bit by bit and ask it to fix an individual function or two, with a new session and heavily pruned context.
Most of the time when I do use it, I almost always use just a couple prompts before starting a completely new one because it just falls off a cliff in terms of reliability after the first couple messages. At that point you're better off fixing it yourself than trying to get it to do it a way you'll accept.
1. Ask a question / present a problem, but usually without enough context to the problem and solution space they want to zero in on.
2. The AI does an honest job given the context, but is off alignment in some specific way that the user did not clarify initially up front.
3. Asks the AI to correct for this, along with multiple other requests for changes toward the solution they want.
4-6. Loop. They get a response, like the corrections (sometimes), and continue to make changes in a back-and-forth, conversation-like interaction, only copying out corrected code blocks and copying in specific code chunks for correction.
7+. The output gets progressively worse and worse, undoing corrections/changes/modifications that were previously discussed.
At this point I try to interrupt the spiraling death loop and ask the user:
- (rhetorical) why are you talking to the AI like a human being?
- What is in your context window at this point in the conversation?
If they can answer the context window question, AND understand how the AI ingests input and produces output, usually it's a lightbulb moment. If they don't quite realize that they are polluting their context window, then I try to get them to be aware that everything in the context window is statistically weighted and will affect the output. If a tainted input is provided, the chances of an untainted output are lower than otherwise. You want to provide high-quality context window input, ideally fully control it. That means you do NOT want to have a conversation with the AI for real work; you need to embrace `zero shotting` everything you ask. This approach maximizes exactly what these AIs are best trained for, trained on, how they are trained, and how they `understand` things.
This requires a lot more hand-holding and curated prompting, i.e. prompt engineering, than people will honestly realize/admit to. Prompt engineering isn't black magic; it's intelligent contextualization that plays into the strengths of the implicit knowledge the AI has. Worst things for an LLM super user?
- copy-paste tedium (doing it by hand)
- RAG auto-compression (letting an algorithm determine critical context decisions)
- opaque context window systems (how is the conversation stored and presented to the LLM each turn?)
- system prompt inaccessibility in certain online providers (system prompt is still super critical for driving)
- general `magic` behavior exhibited when using a plain/simple chat interface (this is usually unraveled ONLY by understanding the full context window)
The only LLM that has been SUCCESSFUL at conversing with me and maintaining state through a flowing conversation has been the newest Gemini 2.5 Pro offering, and ONLY up to 100K out of the 1M context window. I have had (very minor) forgetting after 100K, and I deep-dove into the conversation at that point to understand what was going on; it appears that the conceptual conversation compression is in some way lossy, losing some conversation bits.
Every other LLM has had the facade of maintaining conversation state, but only Gemini 2.5 Pro Preview has actually held that up (with firm limitations!). I suspect that large context window optimization/compression is to blame, some providers are aggressive with it.
If you happen to like using less popular frameworks, libraries, packages, etc., it's like fighting an uphill battle because it will constantly try to inject what it interprets as the most common way to do things.
I do find it useful for smaller parts of features or writing things like small utilities or things at a scale where it's easy to manage/track where it's going and intervene
But full on vibe coding auto accept everything is madness whenever I see it.
Either they don't retain previous information, or they are so desperate to give you any answer that they'd prefer the wrong answer. Why is it that an LLM can't go: Yeah, I don't know.
human input/review/verification/validation is always required. verify the untrusted output of these systems. don’t believe the hype and don’t blindly trust them.
—
i did find the fact that google search’s assistant just parroted the crafted/fake READMEs thing particularly concerning - propagating false confidence/misplaced trust - although it’s not at all surprising given the current state of things.
genuinely feel like “classic search” and “new-fangled LLM queries” need to be split out and separated for low-level/power user vs high-level/casual questions.
at least with classic search i’m usually finding a github repo fairly quickly that i can start reading through, as an example.
at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
but, are humans not also a magic black box? We don't know what's going on in other people's heads, and while you can communicate with a human and tell them to do something, they are prone to misunderstanding, not listening, or lying. (which is quite similar to how LLMs behave!)
> at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
yes, us humans have similar issues to the magic black box. i’m not arguing humans are perfect.
this is why we have human code review, tests, staging environments etc. in the release cycle. especially so in safety/security critical contexts. plus warnings from things like register articles/CVEs to keep track of.
like i said. don’t blindly trust the untrusted output (code) of these things — always verify it. like making sure your dependencies aren’t actually crypto miners. we should be doing that normally. but some people still seem to believe the hype about these “magic black box oracles”.
the whole “agentic”/mcp/vibe-coding pattern sounds completely fucking nightmare-ish to me as it reeks of “blindly trust everything LLM throws at you despite what we’ve learned in the last 20 years of software development”.
Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
i was going to say, sure yeah i’m currently building a portfolio/personal website for myself in react/ts, purely for interview showing off etc. probably a good candidate for “vibe coding”, right? here’s the problem - which is explicitly discussed in the article - vibe coding this thing can bring in a bunch of horrible dependencies that do nefarious things.
so i’d be sitting in an interview showing off a few bits and pieces and suddenly their CPU usage spikes at 100% util over all cores because my vibe-coded personal site has a crypto miner package installed and i never noticed. maybe it does some data exfiltration as well just for shits and giggles. or maybe it does <insert some really dark thing here>.
“safety and security critical” applies in way more situations than people think it does within software engineering. so many mundane/boring/vibe-it-out-the-way things we do as software engineers have implicit security considerations to bear in mind (do i install package A or package B?). which is why i find the entire concept of “vibe-coding” to be nightmarish - it treats everything as a secondary consideration to convenience and laziness, including basic and boring security practices like “don’t just randomly install shit”.
I don't know about you, but for most people theory of mind develops around age 2...
Can't really remember what it was exactly anymore, something in Apple's Vision libraries that just kept popping up if I didn't explicitly say not to use it.
Might also want to support multiple allow lists, so one can add to a standard list in a project (after review). And also deny lists, so one can remove a few packages without opting out of common lists entirely.
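A minimal sketch of how that could work, assuming plain-text list files and a requirements.txt to check (all file names here are hypothetical):

    from pathlib import Path

    def load_list(path: str) -> set[str]:
        # one package name per line; blank lines and comments ignored
        p = Path(path)
        if not p.exists():
            return set()
        return {line.strip().lower() for line in p.read_text().splitlines()
                if line.strip() and not line.lstrip().startswith("#")}

    # union of a shared list and a per-project list, minus explicit denials
    allowed = load_list("allowlist-common.txt") | load_list("allowlist-project.txt")
    denied = load_list("denylist-project.txt")

    req = Path("requirements.txt")
    for line in (req.read_text().splitlines() if req.exists() else []):
        # crude parse: take the name before any version specifier
        name = line.split("==")[0].split(">=")[0].split("<")[0].strip().lower()
        if not name or name.startswith("#"):
            continue
        if name in denied or name not in allowed:
            print(f"needs review before install: {name}")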
It is shitloads of work to maintain.
Getting a new package from zero into any Linux distribution is close to impossible.
Debian sucks, as no one gets on top of reviewing and testing.
"Can we just" doesn't cover it: there is loads of work to be done to curate packages, and no one is willing to pay for it.
There is so far no model that works where you can have up-to-date, cutting-edge stuff reviewed. So you are stuck with 5-year-old crap because it was reviewed.
I use it... in my shell? Using various shortcuts? Ctrl+T to select file? Alt+C to change dir? Ctrl+R to search history? I use this for my shell integration:
# helper: source a file only if it exists
function maybe_source { [ -f "$1" ] && source "$1" }
maybe_source /usr/share/doc/fzf/examples/key-bindings.zsh
maybe_source /usr/share/doc/fzf/examples/completion.zsh
https://github.com/junegunn/fzf/tags https://tracker.debian.org/pkg/fzf
The bot hallucinated a non-existent mongoDB Powershell cmdlet, complete with documentation on how it works, and then spat out a "solution" to the problem I asked. Every time I reworked the prompt, cut it up into smaller chunks, narrowed the scope of the problem, whatever I tried, the chatbot kept flatly hallucinating non-existent cmdlets, Python packages, or CLI commands, sometimes even providing (non-working) "solutions" in languages I didn't explicitly ask for (such as bash scripting instead of Powershell).
This was at a large technology company, no less, one that's "all-in" on AI.
If you're staying in a very narrow line with a singular language throughout and not calling custom packages, cmdlets, or libraries, then I suspect these things look and feel quite magical. Once you start doing actual work, they're complete jokes in my experience.
I for one do not want my libraries' APIs defined by the median person commenting about code or asking questions on Stack Overflow.
Also, every time I see people using LLM output as a starting point for software architecture, the results become completely useless.
That's actually hilarious.
It's quirks like these that prove LLMs are a long long way from AGI.
I remember, fresh out of college, being shocked by the amount of bugs in open source.
A lot of model training these days uses synthetic data. Generating good synthetic data for code is a whole lot easier than for any other category, as you can at least ensure the code you're generating is grammatically valid and executes without syntax errors.
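A rough sketch of that kind of filter in Python, using only the standard library (this checks that a sample parses and compiles; running it safely to check behaviour is a separate, harder problem):

    import ast

    def passes_basic_checks(source: str) -> bool:
        # cheap filter for synthetic code samples: keep only code that parses and compiles
        try:
            tree = ast.parse(source)                # grammatically valid Python?
            compile(tree, "<synthetic>", "exec")    # compiles to bytecode?
        except SyntaxError:
            return False
        return True

    samples = [
        "def add(a, b):\n    return a + b\n",
        "def broken(:\n    return\n",  # invalid on purpose
    ]
    print([passes_basic_checks(s) for s in samples])  # -> [True, False]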
Read this instead, it's the technical report that is only linked to and barely mentioned in the article: https://socket.dev/blog/slopsquatting-how-ai-hallucinations-...
Totally understand the skepticism. It’s easy to assume commercial motives are always front and center. But in this case, the company actually came after the problem. I’ve been deep in this space for a long time, and eventually it felt like the best way to make progress was to build something focused on it full-time.
there’s also some info from Python software foundation folks in the register article, so it’s not just a socket pitch article.
"Socket addresses this exact problem. Our platform scans every package in your dependency tree, flags high-risk behaviors like install scripts, obfuscated code, or hidden payloads, and alerts you before damage is done. Even if a hallucinated package gets published and spreads, Socket can stop it from making it into production environments."
False positives where it incorrectly flagged a safe package would result in the need for a human review step, which is even more expensive.
False negatives where malware patterns didn't match anything previously would happen all the time, so if people learned to "trust" the scanning they would get caught out - at which point what value is the scanning adding?
I don't know if there are legal liability issues here too, but that would be worth digging into.
As it stands, there are already third parties that are running scans against packages uploaded to npm and PyPI and helping flag malware. Leaving this to third parties feels like a better option to me, personally.
Seems too late to me. At this point the module/package was already added into the ecosystem, it could potentially be some time (months?) before it is flagged by third party and removed.
The word "just" here always presumes magic that does not actually exist.
The magic here is, yes, AI. If you look at the mobile app stores, they've all become much better, although false positives occur, of course.
There are a small number of PyPI things that require human support queues at the moment, and those are sometimes overwhelmed already.
The situation described in the article is similar to having junior developers we don't trust committing code and us releasing it to production and blaming the failure on them.
If a junior on the team does something dumb and causes a big failure, I wonder where the senior engineers and managers were during that situation. We closely supervise and direct the work of those people until they've built the skills and ways of thinking needed to be ready for that kind of autonomy. There are reasons we have multiple developers of varying levels of seniority: trust.
We build relationships with people, and that is why we extend them trust. We don't extend trust to people until they have demonstrated they are worthy of it over a period of time. At the heart of relationships is that we talk to each other and listen to each other, grow and learn about each other, are coachable, get onto the same page with each other. Although there are ways to coach LLMs and fine-tune them, LLMs don't do nearly as good a job at this kind of growth and trust building as humans do. LLMs are super useful and absolutely should be worked into the engineering workflow, but they don't deserve the kind of trust that some people erroneously give them.
You still have to care deeply about your software. If this story talked about inexperienced junior engineers messing up codebases, I'd be wondering where the senior engineers and leadership were in allowing that to mess things up. A huge part of engineering is all about building reliable systems out of unreliable components and always has been. To me this story points to process improvement gaps and ways of thinking people need to change more than it points to the weak points of AI.
Does having a coworker automatically make a person dumb and no longer willing or able to grow? Does an engineer who becomes a manager instantly lose their ability to work or grow or learn? Sometimes, yes I know, but it’s not a foregone conclusion.
Agents are a new tool in our arsenal and we get to choose how we use them and what it will do for us, and what it will do to us, each as individuals.
Change of roles is a twist I didn’t suggest; it’s not related to my argument. I was talking about an engineering role. I’m not seeing an analogy with what you’re suggesting. Even less so does your suggested “immediately” resonate with me. Such transitions are rarely immediate. Growth on an alternative career path is a different story.
The problem that I see here is that we’re not given the choice you’re considering. Take for example the recent Shopify pivot. It is now expected by management because they believe the exaggerated hype, especially amid the ongoing financing crunch - in many places. So it’s not a lawnmower we’re talking about here but an oracle one would need to be capable of challenging.
This idea of programs fetching reams of needed stuff from the cloud somewhere is a real scourge in programming.
I think people in software envy real engineering too much. Software development is what it is. If it does not live up to that bar then so be it. But AI-for-code-generation (“AI” for short now) really drops any kind of pretense. I got into software because it was supposed to be analytic, even kind of a priori. And deterministic. What even is AI right now? It melds the very high tech and probabilistic (AI tech) with the low tech of code generation (which is deterministic by itself but not with AI). That’s a regression both in terms of craftsmanship (code generation) and so-called engineering (deterministic). I was looking forward to higher-level software development: more declarative (better programming languages and other things), more tool-assisted (tests, verification), more deterministic and controlled (Nix?), and fewer process redundancies (e.g. fewer redundancies in manual/automated testing, verification, review, auditing). Instead we are mining the hard work of the past three decades and spitting out things that have the mandatory label “this might be anything, verify it yourself”. We aren’t making higher-level tools—we[1] are making a taller tower with fewer support beams, until the tower reaches so high that the wind can topple it at any moment.
The above was just for AI-for-code-generation. AI could perhaps be used to create genuinely higher level processes. A solid structure with better support. But that’s not the current trajectory/hype.
[1] ChatGPT em-dash alert. https://news.ycombinator.com/item?id=43498204
So I gave it a spin, and after the past couple days, it’s been the most terrible IDE experience so far. The LLMs are always in the way, I’ve got Claude 3.5, 3.7, o1, o3-mini, o4, Gemini 2.0-flash, 2.5-pro, with/without reasoning, own models. Embedded Copilot is bugged, editor/agentic Copilot is bugged - it breaks your code if you selectively reject suggestions, your file buffer gets mangled, need to revert everything completely even if something was useful. Sidebar chat can get just as confusing as before. Typescript, Python, Java, Kotlin, Go. Rust won’t even compile, and don’t get me started on C++ codebases. Never had it type-check with mypy and pylance.
In many cases even with codebases, extra MCP servers and fetching remote docs, it’s just not up to the task of making code even type check. Sometimes it just times out or fails on network errors. Very fragile, unreliable, misleading.
I don’t know what people vibe code, but for the variety of codebases I’ve had to work on, it’s just in the way, injecting nonsense or outright garbage, which I need to reject every time. It’s useful as a sed alternative, but a less reliable one; a regex is often faster than 3+ prompts and waiting. It breaks my flow of consciousness, I lose creativity, and I need to check everything after it. Dunning-Kruger maybe.
To me that workflow is but a tailored and integrated StackOverflow, with snippets adapted to your code. Not sure how productive it is to let snippet insertions interfere with your flow, but very helpful when you forget or stumble.
The more people rely on it, the bigger the surprise waiting around the corner the moment AI fails. Now devs rely more on the network link than on their brain, just like when people used to vibe code from StackOverflow. Creativity is at stake. It must be kept in check to stay productive.
There’s a lot of work that’s a waste of time. If the goal is to replace devs, such companies will lose money in the end. If the goal is to assist devs and make them more productive, the LLMs need to adapt to take over such tasks reliably, e.g., scaffolding, standard algorithms, “best practices,” simulating and questioning design/architecture, and the UX must improve.
Sure, with well written prompts you can have some success using AI assistants for things, but also with well-written non-ambiguous prompts you can inexplicably end up with absolute garbage.
Until things become consistent, this sort of generative AI is more akin to a party trick than being able to replace or even supplement junior engineers.
If an LLM spits out code that uses a dependency you aren't familiar with, it's your job to review that dependency before you install it. My lowest effort version of this is to check that it's got a credible commit and release history and evidence that many other people are using it already.
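A sketch of that lowest-effort check against PyPI's public JSON API (it only surfaces signals such as age and release count; it is not a substitute for actually looking at the code, and a hallucinated name will simply 404):

    import json
    import urllib.request

    def pypi_signals(package: str) -> dict:
        # basic trust signals from PyPI; raises HTTPError if the package doesn't exist
        url = f"https://pypi.org/pypi/{package}/json"
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        releases = data.get("releases", {})
        upload_times = [f["upload_time"] for files in releases.values() for f in files]
        return {
            "release_count": len(releases),
            "first_upload": min(upload_times) if upload_times else None,
            "latest_upload": max(upload_times) if upload_times else None,
            "project_urls": data["info"].get("project_urls"),
        }

    print(pypi_signals("requests"))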
Same as if some stranger opens a PR against your project introducing a new-to-you dependency.
If you don't have the discipline to do good code review, you shouldn't be using AI-assisted programming outside of safe sandbox environments.
(Understanding "safe sandbox environment" is a separate big challenge!)
Lmfaooo
I suspect that this style of development became popular in the first place because the LGPL has different copyright implications based on whether code is statically or dynamically linked. Corporations don't want to be forced to GPL their code so a system that outsources libraries to random web sites solves a legal problem for them.
But it creates many worse problems because it involves linking your code to code that you didn't write and don't control. This upstream code can be changed in a breaking way or even turned into malware at any time but using these dependencies means you are trusting that such things won't happen.
Modern dependency based software will never "just work" decades from now like all of that COBOL code from the 1960s that infamously still runs government and bank computer systems on the backend. Which is probably a major reason why they won't just rewrite the COBOL code.
You could say as a counterargument that operating systems often include breaking changes as well. Which is true but you don't update your operating system on a regular basis. And the most popular operating system (Windows) is probably the most popular because Microsoft historically has prioritized backward compatibility even to the extreme point of including special code in Windows 95 to make sure it didn't break popular games like SimCity that relied on OS bugs from Windows 3.1 and MS-DOS[0].
[0]: https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
Turning this around: a great use case is to ask AI to review documents, APIs, etc. AI is really great for teasing out your blindspots.
The wisdom of the crowd only works for the aggregated end result, not if you consider every individual answer; then you get more wrong answers, because you fall back to the average.