I describe it as more of a “mashup”, like an interpolation of statistically related output that was in the training data.
The thinking was in the minds of the people that created the tons of content used for training, and from the view of information theory there is enough redundancy in the content to recover much of the intent statistically. But some intent is harder to extract just from example content.
So when generating statistically similar output, the statistical model can miss the hidden rules that were part of the thinking that went into the content used for training.
Makes sense. Hidden rules such as, "recommending a package works only if I know the package actually exists and I’m at least somewhat familiar with it."
Now that I think about it, this is pretty similar to cargo-culting.
LLMs are just so happy to generate enough tokens that look right-ish. They need so many examples driven into them during training.
The map is not the territory, and we’re training them on the map of our codified outputs. They don’t actually have to survive. They’re pretty amazing, but of course they’re absolutely not doing what we do, because success for us and for them looks so different. We need to survive.
(Please can we not have one that really wants to survive.)
I'm still a fan of the standard term "lying." Intent, or a lack thereof, doesn't matter. It's still a lie.
If someone told you it's Thursday when it's really Wednesday, we would not necessarily say they lied. We would say they were mistaken, if the intent was to tell you the correct day of the week. If they intended to mislead you, then we would say they lied.
So intent does matter. AI isn't lying, it intends to provide you with accurate information.
Maybe we should call the output "synthetic lies" to distinguish it from the natural lies produced by humans?
Summary from Wikipedia: https://en.m.wikipedia.org/wiki/Bullshit
> statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth
It's a perfect fit for how LLMs treat "truth": they don't know, so they can't care.
Jake: You lied to me.
Elwood: Wasn't lies, it was just... bullshit.
Why are we making excuses for machines?
Because the OP's name seems way more descriptive and easier to generalize.
Personally, I'd be scared if LLMs were proven to be deliberately deceptive, but I think they currently fall in the two latter camps, if we're doing human analogies.
Did the answers strike you as deceptive?
Oh yeah that's exactly what I want from a machine intelligence, a "best friend who knows everything about me," is that they just make shit up that they think I'd like to hear. I'd really love a personal assistant that gets me and my date a reservation at a restaurant that doesn't exist. That'll really spice up the evening.
The mental gymnastics involved in the AI community are truly pushing the boundaries of parody at this point. If your machines mainly generate bullshit, they cannot be serious products. If on the other hand they're intelligent, why do they make up so much shit? You just can't have this both ways and expect to be taken seriously.
Once you figure out how to do that they're absurdly useful.
Maybe a good analogy here is working with animals? Guide dogs, sniffer dogs, falconry... all cases where you can get great results but you have to learn how to work with a very unpredictable partner.
Name literally any other technology that works this way.
> Guide dogs, sniffer dogs, falconry...
Guide dogs are an imperfect solution to an actual problem: some people's inability to see. And dogs respond to training far more reliably than LLMs respond to prompts.
Sniffer dogs are at least in part bullshit: many studies have shown they respond to the subtle cues of their handlers far more reliably than to anything they actually smell. And the best part is that they also (completely outside their own control, mind you) ruin lives when they falsely detect drugs in cars that look the way the officer handling them thinks a car with drugs inside looks.
And falconry is a hobby.
Since you don't like my animal examples, how about power tools? Chainsaws, table saws, lathes... all examples of tools where you have to learn how to use them before they'll be useful to you.
(My inability to come up with an analogy you find convincing shouldn't invalidate my claim that "LLMs are unreliable technology that is still useful if you learn how to work with it" - maybe this is the first time that's ever been true for an unreliable technology, though I find that doubtful.)
The internet for one.
Not the internet itself (although it certainly can be unreliable), but rather the information on it.
Which I think is more relevant to the argument anyway, as LLMs do in fact reliably function exactly the way they were built to.
Information on the internet is inherently unreliable. It’s only when you consider externalities (like reputation of source) that its information can then be made “reliable”.
Information that comes out of LLMs is inherently unreliable. It’s only through externalities (such as online research) that its information can be made reliable.
Unless you can invent a truth machine that somehow can tell truth from fiction, I don’t see either of these things becoming reliable, stand-alone sources of information.
How about people? They make mistakes all the time, disobey instructions, don’t show up to work, occasionally attempt to embezzle or sabotage their employers. Yet we manage to build huge successful companies out of them.
Probabilistic prime number tests.
I'm being slightly facetious. Such tests differ from LLMs in the crucial respect that we can quantify their probability of failure. And personally I'm quite skeptical of LLMs myself. Nevertheless, there are techniques that can help us use unreliable tools in reliable ways.
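For what it's worth, here is a minimal Miller-Rabin sketch in Python (my own illustration, not something from the thread): each random witness either proves the number composite or lets an odd composite slip through with probability at most 1/4, so after k independent rounds the failure probability is bounded by 4^-k. That is the kind of quantified-unreliability guarantee LLM output doesn't come with.

    import random

    def is_probable_prime(n: int, rounds: int = 20) -> bool:
        # Miller-Rabin: False means definitely composite;
        # True means prime, except with probability <= 4**-rounds.
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:          # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False       # a witnesses that n is composite
        return True

    print(is_probable_prime(2**127 - 1))  # a known Mersenne prime -> True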
I have read some posts of yours advancing that, but I never found the ones with the details: do you mean more "prompt engineering", or "application selection", or "system integration"...?
[1] I built https://tools.simonwillison.net/hacker-news-thread-export this morning from my phone using that trick: https://claude.ai/share/7d0de887-5ff8-4b8c-90b1-b5d4d4ca9b84
[2] Examples of that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#b...
[3] https://simonwillison.net/2024/Sep/25/o1-preview-llm/ is an early example of using a "reasoning" model for that
Or if you meant "what do you have to figure out to use them effectively despite their flaws?", that's a huge topic. It's mostly about building a deep intuition for what they can and cannot help with, then figuring out how to prompt them (including managing their context of inputs) to get good results. The most I've written about that is probably this piece: https://simonwillison.net/2025/Mar/11/using-llms-for-code/
For documentation answering the newer long context models are wildly effective in my experience. You can dump a million tokens (easily a full codebase or two for most projects) into Gemini 2.5 Pro and get great answers to almost anything.
There are some new anonymous preview models with 1m token limits floating around right now which I suspect may be upcoming OpenAI models. https://openrouter.ai/openrouter/optimus-alpha
I actually use LLMs for command line arguments for tools like ffmpeg all the time, I built a plugin for that: https://simonwillison.net/2024/Mar/26/llm-cmd/
But the use of randomness inside the system should not, in theory, prevent as-if-full reliability - which suggests the architecture could be unfinished, as I expressed with the example of RAG. (E.g.: well-trained natural minds run check systems over provisional output, however obtained.)
> newer long context models
Practical question: if the query-contextual documentation needs to be part of the input (I am not aware of a more efficient way), does not that massively impact the processing time? Suppose you have to examine interactively the content of a Standard Hefty Document of 1MB of text... If so, that would make local LLM use prohibitive.
What we are doing in practice when delegating coding to LLMs is climbing up the abstraction level ladder.
We can compensate for bad software architecture when we understand the code details deeply and make indirect couplings in the code. When we don't understand the code deeply, we need to compensate with good architecture.
That means thinking about code in terms of interfaces, stores, procedures, behaviours, actors, permissions and competences (what the actors should do, how they should behave and the scope of action they should be limited to).
Then these details should reflect directly in the prompts. See how hard it is to make this process agentic, because you need user input in the agent's inner workings.
And after running these prompts and, with luck, successfully extracting functioning components, you are the one who has to put those components together to make a working system.
Except that ladder is built on hallucinated rungs. Coding can be delegated to humans. Coding cannot be delegated to AI, LLM or ML because they are not real nor are they reliable.
An LLM is like a developer without internet or docs access, who needs to write code on paper. Every developer would hallucinate in that environment. It's a miracle that an LLM does so much with such a limited environment.
That's way more advanced than just solving coding interview questions whose solutions could simply be added to the dataset.
You need first to believe there is value in adding AI to your workflow. Then you need to search and find ways to have it add value to you. But you are ultimately the one that understands what value really is and who has to put effort into making AI valuable.
Vim won't make you a better developer just as much as LLMs won't code for you. But they can both be invaluable if you know how to wield them.
[0] https://developers.google.com/workspace/gmail/api/quickstart...
I’m sure you’re finding some use for it.
I can’t wait for when the LLM providers start including ads in the answers to help pay back all that VC money currently being burned.
Both Facebook and Google won by being patient before including ads. MySpace and Yahoo were both riddled with ads early and lost. It will be interesting to see who blinks first. My money is on Microsoft, who added ads to Solitaire of all things.
You can use AI to assist you with lower level coding, maybe coming up with multiple prototypes for a given component, maybe quickly refactoring some interfaces and see if they fit your mental model better.
But if you want AI to make your life easier I think you will have a hard time. AI should be just another tool in your toolbelt to make you more productive when implementing stuff.
So my question is, why do you expect LLMs to be 100% accurate to have any value? Shouldn't developers do their work and integrate LLMs to speed up some steps in coding process, but still taking ownership of the process?
Remember, there is no free lunch.
You’re not abstracting if you are generating code that you have to verify/fret about. You’re at exactly the same level as before.
Garbage collection is an abstraction. AI-generated C code that uses manual memory management isn’t.
100%. I like to say that we went from building a Millennium Falcon out of individual LEGO pieces, to instead building an entire LEGO planet made of Falcon-like objects. We’re still building, the pieces are just larger :)
I really wonder what the solution is.
Has there been any work on limiting the permissions of modules? E.g. by default a third-party module can't access disk or network or various system calls or shell functions or use tools like Python's "inspect" to access data outside what is passed to them? Unless you explicitly pass permissions in your import statement or something?
Components can't do any IO or interfere with any other components in an application except through interfaces explicitly given to them. So you could, e.g., have a semi-untrusted image compression component composed with the rest of your app, and not have to worry that it's going to exfiltrate user data.
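A rough sketch of that idea in Python (hypothetical names; Python alone can't enforce it, since the component could still import socket on its own, which is why object-capability systems exist): the application hands the component only the narrow interfaces it needs, and nothing else.

    from typing import Protocol

    class ByteSource(Protocol):
        def read(self) -> bytes: ...

    class ByteSink(Protocol):
        def write(self, data: bytes) -> None: ...

    def compress_images(source: ByteSource, sink: ByteSink) -> None:
        # Semi-untrusted component: it only sees the bytes handed to it.
        # No ambient filesystem, network, or user-data access is passed in.
        data = source.read()
        compressed = data[:]  # real compression omitted; illustration only
        sink.write(compressed)

    class FileSource:
        def __init__(self, path: str) -> None:
            self.path = path
        def read(self) -> bytes:
            with open(self.path, "rb") as f:
                return f.read()

    class FileSink:
        def __init__(self, path: str) -> None:
            self.path = path
        def write(self, data: bytes) -> None:
            with open(self.path, "wb") as f:
                f.write(data)

    # The caller decides exactly which capabilities the component receives:
    # compress_images(FileSource("photo.raw"), FileSink("photo.out"))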
1. splitting functionality in such a way is not always possible or effective/performant, not to mention operators in practice tend to find fine-grained access control super annoying
2. and more importantly, even if the architecture is working, hostile garbage in your pipeline WILL cause problems with the rest of your app.
An LLM might hallucinate the wrong permissions, but they're going to be plausible guesses.
It's extremely unlikely to hallucinate full network access for a module that has nothing to do with networking.
The LLM will happily write code that permits network access, because it read online an example that did that. And, unless you know better, you won't know to manually turn that off.
Sandboxed WebComponents does not solve anything if your LLM thinks it is helping when it lets the drawbridge down for the orcs.
And the article here is specifically about hallucinations, when it tries to plausibly fill something in according to a pattern.
Wrong information on the internet is as old as the internet...
But, I think we agree, anyway.
Even C allows library initializers running arbitrary code. It was used to implement that attack against ssh via malicious xz library.
Disallowing globals that are not compile-time constants, or at least never initializing them unless the application explicitly asks for it, would nicely address that issue. But language designers think that running arbitrary code before main is a must.
One more point to consider Rust over C++.
I agree. And the problem has intensified due to the explosion of dependencies.
> Has there been any work on limiting the permissions of modules?
With respect to PyPI, npm, and the like, and as far as I know: no. But regarding C and generally things you can control relatively easily yourself, see for instance:
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Also, it's hard to argue against hard process isolation. Spectre et al. are much easier to defend against at process boundaries. It's probably higher value to make it easier to put submodules into their own sandboxed processes.
Sure: the idea could be improved a lot. And then there is the maintenance burden. Here, perhaps a step forward would be if every package author would provide a "pledge" (or whatever you want to call the idea) instead of others trying to figure out what capabilities are needed. Then you could also audit whether a "pledge" holds in reality.
You can do SLSA, SBOM and package attestation with confirmed provenance.
But as mentioned, it is still some work, though more tools keep popping up.
The downside is that you can end up with a signed, attested package that still turns out to be malicious, just as malware creators got their stuff signed with Microsoft's help.
E.g., "NullPointerException" could be a single token, the way a kanji is. Current LLMs process it like "N", "ull", "P", "oint", "er", "Excep", "tion". This lets them make up "PullDrawerException", which is only useful outside code.
That kind of creativity is not useful in code, in which identifiers are just friendly names for pointer addresses.
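If you want to see the split for yourself, here's a quick sketch assuming the tiktoken library and its cl100k_base encoding (the exact pieces vary by model and tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for ident in ("NullPointerException", "PullDrawerException"):
        token_ids = enc.encode(ident)
        # decode each token id individually to see how the identifier was split
        pieces = [enc.decode([t]) for t in token_ids]
        print(ident, "->", pieces)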
I guess the real question is how much business sense such a solution would make. "The S in $buzzword stands for security" kind of thing.
You could have two different packages in a build doing similar things -- one uses less memory but is slower to compute than the other -- so they're used selectively by scenario, based on previous experience in production
If someone unfamiliar with the build makes a change and the assistant swaps the package used in the change -- which goes unnoticed as the package itself is already visible and the naming is only slightly different, it's easy to see how surprises can happen
(I've seen o3 do this every time the prompt was re-run in this situation)
then it gives me more hallucinations
correcting the latest hallucination results in it telling me the first hallucination
Therefore I tend to work on a one-shot prompt, and restart the session entirely each time, making tweaks to the prompt based on each output hoping to get a better result (I've found it helpful to point out the AI's past errors as "common mistakes to be avoided").
Doing the prompting in this way also vastly reduces the context size sent with individual requests (asking it to fix something it just made in conversation tends to resubmit a huge chunk of context and use up allowance quotas). Then, if there are bits the AI never quite got correct, I'll go in bit by bit and ask it to fix an individual function or two, with a new session and heavily pruned context.
Most of the time when I do use it, I almost always use just a couple prompts before starting a completely new one because it just falls off a cliff in terms of reliability after the first couple messages. At that point you're better off fixing it yourself than trying to get it to do it a way you'll accept.
1. Ask a question / present a problem, but usually without enough context to the problem and solution space they want to zero in on.
2. The AI does an honest job given the context, but is off alignment in some specific way that the user did not clarify initially up front.
3. Asks the AI to correct for this, along with multiple other requests for changes toward the solution they want.
4-6. Loop. They get a response, like the corrections (sometimes), and continue to make changes in a back-and-forth, conversation-like interaction, only copying out corrected code blocks and copying in specific code chunks for correction.
7+. The output gets progressively worse and worse, undoing corrections/changes/modifications that were previously discussed.
At this point I try to interrupt the spiraling death loop and ask the user:
- (rhetorical) why are you talking to the AI like a human being?
- What is in your context window at this point in the conversation?
If they can answer the context window question, AND understand how the AI ingests input and produces output, usually it's a lightbulb moment. If they don't quite realize that they are polluting their context window, then I try to get them to be aware that everything in the context window is statistically weighted and will affect the output. If a tainted input is provided, the chances of an untainted output are lower than otherwise. You want to provide high-quality context window input, ideally fully control it. That means you do NOT want to have a conversation with the AI for real work; you need to embrace `zero shotting` everything you ask. This approach maximizes exactly what these AIs are best trained for, trained on, how they are trained, and how they `understand` things.
This requires a lot more hand-holding and curated prompting, i.e. prompt engineering, than people will honestly realize/admit to. Prompt engineering isn't black magic; it's intelligent contextualization that plays into the strengths of the implicit knowledge the AI has. Worst things for an LLM super user?
- copy-paste tedium (doing it by hand)
- RAG auto-compression (letting an algorithm determine critical context decisions)
- opaque context window systems (how is the conversation stored and presented to the LLM each turn?)
- system prompt inaccessibility in certain online providers (system prompt is still super critical for driving)
- general `magic` behavior exhibited when using a plain/simple chat interface (this is usually unraveled ONLY by understanding the full context window)
The only LLM that has been SUCCESSFUL at conversing with me and maintaining state through a flowing conversation has been the newest Gemini 2.5 Pro offering, and ONLY up to 100K out of the 1M context window. I have had (very minor) forgetting after 100K, and I deep-dove into the conversation at that point to understand what was going on; it appears that the conceptual conversation compression is in some way lossy, losing some conversation bits.
Every other LLM has had the facade of maintaining conversation state, but only Gemini 2.5 Pro Preview has actually held that up (with firm limitations!). I suspect that large context window optimization/compression is to blame, some providers are aggressive with it.
If you happen to like using less popular frameworks, libraries, packages, etc., it's like fighting an uphill battle because it will constantly try to inject what it interprets as the most common way to do things.
I do find it useful for smaller parts of features or writing things like small utilities or things at a scale where it's easy to manage/track where it's going and intervene
But full on vibe coding auto accept everything is madness whenever I see it.
Either they don't retain previous information, or they are so desperate to give you any answer that they'd prefer the wrong answer. Why is it that an LLM can't go: Yeah, I don't know.
human input/review/verification/validation is always required. verify the untrusted output of these systems. don’t believe the hype and don’t blindly trust them.
—
i did find the fact that google search’s assistant just parroted the crafted/fake READMEs thing particularly concerning - propagating false confidence/misplaced trust - although it’s not at all surprising given the current state of things.
genuinely feel like “classic search” and “new-fangled LLM queries” need to be split out and separated for low-level/power user vs high-level/casual questions.
at least with classic search i’m usually finding a github repo fairly quickly that i can start reading through, as an example.
at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
but, are humans not also a magic black box? We don't know what's going on in other people's heads, and while you can communicate with a human and tell them to do something, they are prone to misunderstanding, not listening, or lying. (which is quite similar to how LLMs behave!)
> at the same time, i could totally see myself scanning through a README and going “yep, sounds like what i need” and making the same mistake (i need other people checking my work too).
yes, us humans have similar issues to the magic black box. i’m not arguing humans are perfect.
this is why we have human code review, tests, staging environments etc. in the release cycle. especially so in safety/security critical contexts. plus warnings from things like register articles/CVEs to keep track of.
like i said. don’t blindly trust the untrusted output (code) of these things — always verify it. like making sure your dependencies aren’t actually crypto miners. we should be doing that normally. but some people still seem to believe the hype about these “magic black box oracles”.
the whole “agentic”/mcp/vibe-coding pattern sounds completely fucking nightmare-ish to me as it reeks of “blindly trust everything LLM throws at you despite what we’ve learned in the last 20 years of software development”.
Vibe coding is all about deciding it doesn’t matter if the implementation is perfect. And that’s true for some things!
i was going to say, sure yeah i’m currently building a portfolio/personal website for myself in react/ts, purely for interview showing off etc. probably a good candidate for “vibe coding”, right? here’s the problem - which is explicitly discussed in the article - vibe coding this thing can bring in a bunch of horrible dependencies that do nefarious things.
so i’d be sitting in an interview showing off a few bits and pieces and suddenly their CPU usage spikes at 100% util over all cores because my vibe-coded personal site has a crypto miner package installed and i never noticed. maybe it does some data exfiltration as well just for shits and giggles. or maybe it does <insert some really dark thing here>.
“safety and security critical” applies in way more situations than people think it does within software engineering. so many mundane/boring/vibe-it-out-the-way things we do as software engineers have implicit security considerations to bear in mind (do i install package A or package B?). which is why i find the entire concept of “vibe-coding” to be nightmarish - it treats everything as a secondary consideration to convenience and laziness, including basic and boring security practices like “don’t just randomly install shit”.
I don't know about you, but for most people theory of mind develops around age 2...
Can't really remember what it was exactly anymore, something in Apple's Vision libraries that just kept popping up if I didn't explicitly say not to use it.
Might also want to support multiple allow lists, so one can add to a standard list in a project (after review). And also deny lists, so one can remove a few packages without opting out of common lists entirely.
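A minimal sketch of how that could work, assuming plain-text list files and a requirements.txt to check (all file names here are hypothetical):

    from pathlib import Path

    def load_list(path: str) -> set[str]:
        # one package name per line; blank lines and comments ignored
        p = Path(path)
        if not p.exists():
            return set()
        return {line.strip().lower() for line in p.read_text().splitlines()
                if line.strip() and not line.lstrip().startswith("#")}

    # union of a shared list and a per-project list, minus explicit denials
    allowed = load_list("allowlist-common.txt") | load_list("allowlist-project.txt")
    denied = load_list("denylist-project.txt")

    req = Path("requirements.txt")
    for line in (req.read_text().splitlines() if req.exists() else []):
        # crude parse: take the name before any version specifier
        name = line.split("==")[0].split(">=")[0].split("<")[0].strip().lower()
        if not name or name.startswith("#"):
            continue
        if name in denied or name not in allowed:
            print(f"needs review before install: {name}")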
It is shitloads of work to maintain.
Getting a new package from zero into any Linux distribution is close to impossible.
Debian sucks, as no one gets on top of reviewing and testing.
"Can we just" doesn't cover it: there is loads of work to be done to curate packages, and no one is willing to pay for it.
There is so far no model that works where you can have up-to-date, cutting-edge stuff reviewed. So you are stuck with 5-year-old crap because it was reviewed.
I use it... in my shell? Using various shortcuts? Ctrl+T to select file? Alt+C to change dir? Ctrl+R to search history? I use this for my shell integration:
# helper: source a file only if it exists
function maybe_source { [ -f "$1" ] && source "$1" }
maybe_source /usr/share/doc/fzf/examples/key-bindings.zsh
maybe_source /usr/share/doc/fzf/examples/completion.zsh
https://github.com/junegunn/fzf/tags https://tracker.debian.org/pkg/fzf
The bot hallucinated a non-existent mongoDB Powershell cmdlet, complete with documentation on how it works, and then spat out a "solution" to the problem I asked. Every time I reworked the prompt, cut it up into smaller chunks, narrowed the scope of the problem, whatever I tried, the chatbot kept flatly hallucinating non-existent cmdlets, Python packages, or CLI commands, sometimes even providing (non-working) "solutions" in languages I didn't explicitly ask for (such as bash scripting instead of Powershell).
This was at a large technology company, no less, one that's "all-in" on AI.
If you're staying in a very narrow line with a singular language throughout and not calling custom packages, cmdlets, or libraries, then I suspect these things look and feel quite magical. Once you start doing actual work, they're complete jokes in my experience.
I for one do not want my libraries' APIs defined by the median person commenting about code or asking questions on Stack Overflow.
Also, every time I see people using LLM output as a starting point for software architecture, the results become completely useless.
That's actually hilarious.
It's quirks like these that prove LLMs are a long long way from AGI.
I remember, fresh out of college, being shocked by the amount of bugs in open source.
A lot of model training these days uses synthetic data. Generating good synthetic data for code is a whole lot easier than for any other category, as you can at least ensure the code you're generating is grammatically valid and executes without syntax errors.
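A rough sketch of that kind of filter in Python, using only the standard library (this checks that a sample parses and compiles; running it safely to check behaviour is a separate, harder problem):

    import ast

    def passes_basic_checks(source: str) -> bool:
        # cheap filter for synthetic code samples: keep only code that parses and compiles
        try:
            tree = ast.parse(source)                # grammatically valid Python?
            compile(tree, "<synthetic>", "exec")    # compiles to bytecode?
        except SyntaxError:
            return False
        return True

    samples = [
        "def add(a, b):\n    return a + b\n",
        "def broken(:\n    return\n",  # invalid on purpose
    ]
    print([passes_basic_checks(s) for s in samples])  # -> [True, False]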
Read this instead, it's the technical report that is only linked to and barely mentioned in the article: https://socket.dev/blog/slopsquatting-how-ai-hallucinations-...
Totally understand the skepticism. It’s easy to assume commercial motives are always front and center. But in this case, the company actually came after the problem. I’ve been deep in this space for a long time, and eventually it felt like the best way to make progress was to build something focused on it full-time.
there’s also some info from Python software foundation folks in the register article, so it’s not just a socket pitch article.
"Socket addresses this exact problem. Our platform scans every package in your dependency tree, flags high-risk behaviors like install scripts, obfuscated code, or hidden payloads, and alerts you before damage is done. Even if a hallucinated package gets published and spreads, Socket can stop it from making it into production environments."
False positives where it incorrectly flagged a safe package would result in the need for a human review step, which is even more expensive.
False negatives where malware patterns didn't match anything previously would happen all the time, so if people learned to "trust" the scanning they would get caught out - at which point what value is the scanning adding?
I don't know if there are legal liability issues here too, but that would be worth digging into.
As it stands, there are already third parties that are running scans against packages uploaded to npm and PyPI and helping flag malware. Leaving this to third parties feels like a better option to me, personally.
Seems too late to me. At this point the module/package was already added into the ecosystem, it could potentially be some time (months?) before it is flagged by third party and removed.
The word "just" here always presumes magic that does not actually exist.
The magic here is, yes, AI. If you look at the mobile app stores, they've all become much better, although false positives occur, of course.
There are a small number of PyPI things that require human support queues at the moment, and those are sometimes overwhelmed already.
The situation described in the article is similar to having junior developers we don't trust committing code and us releasing it to production and blaming the failure on them.
If a junior on the team does something dumb and causes a big failure, I wonder where the senior engineers and managers were during that situation. We closely supervise and direct the work of those people until they've built the skills and ways of thinking needed to be ready for that kind of autonomy. There are reasons we have multiple developers of varying levels of seniority: trust.
We build relationships with people, and that is why we extend them trust. We don't extend trust to people until they have demonstrated they are worthy of it over a period of time. At the heart of relationships is that we talk to each other and listen to each other, grow and learn about each other, are coachable, get onto the same page with each other. Although there are ways to coach LLMs and fine-tune them, LLMs don't do nearly as good a job at this kind of growth and trust building as humans do. LLMs are super useful and absolutely should be worked into the engineering workflow, but they don't deserve the kind of trust that some people erroneously give them.
You still have to care deeply about your software. If this story talked about inexperienced junior engineers messing up codebases, I'd be wondering where the senior engineers and leadership were in allowing that to mess things up. A huge part of engineering is all about building reliable systems out of unreliable components and always has been. To me this story points to process improvement gaps and ways of thinking people need to change more than it points to the weak points of AI.
Does having a coworker automatically make a person dumb and no longer willing or able to grow? Does an engineer who becomes a manager instantly lose their ability to work or grow or learn? Sometimes, yes I know, but it’s not a foregone conclusion.
Agents are a new tool in our arsenal and we get to choose how we use them and what it will do for us, and what it will do to us, each as individuals.
Change of roles is a twist I didn’t suggest; it’s not related to my argument. I was talking about an engineering role. I’m not seeing an analogy with what you’re suggesting. Even less so does your suggested “immediately” resonate with me. Such transitions are rarely immediate. Growth on an alternative career path is a different story.
The problem that I see here is that we’re not given the choice you’re considering. Take for example the recent Shopify pivot. It is now expected by management because they believe the exaggerated hype, especially amid the ongoing financing crunch - in many places. So it’s not a lawnmower we’re talking about here but an oracle one would need to be capable of challenging.
This idea of programs fetching reams of needed stuff from the cloud somewhere is a real scourge in programming.
I think people in software envy real engineering too much. Software development is what it is. If it does not live up to that bar then so be it. But AI-for-code-generation (“AI” for short now) really drops any kind of pretense. I got into software because it was supposed to be analytic, even kind of a priori. And deterministic. What even is AI right now? It melds the very high tech and probabilistic (AI tech) with the low tech of code generation (which is deterministic by itself but not with AI). That’s a regression both in terms of craftsmanship (code generation) and so-called engineering (deterministic). I was looking forward to higher-level software development: more declarative (better programming languages and other things), more tool-assisted (tests, verification), more deterministic and controlled (Nix?), and fewer process redundancies (e.g. fewer redundancies in manual/automated testing, verification, review, auditing). Instead we are mining the hard work of the past three decades and spitting out things that have the mandatory label “this might be anything, verify it yourself”. We aren’t making higher-level tools—we[1] are making a taller tower with fewer support beams, until the tower reaches so high that the wind can topple it at any moment.
The above was just for AI-for-code-generation. AI could perhaps be used to create genuinely higher level processes. A solid structure with better support. But that’s not the current trajectory/hype.
[1] ChatGPT em-dash alert. https://news.ycombinator.com/item?id=43498204
So I gave it a spin, and after the past couple days, it’s been the most terrible IDE experience so far. The LLMs are always in the way, I’ve got Claude 3.5, 3.7, o1, o3-mini, o4, Gemini 2.0-flash, 2.5-pro, with/without reasoning, own models. Embedded Copilot is bugged, editor/agentic Copilot is bugged - it breaks your code if you selectively reject suggestions, your file buffer gets mangled, need to revert everything completely even if something was useful. Sidebar chat can get just as confusing as before. Typescript, Python, Java, Kotlin, Go. Rust won’t even compile, and don’t get me started on C++ codebases. Never had it type-check with mypy and pylance.
In many cases even with codebases, extra MCP servers and fetching remote docs, it’s just not up to the task of making code even type check. Sometimes it just times out or fails on network errors. Very fragile, unreliable, misleading.
I don’t know what people vibe code, but for the variety of codebases I’ve had to work on, it’s just in the way, injecting nonsense or outright garbage, which I need to reject every time. It’s useful as a sed alternative, but a less reliable one; a regex is often faster than 3+ prompts and waiting. It breaks my flow of consciousness, I lose creativity, and I need to check everything after it. Dunning-Kruger maybe.
To me that workflow is but a tailored and integrated StackOverflow, with snippets adapted to your code. Not sure how productive it is to let snippet insertions interfere with your flow, but very helpful when you forget or stumble.
The more people rely on it, the bigger the surprise waiting around the corner the moment AI fails. Now devs rely more on the network link than on their brain, just like when people used to vibe code from StackOverflow. Creativity is at stake. It must be kept in check to stay productive.
There’s a lot of work that’s a waste of time. If the goal is to replace devs, such companies will lose money in the end. If the goal is to assist devs and make them more productive, the LLMs need to adapt to take over such tasks reliably, e.g., scaffolding, standard algorithms, “best practices,” simulating and questioning design/architecture, and the UX must improve.
Sure, with well written prompts you can have some success using AI assistants for things, but also with well-written non-ambiguous prompts you can inexplicably end up with absolute garbage.
Until things become consistent, this sort of generative AI is more akin to a party trick than being able to replace or even supplement junior engineers.
If an LLM spits out code that uses a dependency you aren't familiar with, it's your job to review that dependency before you install it. My lowest effort version of this is to check that it's got a credible commit and release history and evidence that many other people are using it already.
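A sketch of that lowest-effort check against PyPI's public JSON API (it only surfaces signals such as age and release count; it is not a substitute for actually looking at the code, and a hallucinated name will simply 404):

    import json
    import urllib.request

    def pypi_signals(package: str) -> dict:
        # basic trust signals from PyPI; raises HTTPError if the package doesn't exist
        url = f"https://pypi.org/pypi/{package}/json"
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        releases = data.get("releases", {})
        upload_times = [f["upload_time"] for files in releases.values() for f in files]
        return {
            "release_count": len(releases),
            "first_upload": min(upload_times) if upload_times else None,
            "latest_upload": max(upload_times) if upload_times else None,
            "project_urls": data["info"].get("project_urls"),
        }

    print(pypi_signals("requests"))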
Same as if some stranger opens a PR against your project introducing a new-to-you dependency.
If you don't have the discipline to do good code review, you shouldn't be using AI-assisted programming outside of safe sandbox environments.
(Understanding "safe sandbox environment" is a separate big challenge!)
Lmfaooo
I suspect that this style of development became popular in the first place because the LGPL has different copyright implications based on whether code is statically or dynamically linked. Corporations don't want to be forced to GPL their code so a system that outsources libraries to random web sites solves a legal problem for them.
But it creates many worse problems because it involves linking your code to code that you didn't write and don't control. This upstream code can be changed in a breaking way or even turned into malware at any time but using these dependencies means you are trusting that such things won't happen.
Modern dependency based software will never "just work" decades from now like all of that COBOL code from the 1960s that infamously still runs government and bank computer systems on the backend. Which is probably a major reason why they won't just rewrite the COBOL code.
You could say as a counterargument that operating systems often include breaking changes as well. Which is true but you don't update your operating system on a regular basis. And the most popular operating system (Windows) is probably the most popular because Microsoft historically has prioritized backward compatibility even to the extreme point of including special code in Windows 95 to make sure it didn't break popular games like SimCity that relied on OS bugs from Windows 3.1 and MS-DOS[0].
[0]: https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
Turning this around: a great use case is to ask AI to review documents, APIs, etc. AI is really great for teasing out your blindspots.
The wisdom of the crowd only works for the aggregated end result, not if you consider every individual answer; then you get more wrong answers, because you fall back to the average.