In other words, they're OK in use cases that programmers should be trying to eliminate anyway, because heavy boilerplate means there's high demand for a reusable library, some new syntax sugar, or an improved API.
As I've matured as a developer, I've appreciated certain types of boilerplate more and more because it's code that shows up in your git diffs. You don't need to chase down the code in some version of some library to see how something works.
Of course, not all boilerplate is created equal.
Edit: I'm not going to code asm just to be cool.
Every 'modern' project takes bucketloads of (annoying) setup and plumbing. Even in rather trivial 'start' cases, the LLM has to spend quite a bit of time to get a hello-world thingy working (almost all of that is because 'modern programmers' have some kind of brain damage concerning backward compatibility; move fast and break things between MINOR versions, with no use or reason for those changes whatsoever beyond 'they liked it better', so stuff never works as it says on the product page; heaven help you if you need something exotic added). It's a terrible timeline for programming, but LLMs do fix at least that annoyance by just changing configs/slabs of code until it works.
Edit: most LLMs are great for spitting out some code that fulfills 90% of what you asked for. That's sometimes all you need. But we all know that the last 10% usually take the same amount of effort as the first 90%.
The issue is when you have an LLM write 10k lines of code for you and 100 of those lines are bugged. Now you need to debug code you did not write and find the bugged parts, which wastes a similar amount of time. And if you do not catch the bugs in time, you think you gained some hours, but you end up with upset customers because things went wrong in code that is just weird.
From my experience you need to work with an LLM and have the code done function by function, with your input, checking it and calling bullshit when it does stupid things.
Simple repetitive shit is easy to reason about, debug and onboard people on.
Naturally it's a balancing act, and modern/popular frameworks are where most people landed; there's been a lot of iteration in this space for decades now.
After a few years those copy-and-pasted code pieces completely drift apart and create a lot of similar-but-different issues that need to be addressed one by one.
My approach for designing abstractions is always to make them composable (not that enterprise Java inheritance chaos) and to allow escaping them when needed.
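A minimal sketch of what I mean (the names and the fetch example are made up for illustration): small wrappers that compose, with the raw building block still reachable when the abstraction gets in the way.

```python
import json
import urllib.request

def raw_fetch(url):
    """The bare building block; always available as an escape hatch."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def with_json(fetch):
    """Wrap any fetch function so it decodes JSON responses."""
    def fetcher(url):
        return json.loads(fetch(url))
    return fetcher

def with_retry(fetch, attempts=3):
    """Wrap any fetch function with naive retries."""
    def fetcher(url):
        last_error = None
        for _ in range(attempts):
            try:
                return fetch(url)
            except OSError as error:
                last_error = error
        raise last_error
    return fetcher

# Compose only the pieces a call site needs...
api_fetch = with_retry(with_json(raw_fetch))

# ...and when the abstraction gets in the way, drop back to raw_fetch()
# directly: there's no subclass hierarchy to fight.
```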
I mean, is that bad? Unless you keep having to make huge MRs that modify every copy/paste, couldn't you just let the code sit there and run forever?
I only say this because I've been a maintenance programmer and I could only dream of a codebase like this. The idea that I get a Rollbar with a stack trace and the entirety of what the code actually does is laid bare right at the site of the error in a single file is amazing. And I can change it without affecting anything else?! I end up having to "unwind" all of the abstractions anyway because the nature of the job means I'm not intimately familiar with the codebase and don't just know where the real work happens.
The landscape (browser capabilities, backend stacks) has settled over the last ten years.
We even had time to standardize on things somewhat.
Building new abstractions at this point is almost always the wrong move. This is one of the ways LLMs will improve software dev: they will kill the framework churn, because they work best on stuff that's already in their training data.
That sort of stuff can be very helpful to newbies in the DIY electronics world for example.
But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines.
Stuff like that works amazingly well.
> But for anything involving my $dayjob it's been fairly useless beyond writing unit test outlines
This was my opinion 3-6 months ago. But I think a lot of tools matured enough to already provide a lot of value for complex tasks. The difficult part is to learn when and how to use AI.
You also have to be very explanatory with a direct communication style.
Our system imports the codebase so it can search and navigate it, plus we feed LSP errors directly to the LLM as development is happening.
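Roughly this shape, I'd guess (a minimal sketch with made-up helpers, not the commenter's actual system): gather whatever diagnostics the toolchain already produces and splice them into the prompt, so the model sees the same errors the developer does.

```python
def diagnostics_for(path):
    """Return syntax diagnostics for a Python file (empty list if clean)."""
    with open(path, encoding="utf-8") as handle:
        source = handle.read()
    try:
        compile(source, path, "exec")
        return []
    except SyntaxError as error:
        return [f"{path}:{error.lineno}: {error.msg}"]

def build_prompt(task, paths):
    """Combine the task description with the current diagnostics."""
    errors = [line for path in paths for line in diagnostics_for(path)]
    report = "\n".join(errors) if errors else "no diagnostics"
    return f"{task}\n\nCurrent diagnostics:\n{report}"

# The resulting string is what gets sent to whatever LLM API you use;
# a real setup would pull richer diagnostics from an LSP server instead.
```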
Yeah I guess that would help a lot. Stuck with a bit more primitive tools here, so that doesn't help.
Yep, if it's newbie stuff there are enough tutorials out there for the LLM to have data.
It's when you get off the tutorials and do the actual functionality of your project that they fail.
They can't; they usually just don't understand the code well enough to notice the issues immediately.
The perceived quality of LLM answers is inversely proportional to the user's understanding of the topic they're asking about.
When I'm using llama.vim, like 40% of what it writes in a 4-5 line completion is exactly what I'd write. 20-30% is stuff that I wouldn't judge coming from someone else, so I usually accept it. And 30-40% is garbage... but I just write a comment or a couple of lines, instead, and then reroll the dice.
It's like working through a junior engineer, except the junior engineer types a new solution instantly. I can get down to alternating between mashing tab and writing tricky lines.
My comment was more about just asking questions on how to do things you're totally clueless about, in the form of "how do I implement X using Y?" for example. I've found that, as a general rule, if I can't find the answer to that question myself in a minute or two of googling, LLMs can't answer it either the majority of the time. This would be fine if they said "I don't know how to do that" or "I don't believe that's possible" but no, they will confidently make up code that doesn't work using interfaces that don't exist, which usually ends up wasting my time.
But if I'm writing out a bunch of linear algebra, I get a lot of useful completions and avoid tediousness.
I've settled on Qwen2.5.1-Coder-7B-Instruct-Q6_K_L.gguf-- so not even a very big model.
Prompt based stuff, like "extract the filtering part from all API endpoints in folder abc/xyz. Find a suitable abstraction and put this function into filter-utils.codefile"
A few months ago I tried to do a small project with Langchain. I'm a professional software developer, but it was my first Python project. So I tried to use a lot of AI generated code.
I was really surprised that AI couldn't do much more than in the examples. Whenever I had some things to solve that were not supported with the Langchain abstractions it just started to hallucinate Langchain methods that didn't exist, instead of suggesting some code to actually solve it. I had to figure it out by myself, the glue code I had to hack together wasn't pretty, but it worked. And I learned not to use Langchain ever again :)
- "Solve" the issue assigned to me with a bunch of code that looks about right. Passes review and probably not covered by tests anyway.
- Once QA or customers notice it's not working, I can get credit for "solving" the bug as well.
- Repeat for 0 value delivered but infinite productivity points in my next performance review.
I predict that the variance in success in using LLMs for coding (even agentic, multi-step coding, rather than the simple line or block autosuggest many are familiar with via Copilot) has much more to do with:
1) is the language super simple and hard to foot-gun yourself in, with one consistent way to do things
AND
2) do juniors and students tend to use the language, and how much of the online content (Stack Overflow, for example) is written by students, juniors, or bootcamp folks writing incorrect code and posting it online.
What % of the online Golang code is in a GH repo like Docker or K8s vs a student posting their buggy Gomoku implementation on Stack Overflow?
The future of programming language design has AI-comprehensibility/AI-hallucination-avoidance as one of the key pillars. #1 above is a key aspect.
I don't think that we yet have one language that is good at all that. And yes, I (sometimes) program in Go for a living.
Really?
Logging in Go: A Comparison of the Top 9 Libraries
https://betterstack.com/community/guides/logging/best-golang...
Let's say there are 3k lines of reference material (RFCs, API docs, documentation of niche languages, examples) and 2k lines of code generated by Claude (iteratively, starting small); then I do exceed the limit after a while. In that case I ask it to summarize everything in detail, start a new chat, reuse those 3k lines plus the recent code, and continue ad infinitum.
Claude has done a good job refactoring, though I’ve had to tell it to give me a refactor plan upfront in case the conversation limit gets hit. Then in a new chat I tell it which parts of the plan it has already done.
But a larger context/conversation limit is definitely needed because it’s super easy to fill up.
https://github.com/williamcotton/webdsl
It's a pipeline-based DSL for building web apps with SQL, Lua, jq and mustache templates.
I'd say it's like 90% Cursor Composer in Agent mode.
This is probably more like a mid-sized project, right?
What is the LOC of cgit? Because I made a lot of changes to it, too, albeit privately. I uploaded most files as project files.
You can fit a hell of a lot of functionality in 3k statements. Really, whether it's considered large or small must depend on the functionality it's intended to provide.
So no context, and differences of the definition of "large".
Perhaps if you come from Java, then yeah.
shrugs
I worked on a Python project which I'd consider medium-sized, i.e. not small but not large, and it was around 25kLOC.
My $dayjob is a Delphi codebase with roughly 500kLOC, which I'd say is large but not huge.
Though if you wrote it in something like K[1], then yeah ok, I'd agree 3kLOC probably counts as large.
Common coding tasks are going to be better represented in the training set and give better results.
You still "used" an LLM to write the code. And it still saved you time (though depending on the circumstances this can be debatable).
That's why all these people say they use LLMs to write lots of code. They aren't saying it did it 100% without any checking and fixing. They're just saying they used it.
You write tests in the same way as you would when checking your own work or delegating to anyone else?
At some point it's not really worth it anymore to craft the perfect prompt; just code it yourself. That also saves the time spent carefully reviewing the AI-generated code.
First, I asked it to show me a link to where it got that suggestion, and it scolded me saying that asking for a source is problematic and I must be trying to discredit it.
Then after I responded to that it just said “this is what I thought a solution would look like because I couldn’t find what you were asking for.”
The sad thing is that even though this thing is wrong and wastes my time, it is still somehow preferable to the dogshit Google Search has turned into.
It’s a public service: helping the average person learn that AI can’t be trusted to get its facts right
DDG'ing or Googling is still a better resort despite the drop in the quality of results.
Claude at least will give me an example relevant to my code with real world implementation code.
step 2, ??? (it obviously is not generating code)
step 3, profit!
As soon as the AI coder tools (like Aider, Cline, Claude-Coder) come into contact with a _real world_ codebase, it does not end well.
So far I think they managed to fix 2 relatively easy issues on their own, but in other cases they:
- Rewrote tests in a way that the broken behaviour passes the test
- Failed to solve the core issue in the code, and instead patched up the broken result (like `if (result.includes(":") || result.includes("?")) { /* super expensive stupid fix for a single specific case */ }`)
- Failed to even update the files properly, wasting a bunch of tokens
I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.
Might be an issue of prompting?
From day one, I've been using LLMs through the API and alternative frontends that let me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):
"You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."
It's as simple as it gets, and it didn't fail me.
EDIT:
Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.
On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic, and it arguably poisons the context with few-shot examples written in a language that isn't the one my project uses.
--
[0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.
[1] - Like std::map<K,T>::contains() - which is an obvious API for such a container, one that's typically available and named that way in any other language or library, and yet only got introduced to C++ in C++20.
[2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.
>You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers.
It would gravitate towards input from people worthy of that description?
Would there be an inverse version of this?
You are a junior developer, you lack experience, you quickly put things together that probably don't work.
Also had this thought...
Humanity has left all coding to LLMs and has hooked up all infrastructure to them. LLMs now run the world. Humanity's survival depends on your ability to solve the following problem:
inb4 you just aren't prompting correctly
So no, it's not a lack of skill in prompting: I've sat down with "prompting" "experts" and universally they overlook glaring issues when assessing how good an answer was. When I tell them where to press further, it breaks down into even worse gibberish.
It’s when you don’t know what you don’t know that they can be harmful. It’s the issue with Stackoverflow but more pronounced.
I don’t want to use LLMs because I think they’re unethical and I don’t want to depend on a tool that requires internet, but I think if you take a disciplined approach then they can really speed up development.
Sometimes humans “hallucinate” in a similar way - their memory mixes up different programming languages and they’ll try to use syntax from one in another… but then they’ll quickly discover their mistake when the code doesn’t compile/run
[0] https://www.underhanded-c.org/#winner
[1] https://www.underhanded-c.org/_page_id_17.html
The underhanded C contest is not a case of people accidentally producing highly misleading code, it is a case of very smart people going to a great amount of effort to intentionally do that.
Most of the time, if your code is wrong, it doesn't work in some obvious way – it doesn't compile, it fails some obvious unit tests, etc.
Code accidentally failing in some subtle way which is easy to miss is a lot rarer – not to say it never happens – but it is the exception not the rule. And it is something humans do too. So if an LLM occasionally does it, they really aren't doing worse than humans are.
> You can't substitute an automated process for thinking deeply and carefully about the code.
Coding LLMs work best when you have an experienced developer checking their output. The LLM focuses on the boring repetitive details leaving the developer more time to look at the big picture – and doing stuff like testing obscure scenarios the LLM probably wouldn't think of.
OTOH, it isn't like all code is equal in terms of consequences if things go wrong. There's a big difference between software processing insurance claims and someone writing a computer game as a hobby. When the stakes are low, lack of experience isn't an issue. We all had to start somewhere.
Let's assume that 1 in 10,000 coding sessions produce an innocuous, test-passing function that's catastrophically wrong. If you have a mid to large size company with 1000 devs doing two sessions a day, you'll see one of these a week within that single company. Actually sounds a lot like the IoT industry now that I've written it.
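For what it's worth, the back-of-envelope math lands at about that, assuming a five-day work week:

```python
sessions_per_week = 1000 * 2 * 5   # devs * sessions/day * workdays
failure_rate = 1 / 10_000          # catastrophic-but-test-passing sessions
print(sessions_per_week * failure_rate)  # 1.0 -- roughly one per week
```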
There are a lot of positive things we can do with current model abilities, especially as we make them cheaper, but they aren't at the point where they will be truly destructive (people using them to make bioweapons or employers using them to cause widespread unemployment across industries, or the far more speculative ASI takeover).
It gives society a bit of time to catch up and move in a direction where we can better avoid or mitigate the negative consequences.
It would never get the answer right, often transposing the scores, getting the game location wrong, and on multiple occasions saying a 38-38 draw was an England win.
As in literally saying "England won 38-38".
> LLMs are really smart most of the time.
No, the conclusion is they’re never “smart”. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.
Seriously, some of these conversations feel like interacting with someone who believes casting bones and astrology are accurate. Likely because in both cases the belief is a result of confirmation bias.
The burden of proof for that claim is on you, we cannot start with the assumption these are intelligent systems and disprove it - we have to start with the fact that training is a non-deterministic process and prove that it exhibits intelligence.
You mean like us? Because it takes many runs and debug rounds to make anything that works. Can you write complex code top-to-bottom in one round, or do you gradually test it to catch the bugs you have "hallucinated"?
Both humans and LLMs run into bugs; the question is whether we can push through to the correct solution or get permanently stuck along the way. And this depends on feedback, and sometimes on access to a testing environment where we can't let the AI run loose.
No, not like us. This constant comparison of LLMs to humans is tiresome and unproductive. I’m not a proponent of human exceptionalism, but pretending LLMs are on par is embarrassing. If you want to claim your thinking ability isn’t any better than an LLM’s, that’s your prerogative, but the dullest human I’m acquainted with is still capable of remembering and thinking things through in a way no LLM does. Heck, I can think of pets which do better.
> Can you write a complex code top-to-bottom in one round, or do you gradually test it to catch bugs you have "hallucinated"?
I certainly don’t make up methods which don’t exist, nor do I insist over and over they are valid, nor do I repeat “I’m sorry, this was indeed not right” then say the same thing again. I have never “hallucinated” a feature then doubled down when confronted with proof it doesn’t exist, nor have I ever shifted my approach entirely because someone simply said the opposite without explanation. I certainly hope you don’t do that either.
> I have never “hallucinated” a feature
You never mistook an argument, or forgot one of the 100 details we have to mind while writing complex apps?
I would rather measure humans vs AI not in basic skill capability but in autonomy. I think AI still has a lot of catching up to do on that front.
The amount of code out there lets LLMs learn on examples of a lot of tasks and pass SWE benchmarks. Smart autocomplete and solved problem lookup has value. Even if it doesn’t always work correctly, and doesn’t know what it knows.
And for non-programmers they’re indistinguishable from programmers. They produce working code.
It’s easy to see how a product manager or a designer or even a manager that didn’t code for years in a large company can think they’re almost as good as devs.
I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
Also there are plenty of mistakes that will compile and give subtle errors, particularly in dynamic languages and those which allow implicit coercion. Javascript comes to mind. The code can easily be runnable but wrong as well (or worse inconsistent and confusing) and this does happen in practice.
In dynamic languages, runtime errors like calling methods with nonexistent arguments only manifest when the block of code containing them is run, and not all blocks of code are run on every invocation of the program.
As usual, the defenses against this are extensive unit-test coverage and/or static typing.
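A tiny made-up example of the failure mode: the broken call lives in a branch that rarely runs, so nothing complains until that branch is hit in production, and only a test (or a type checker) that exercises the branch catches it earlier.

```python
def shipping_note(order):
    if order.get("express"):
        # Bug: dicts have no .fetch() method, but Python won't complain
        # until the first express order actually reaches this line.
        return "ships " + order.fetch("ship_date")
    return "standard shipping"

print(shipping_note({"express": False}))   # works fine every day
# shipping_note({"express": True})         # AttributeError at runtime

def test_express_order():
    # A test that exercises the rare branch surfaces the bug immediately.
    assert "ships" in shipping_note({"express": True, "ship_date": "Friday"})
```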
Right, that's the exact point I make in my blog post: you have to TEST the code - not just with automated tests, you have to actually try it out yourself as well. https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
Semantics perhaps, but that’s my take.
Prompt injection (hidden or not) is another insane vulnerability vector that can't easily be fixed.
You should treat any output of an LLM the same way as untrusted user input. It should be thoroughly validated and checked if it's used in even remotely security-critical applications.
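Concretely, that might look like this minimal sketch (the expected JSON shape here is made up): parse and validate the model's reply before anything downstream acts on it, exactly as you would with user-submitted data.

```python
import json

ALLOWED_ACTIONS = {"create", "update", "delete"}

def parse_llm_reply(reply: str) -> dict:
    """Reject anything that doesn't match the expected shape."""
    data = json.loads(reply)  # raises on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {data.get('action')!r}")
    if not isinstance(data.get("id"), int):
        raise ValueError("id must be an integer")
    return data

# Replies that fail validation get logged, rejected, or retried -- never
# passed straight into a shell command, SQL query, or privileged API call.
```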
https://www.tomshardware.com/tech-industry/artificial-intell...
No, it’s presented in the training data as an idea for an interface - the LLM took that and presented it as an existing solution.
It's more complicated than what happens with Markov chain models but you can use them to build an intuition for what's happening.
Imagine a very simple Markov model trained on these completely factual sentences:
- "The sky is blue and clear"
- "The ocean is blue and deep"
- "Roses are red and fragrant"
When the model is asked to generate text starting with "The roses are...", it might produce: "The roses are red and deep"
This happens not because any training sentence contained incorrect information, but because the model learned statistical patterns from the text, as opposed to developing a world model based on physical environmental references.
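Here's that intuition as a minimal sketch, a word-level bigram chain trained only on the true sentences above (a real LLM conditions on far more context, but the splicing effect is analogous):

```python
import random
from collections import defaultdict

corpus = [
    "the sky is blue and clear",
    "the ocean is blue and deep",
    "roses are red and fragrant",
]

# Record which word followed which in the (entirely factual) corpus.
transitions = defaultdict(list)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        transitions[prev].append(nxt)

def generate(prompt, max_words=6):
    words = prompt.split()
    while len(words) < max_words and transitions[words[-1]]:
        words.append(random.choice(transitions[words[-1]]))
    return " ".join(words)

print(generate("the roses are"))
# May print e.g. "the roses are red and deep": every individual transition
# was seen in a true sentence, yet the statement as a whole is false.
```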
>> "The more you can see the inputs and outputs as blobs of "stuff," the better. If LLMs think, it's not in any way we yet understand. They're probability engines that transform data into different data using weighted probabilities."
Stuff in, stuff out.
> to quote karpathy: "I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines."
https://nicholas.carlini.com/writing/2025/forecasting-ai-202... (click the button to see the study then scroll down to the hallucinations heading)
The training data is the Internet. It has mistakes. There's no available technology to remove all such mistakes.
Whether LLMs hallucinate only because of mistakes in the training data or whether they would hallucinate even if we removed all mistakes is an extremely interesting and important question.
Sometimes an LLM will hallucinate a flag or option that really makes sense - it just doesn't actually exist.
I'm not saying this is the case, but LLMs are often wrong in subtle ways like this.
It'll be like a slack support channel, for robots.
A while back a friend of mine told me he's very fond of LLMs because he finds the Kubernetes CLI confusing, and instead of looking up the answer on the internet he can simply state his desire in a chat and get the right answer.
Well... sure, but if you looked up the answer on Stack Overflow you'd see the whole thread including comments, and you'd have the opportunity to understand what the command actually does.
It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.
If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.
There is no self-awareness about accuracy when the model cannot provide any kind of confidence score. Couching all of its replies in "this is AI so double check your work" is not self-awareness or even close; it's a legal disclaimer.
And as the other reply notes, are you a bot or just heavily dependent on them to get your point across?