For example, yesterday I got a list of some study resources for abstract algebra. Claude referred me to a series by Benedict Gross (which is excellent, btw). It gave me a link to Harvard’s website, but it was a 404, and it was only with further searching that I found the real thing. It also suggested a YouTube playlist by Socratica (again, this exists, but the URL was wrong) and one by Michael Penn (same deal).
Literally every reference was almost right but actually wrong. How does anyone have the confidence to ship a legal brief that an AI produced without checking it thoroughly?
Casually scrolling through TechCrunch I see over $1B in very recent investments into legal-focused startups alone. You can't push the messaging that the technology to replace humans is here and expect people will also know intrinsically that they need to do the work of checking the output. It runs counter to the massive public rollout of these products which have a simple pitch: we are going to replace the work of human employees.
And I don’t mean essays edited with ChatGPT, but essays that are clearly verbatim output. When the teacher asks the students to read them out loud to the class, they stumble over words and grammar that are obviously way beyond anything we’ve studied. The utter lack of self-awareness is both funny and really sad.
> [learning a language is] only helpful for high professional tasks or close literary study or prestige.
This is a person who doesn't understand actual face-to-face communication, like, at all. Even though translation apps are amazing, in a social interaction, there's no getting over the imposition of the halting, hesitant back-and-forth of device-assisted translation. Sure, you can almost always eventually get your point across, but you're never going to set the other party at ease in the same way as speaking their language yourself.
When they’re on vacation? Very few people are going to learn a language that they could use for a week or two in a place where people probably speak English better than whatever language they’re attempting anyway.
Obviously there are exceptions.
My social studies teacher in 8th grade would give us a list of things and phrases by decade throughout the year. Loved these assignments. Totally threw myself into them. Years later, I was working at a prep school and wanted the students to be assigned Billy Joel's We Didn't Start the Fire as a similar assignment (competition?). The teachers thought it was stupid, that the kids wouldn't like it (um, who cares?), and that it was too hard. The teachers' responses only confirmed what I thought about the school (meh) and that it was a sad day when curiosity was a bad thing. (I was the fundraiser for the school so I didn't really interact with the kids a lot, but the ones I knew would have had fun with such a project.)
Anyway, the Lindell lawyers must have gone to this school, or one like it. How is it ever okay to do this and think it's a good idea? And, how the heck did these people pass the Bar?
Edit: List of references in We Didn't Start the Fire https://en.wikipedia.org/wiki/List_of_references_in_We_Didn%.... Gonna blog the references and this post on my little policy blog this week :-)
Just two days ago, I gave it a list of a dozen article titles from a newspaper website (The Guardian), asked it to look up their URLs and give me a list, and to summarise each article for me, and it made no mistakes at all.
Maybe your task was more complicated to do in some way, maybe you're not paying for ChatGPT and are on a less able model, or maybe it's a question of learning how to prompt, I don't know, I just know that for me it's gone from "assume sources cited are bullshit" to "verify each one still, but they're usually correct".
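For the "verify each one still" step, here is a minimal sketch of a bulk link check using only the Python standard library. The function names (`fetch_status`, `check_links`) and the User-Agent string are made up for illustration, and it assumes the server answers HEAD requests, which not every site does. A 2xx status only shows the page exists, not that it says what the chatbot claimed.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, TimeoutError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def check_links(urls, fetch=fetch_status):
    """Split urls into (ok, broken) lists of (url, status) pairs.

    A link counts as ok only when the final status is 2xx.
    `fetch` is injectable so the logic can be tried without network access.
    """
    ok, broken = [], []
    for url in urls:
        status = fetch(url)
        is_ok = status is not None and 200 <= status < 300
        (ok if is_ok else broken).append((url, status))
    return ok, broken
```

Calling `check_links(list_of_cited_urls)` returns the two lists; anything in the second one needs a manual search, like the Gross lectures did.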
Something missing from this conversation is whether we're talking about the raw model or model+tool calls (search). This sounds like tool calls were enabled.
And I do think this is a sign that the current UX of the chatbots is deeply flawed: even on HN we don't seem to interact with the UI components to toggle these features frequently enough that they're the intuitive answer, instead we still talk about model classes as though that makes the biggest difference in accuracy.
But the reason I suggested model as a potential difference between me and the person I replied to, rather than ChatGPT interface vs. plain use of model without bells and whistles, is that they had said their trouble was while using ChatGPT, not while using a GPT model over the API or through a different service.
[#] (Technically I didn't, and never do, have the "search" button enabled in the chat interface, but it's able to search/browse the web without that focus being selected.)
And on the flip side, my local Llama 3 8b does a pretty good job at avoiding hallucinations when it's hooked up to search (through Open WebUI). Search vs no-search seems to me to matter far more than model class.
These models aren't (yet, at least) clever enough to understand what they do or don't know, so if you're not directly telling them when you want them to go and find specific info rather than guess at it you're just asking a mystic with a magic ball.
It doesn't add much to the length of prompts, just a matter of getting in the habit of wording things the right way. For the request I gave as my example a couple of comments above, I wrote "Please search for every one of the Guardian articles whose titles I pasted above and give me a list of URLs for them all." whereas if you write "Please tell me the URLs of these Guardian articles" then it may well act as if it knows them already and return bullshit.
You cannot ask it to have crop yield as a column in a chart and get accurate information.
It only seems reasonable when producing a single list of items. Ask it for two columns of data and it starts making things up, like bogus Wikipedia links.
You could definitely make the argument I'm using it wrong but this is how people try to use it. I still find this useful because it gives me a start on where to point my research or ask clarifying questions.
It's much better at giving you a list of types of beer and wine that's been produced in history. Just don't trust any of the dates.
I would like a list of east Indiamen from 1750 to 1800 where you can find how many tons burthen and how many crew. Show as a chart and give me the wikipedia links to the ships. Do not include any ships that do not have wikipedia links.
Here's my customization:
What do you do?:
Software Engineer
What traits should ChatGPT have?:
Show all the options
Be practical above all.
Anything else ChatGPT should know about you?:
I’m an author of science fiction and fantasy.
I like world building for stories.
I know there's hundreds of ways to phrase this and I could probably trick it into generating the chart first and finding the wikipedia links second. :)
I'm not sure which update improved 4o so greatly but I get better responses from 4o than from o4-mini, o4-mini-high, and even o3. o4 and o3 have been disappointing lately - they have issues understanding intent, they have issues obeying requests, and it happened multiple times that they forgot the context even though the conversation consisted of only 4 messages without a huge number of tokens. In terms of chain-of-thought models I prefer DeepSeek over any OpenAI model (4.5 research seems great, but it’s just way too expensive).
It's rather disappointing how OpenAI releases new models that seem incredible, and then, to reduce the cost of running them, they slowly slim these models down until they're just not that good anymore.
Make a claim, prove it, especially one so easily proven.
It has likely never occurred to them that such checks are necessary. Why would it, if they've never performed such checks, nor happen to have been warned by AI critics?
>For example, yesterday I got a list of some study resources for abstract algebra. Claude referred me to a series by Benedict Gross (which is excellent, btw). It gave me a link to Harvard’s website, but it was a 404, and it was only with further searching that I found the real thing. It also suggested a YouTube playlist by Socratica (again, this exists, but the URL was wrong) and one by Michael Penn (same deal).
FWIW, I've found Penn's content to be quite long-winded and poorly edited. The key idea being presented often makes up hardly any of the video's runtime, so I'm just sitting there watching the guy actually write out the steps solving an equation (and making trivial errors, and not always correcting them).
Because its makers don't care about precision or correctness. They care only about convincing the people that matter that gaping software bugs are "hallucinations" that can never be fixed 100%, and that that is an acceptable outcome.
For every story about AI-assisted legal briefs with tons of mistakes, there will be 100 PR-ridden pieces about how $LATEST_VERSION has passed the bar exam or has discovered a miracle drug. There will never be a story about AI successfully arguing a real case however, because the main goal was only about selling a vision of labor replacement. Whether or not the replacement can do the job as specified is immaterial.
I could see that, especially with sloppy lawyers in the first place. Or, I could see it being a convenient "the dog ate my homework" excuse.
I think people who do are simply not aware that AI is not deterministic the same way a calculator is. I would feel entirely safe signing my name on a mathematical result produced by a calculator (assuming I trusted my own input).
The problem is that all output is a "hallucination", and only some of it coincidentally matches the truth. There's no internal distinction between hallucination and truth.
[0] Theoretically; race conditions in a parallel implementation could add non-determinism.
Which doesn't detract from your main point: there's not a lot of distinction between hallucinations and what we'd consider to be the "real thing." There have been various attempts to measure hallucinations, and we can figure out things like how confident the model is in a particular answer...but there's nothing grounding that answer. Saturate the dataset with the wrong answer and you'll get an overconfident wrong result.
And while it might be important in some contexts, like debugging using either the exact same or different seeds, isn't this a context where it rather confuses the issue?
Hallucinations happen when the model determines that the most likely suitable string of tokens turns out to contain incorrect information, regardless of whether the correct information is "missing" or whether the correct information actually would have been outputted had it, when selecting the first token of the response, instead selected the option that it considers second best rather than best.
Whether or not a piece of information was in the training set can obviously influence the likelihood of a model hallucinating when asked about the subject, but it can easily hallucinate about stuff that was in the training and it can also get things right that weren't in the training data.
Or at least this is how I interpret the term.
> "If an LLM happens to know the answer to your question, that answer will have the greatest weight"
An LLM doesn’t “know” anything in the way you’re imagining. It doesn’t have stored facts or indexed knowledge to check against, it just has weights learned between token sequences, and it outputs whatever next token is assigned the highest probability given the prompt and prior context. That might happen to produce a correct answer (and people are obviously working hard to make the models produce right answers as often as possible), but it might just as easily produce a plausible-sounding but wrong one, even if the correct information was in the training data. Because that correct information being there doesn't guarantee it will ever have the highest weighting, let alone the highest weighting in all contexts of previous tokens and at all temperature settings.
You’re right that hallucinations can sometimes look like “extrapolations” that happen to land correctly, but that’s incidental. It’s still doing the same token-by-token probability selection regardless of whether it ends up right or wrong.
Framing it around “missing knowledge” vs “existing knowledge” is misleading intuition. It’s better to think about it in terms of probability distributions over token sequences: the model’s training biases it toward correct sequences more often than incorrect ones, but there’s nothing fundamental in the architecture that guarantees that if the answer was present in training, it will always beat out wrong guesses.
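To make the probability framing concrete, here is a toy sketch of temperature-scaled softmax over next-token scores. The tokens and logit values are invented for illustration and do not come from any real model; the point is only that the top-scoring token is whatever the weights favor, which need not be the correct one, and that temperature changes how often the alternatives get sampled.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature=1.0, rng=random):
    """Sample one token according to the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical scores for the token after "The capital of Australia is".
# The correct answer is present, but it is not the highest-scoring option.
tokens = ["Sydney", "Canberra", "Melbourne"]
logits = [2.0, 1.8, 0.5]

probs = softmax(logits)
greedy = tokens[probs.index(max(probs))]  # greedy decoding picks the top score
sampled = sample_token(tokens, logits, temperature=1.5, rng=random.Random(0))
```

With these made-up numbers, greedy decoding returns the plausible but wrong token even though the right one is in the distribution, and sampling at a higher temperature only changes the odds, not the mechanism.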
p.s. It's late at night here and I'm about to go to bed, so apologies if I've not explained well in this comment - I gave it to ChatGPT hoping it could tidy things up for me and it just made a way more confusing version so I'm posting it as is :D Let me know if my explanation still isn't clear and I could try again, or answer any questions you have, tomorrow
Neither does your brain and yet you do "know" something.
> but it might just as easily produce a plausible-sounding but wrong one, even if the correct information was in the training data
If the majority of information that was in the LLM's training data said 1 + 1 = 3, the LLM will tell you that 1 + 1 = 3, even if there was some information that said 1 + 1 = 2, and there's nothing wrong with that because the LLM is not supposed to fact-check.
> the model’s training biases it toward correct sequences more often than incorrect ones
No, the model's training biases it toward sequences that appear more frequently.
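The frequency claim can be illustrated with a deliberately crude stand-in: a counter that completes a prompt with whatever continuation appears most often in its corpus. This is only a sketch of the "more frequent wins" intuition, not how a transformer is trained or decoded, and `most_frequent_completion` and the four-line corpus are made up for the example.

```python
from collections import Counter


def most_frequent_completion(corpus, prompt):
    """Return the continuation that most often follows prompt in corpus, or None."""
    counts = Counter(
        line[len(prompt):].strip() for line in corpus if line.startswith(prompt)
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Three wrong statements outnumber one right one.
corpus = ["1 + 1 = 3", "1 + 1 = 3", "1 + 1 = 3", "1 + 1 = 2"]
answer = most_frequent_completion(corpus, "1 + 1 =")
print(answer)  # prints "3"
```

Nothing in the counter checks arithmetic; it reports the majority continuation, which is the behavior the 1 + 1 = 3 example describes.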
https://chatgpt.com/share/680dc86c-f0dc-800d-9f04-57ba2f126a...
https://chatgpt.com/share/680dc90b-de28-800d-92b6-f2ef824777...
Note how applying increasing pressure to answer was what caused the hallucination: hallucinations aren't tied to if the model "knows" something.
Once the tokens output don't fall into the start of some variation of "I don't know", the model is going to answer regardless of what it knows.
Reasoning doesn't solve it. o3 thinks and comes up empty, and o4-mini thinks and then hallucinates much worse than even 4o did: https://chatgpt.com/share/680e7a5c-ae3c-8004-bff9-5c8a0215a1...
You're missing the point. It doesn't "know" anything. The only thing it can "know" is the statistical relationships between tokens in its dataset. It doesn't "know" anything about the meaning of those tokens. It doesn't even "know" whether it "knows" anything or not. The best it can do is "Here's a recursively generated string of ASCII codes that are statistically likely to follow each other according to the data corpus."
It's Rashomon. It can point you in the right directions a lot of the time, but there's no getting around the fact that you have to double-check its answers with external sources.
> Or at least this is how I interpret the term.
That's not a very useful interpretation because it's not grounded in technical reality.
The word know is an abstraction I use in order to avoid going into technical details.
> That's not a very useful interpretation because it's not grounded in technical reality.
My interpretation aligns with what people generally mean by hallucination, and it's definitely more useful than saying that any output is hallucination.
I'm afraid I don't personally see how to explain more clearly, so will just say instead that given multiple people are in this thread telling you your understanding of how LLMs work isn't right, please consider that to at least be a possibility and look into it further rather than digging deeper into your current beliefs.
They're treating it like they would a paralegal. Typically this means giving a research task and then using their results, but sometimes lawyers will just have them write documents and ship it, so to speak.
This is making me realize that Tech Bros treat ChatGPT like the 1930s secretary they never got to have
Glad to see that this is the outcome. Similar to bribes and other similar issues, the hammer has to be big and heavy so that people stop considering this as an option.
30+ years ago when I was in law school [1] I would practice legal research by debunking sovereign citizen and related claims on Usenet. The errors listed above are pretty much a catalog of common sovereign citizen legal research errors.
Just add something about gold fringed flags and Admiralty jurisdiction and it would be nearly complete.
The sovereign citizen documents I debunked were usually not written by lawyers. At best the only legal experience the authors usually had was as defendants who had represented themselves and lost.
Even they usually managed to only get a couple major errors per document. That these lawyers managed to get such a range of errors in one filing is impressive.
[1] I am not a lawyer. Do not take anything I write as legal advice. Near the end of law school I decided I'd rather be a programmer with a good knowledge of law than a lawyer with a good knowledge of programming and went back to software.
Way too many people think that LLMs understand the content in their dataset.
What annoys me more about this type of response is that I feel there's a less rude way to express the same.
The ChatGPT responses seem to generally be in the tone of someone who has a harder question that requires a human (not googleable), and the laziness is the answer, not the question.
In my view the role of who is wasting others time with laziness is reversed.
_Everything_ that the magic robot spits out needs to be fact checked. At which point, well, really, why bother? Most people who depend upon the magic robot are, of course, not fact checking, because that would usually be slower than just doing the job properly from the start.
You also see people using magic robot output for things that you _couldn't_ Google for. I recently saw, on a financial forum, someone asking about ETFs vs investment trusts vs individual stocks with a specific example of how much they wanted to invest (the context is that ETFs are taxed weirdly in Ireland; they're allowed accumulate dividends without taxation, but as compensation they're subject to a special gains tax which is higher than normal CGT, and that tax is assessed as if you had sold and re-bought every eight years, even if you haven't). Someone posted a ChatGPT case study of their example (without disclosing, tsk; they owned up to it when people pointed out that it was totally wrong).
ChatGPT, in its infinite wisdom, provided what looked like a detailed comparison with worked examples... only the timescale for the individual stocks was 20 years, the ETFs 8 years (also it screwed up some of the calculations and got the marginal income tax rate a few points wrong). It _looked_ like something that someone had put some work into, if you weren't attuned to that characteristic awful LLM writing style, but it made a mistake that it's hard to imagine a human ever making. Unless you worked through it yourself, you'd come out of it thinking that individual stocks were clearly a _way_ better option; the truth is considerably less clear.
Ah, we didn't know just how good we had it...
(At least it is (was?) real humans doing the writing, you can look at the modification history, well-made articles have sources, and you can debate issues with the article on the Talk page and maybe even contribute directly to it...)
Outside of the literal dollar cost, the opportunity cost here is further delays on the docket because the clerk was unable to do something else, and the court time that must now be spent dealing with the issue.
Second, the mistakes weren't just incorrect citations any paralegal could check.
He’s a bankrupt, likely mentally ill acolyte of a dude who is infamous for stiffing his lawyers. His connection with reality is tenuous at best.
Bingo. This has nothing to do with ideology. Good lawyers like to win. And when a client is demonstrably too stupid to let them do that, why bother.
That leaves only those as lawyers who already have zero reputation left to lose, want to make a name for themselves in the far-right scene, who are members of the cult as well, and those who think they can milk an already dead/insolvent horse.
Jones is a good example of this. He cycled through about 20 different lawyers during the Sandy Hook trials. The reason he was defaulted is that whenever he was required to produce something, he'd fire the lawyers (or they'd quit) and hire new ones, and invariably in the depositions the answer to "did you bring this document the court mandated that you produce" was "oh, sorry, I'm brand new to this case and didn't know anything about that".
Jones wasn't cooperating with his lawyers.
There are plenty of good lawyers that have no problem representing far right figures. The issue really comes down to those figures being willing to follow their lawyer's advice.
The really bad lawyers simply don't care if their clients ignore their advice.
Everything the right accuses anyone of, they're doing it too. That's why they don't really care about criminals and pedophiles and racists in their ranks. They think everyone is a child diddling criminal racist.
Remember Michael Avenatti?
I don’t think that nexus is political, for either party. It’s all tied to one man.
But here we have an example of someone not escaping justice due to his now-evaporated wealth. I'd call it a positive.
https://www.sfgate.com/bayarea/article/controversy-californi...
The lawyer jokes aren't funny anymore...
A much worse failure seems to have been the incompetent software used to run the tests, and the fact that, for something so high-level, they decided to administer it by computer and use multiple-choice questions in the first place.
Although different states are involved, perhaps this goes some way toward explaining how Lindell's lawyers could have passed their bar exams.
Bar exams are funny things. Most states have a reciprocity with the NY bar, so when you think lawyer, think the NY bar.
But California is considered a harder bar to pass and has little reciprocity.
Somewhat surprisingly the hardest bar is Louisiana's. This is because their legal system is a crawdad fucking mess. They inherited their code based system from the French for a lot of local matters, but then also have to deal with the precedent based system the rest of the US uses. So you have to memorize two completely different types of law at a very high level. So, if you ever meet a Louisiana lawyer, you know you've met a very intelligent and dedicated person.
Obligatory IANAL here.
I mean, that's always been tech's modus operandi....
And, I mean, they're probably right, because, well, see the pillow guy's lawyer.
These stories are important; you personally don't have to read them if you're tired. But the more cases there are, the bigger the extant threat, and the more we need to be educated so we can defend against it.
We are all going to be affected by the omnipresent reliance on AI that allows people to rush out their tasks and get home from work sooner.
1. Use reasoning models and include in the prompt an instruction to check the cited cases and verify holdings. 2. Take the draft, run it through ChatGPT deep research, Gemini deep research, and Claude, and tell each to verify holdings.
I still double check, for now, but this is catching every hallucination.
Whew, that's 4 LLM inference requests and still requires manual checking. Criminal levels of waste and inefficiency. Learn how to use LexisNexis, spend some time in a law library handling actual physical casebooks. Learn to do your job.
And, part of the process is to do some research first, find the key cases, and the briefs of better lawyers on the same issue, and include them in the context.
LexisNexis rates vary quite a lot but $200/month for a small law firm is in the ballpark.
With the Court's reply to Lindell, you now have an independent test case upon which to test your verification process and compare results against a "rival implementation" -- the Court's. One wonders if it may be AI-assisted as well. I'd be quite interested in hearing how the two stack up.
From the article, it looks like this brief was dated Feb 25 this year.