I'd also hesitate to attribute this difference in hallucination rates purely to model size. Yes, GLM-5.2 hallucinates much less frequently than DeepSeek-V4 Pro, which has twice as many parameters, but DeepSeek-V4 Flash is less than half the size of GLM-5.2 and tops the AA-Omniscience hallucination index. Opus 4.8, which is likely larger than DeepSeek-V4 Pro, has a 36% hallucination rate on the index, above GLM-5.2's 28% but way below the DeepSeek numbers. Opus also has a 47% accuracy rate vs GLM-5.2's 25%. If you use these numbers to calculate the absolute hallucination rate (i.e., the number of hallucinated responses divided by the total number of responses, or (1 - accuracy) × hallucination rate), you get 19% for Opus and 21% for GLM-5.2.
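To make that arithmetic easy to check, here's a rough sketch. It assumes the index's hallucination rate means "share of non-correct responses that were a wrong answer rather than an abstention", which is my reading of the numbers, not something I've confirmed against the benchmark's docs. The function name is just made up for illustration.

```python
def absolute_hallucination_rate(accuracy: float, hallucination_rate: float) -> float:
    """Share of ALL responses that are hallucinated.

    Assumption (not confirmed against the benchmark docs): hallucination_rate
    is measured only over the responses the model did not get right, so the
    absolute rate is (1 - accuracy) * hallucination_rate.
    """
    return (1.0 - accuracy) * hallucination_rate


# (accuracy, hallucination rate) as quoted in the comment above
models = {
    "Opus 4.8": (0.47, 0.36),
    "GLM-5.2": (0.25, 0.28),
}

for name, (accuracy, hallucination_rate) in models.items():
    rate = absolute_hallucination_rate(accuracy, hallucination_rate)
    print(f"{name}: {rate:.0%} of all responses hallucinated")
```

Under that assumption Opus comes out around 19% and GLM-5.2 around 21%, so the gap mostly disappears once you account for how often each model is simply right.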
So yes, all else equal, larger models may be more prone to hallucination in scenarios where they don't know the answer, but there are a lot of other factors that affect hallucination rates, and it's not totally clear that this is the main metric that's worth tracking.
If Opus gets all but the hardest questions right, it might have a higher hallucination rate because the questions it gets wrong are the questions where verification or hallucination detection is the most difficult.
Do you have a cite for this?
If a human makes up some bullshit lie, I wouldn't reserve the accusation of making it up for cases where they actually knew the correct answer. If you don't know, the only correct answer is "I don't know". Any other answer is made-up bullshit. Why is it a hallucination only if the LLM contains the answer? If you make something up, it's still wrong. It shouldn't matter if you could have given the correct answer. You didn't, and invented some bullshit instead.
Follow-up question: how can I apply this rule set to the next test I have to take? I'd love to be able to use "I didn't know" as the excuse for why I made something up.
edit:
> and it's not totally clear that this is the main metric that's worth tracking.
I don't know, the rate at which some model is willing to make something up feels useful. If the argument I see repeated on HN so much is that it's impossible to completely get rid of hallucinations, then being able to choose a model that's less likely to invent some lie seems like a positive trait, no?
Either way, I'm happy to agree that a restrictive definition, where a lie doesn't count as a hallucination iff the model doesn't know the answer, feels strictly, infinitely less useful than an exact error rate. Knowing what percentage of emitted tokens are misleading would be useful to me. Anyone know of any group that's attempted to quantify the global error rate?
What about using two models, with a smaller model used for this kind of negative reasoning?
Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!
Everyone has been acting like the best models only hallucinate in edge cases, but even the best-performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
That said, I think the title on the blog, "Bigger models are not the way", is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuation of companies like OpenAI and xAI is largely based around the (absurd) idea of ever-increasing scaling from these models.