LLMs believe false statements even after explicit warnings that they're false (arstechnica.com)

6 points by isaacfrond 7 hours ago | 1 comment

Terr_ 7 hours ago
I'd like to make a little point by re-stating the title in an (admittedly awkward) way:
"When you have a story-document with a human-character telling an AI-character that something is false, an LLM run on that document may generate a new fragment where the AI-character has dialogue stating the thing is still true."
Yes, this takes the magic out of it, but that's the goal. Our instinct to psychoanalyze the mind of the character is a trap, because that mind never existed except in the imagination of the reader.
> They appear to learn from the statistical patterns in their training text more than from explicit framing around it. Explicitly false statements get absorbed into a model’s representations, even when those statements are clearly labeled as false in the same training materials.
Right, and to relate this back to my argument, the result makes more sense (and is easier to anticipate) when we avoid confusing the real-world system with the AI-character we "see" in the output.