AbsenceBench: Language models can't tell what's missing(arxiv.org)

323 pointsby JnBrymn7 days ago25 comments

birdfood7 days ago
Perhaps related, after watching a talk by Gerald Sussman I loaded an image of the Kanizsa triangle into Claude and asked it a pretty vague question to see if it could “see” the inferred triangle. It recognised the image and went straight into giving me a summary about it. So I rotated the image 90 degrees and tried in a new conversation, it didn’t recognise the image and got the number of elements incorrect:
This image shows a minimalist, abstract geometric composition with several elements:
Four black shapes that appear to be partial circles or "Pac-Man" like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image Two thin black triangular or arrow-like shapes - one pointing upward in the upper left area, and one pointing to the right in the center-right area All elements are arranged on a light gray or off-white background
- latentsea6 days ago
  I guess they will now just rotate all the images in the training data 90 degrees too to fill this kind of gap.
  - recursivecaveat6 days ago
    Everything old is new again: in the Alexnet paper that kicked off the deep learning wave in 2012, they describe horizontally flipping every image as a cheap form of data augmentation. Though now that we expect models to actually read text that seems potentially counter-productive. Rotations are similar, in that you'd hope it would learn heuristics such as that the sky is almost always at the top.
    latency-guy26 days ago
    At least from when I was still doing this kind of work, look angle/platform angle scatterer signal (radar) mattered more than rotation, but rotation was a simple way to get quite a bit more samples. It never stopped being relevant :)
    bonoboTP6 days ago
    That's called data augmentation. It was common alredy before AlexNet. And it never stopped being common, it's still commonly done.
  - mirekrusin6 days ago
    That's how you train neural network with synthetic data so it extracts actual meaning.
    That's how humans also learn ie. adding numbers. First there is naive memoization, followed by more examples until you get it.
    LLM training seems to be falling into memoization trap because models are extremely good at it, orders of magnitude better than humans.
    IMHO what is missing in training process is this feedback explaining wrong answer. What we're currently doing with training is leaving out this understanding as "exercise to the reader". We're feeding correct answers to specific, individual examples which promotes memoization.
    What we should be doing in post training is ditch direct backpropagation on next token, instead let the model finish its wrong answer, append explanation why it's wrong and continue backpropagation for final answer - now with explanation in context to guide it to the right place in understanding.
    What all of this means is that current models are largely underutilized and unnecessarily bloated, they contain way too much memoized information. Making model larger is easy, quick illusion of improvement. Models need to be squeezed more, more focus needs to go towards training flow itself.
    atwrk6 days ago
    > That's how humans also learn ie. adding numbers. First there is naive memoization, followed by more examples until you get it.
    Just nitpicking here, but this isn't how humans learn numbers. They start at birth with competency up to about 3 or 5 and expand from that. So they can already work with quantities of varying size (i.e. they know which is more, the 4 apples on the left or the five on the right, and they also know what happens if I take one apple from the left and put it to the others on the right), and then they learn the numbers. So yes, they learn the numbers through memorization, but only the signs/symbols, not the numeric competency itself.
    mirekrusin6 days ago
    Turtles all the way down, things like meaning of "more" is also memoized ie initially as "I want more food" etc. then refined with time, ie. kid saying "he's more than me" is corrected by explaining that there needs to be some qualifier for measurable quantity ie. "he's more tall (taller) than me" or "he is more fast (faster) than me" etc.
    Using different modalities (like images, videos, voice/sounds instead of pure text) is interesting as well as it helps completing the meaning, adds sense of time etc.
    I don't think we're born with any concepts at all, it's all quite chaotic initially with consistent sensory inputs that we use to train/stabilise our neural network. Newborns for example don't even have concept of separation between "me and the environment around me", it's learned.
    atwrk4 days ago
    > I don't think we're born with any concepts at all, it's all quite chaotic initially with consistent sensory inputs that we use to train/stabilise our neural network.
    That is exactly the thing that doesn't seem to be true, or at least it is considered outdated in neuroscience. We very much have some concepts that are inert, and all other concept we learned in relation to the things that are already there in our brains - at birth mostly sensorymotor stuff. We decidedly don't learn new concepts from scratch, only in relation to already acquired concepts.
    So our brains work quite a bit different than LLMs, despite the neuron metaphor used there.
    And regarding your food example, the difference I was trying to point out: For LLMs, the word and the concept, are the same thing. For humans they are different things that are also learned differently. The memorization part (mostly) only affects the word, not the concept behind it. What you described was only the learning of the word "tall" - the child in your example already knew that the other person was taller than them, it just didn't know how to talk about that.
    mirekrusin4 days ago
    LLMs name became misnomer once we started directly adding different modalities. In that sense "word and concept" is not the same thing because multimodal LLM can express it in ie. image and sentence.
  - littlestymaar6 days ago
    And it will work.
    I just whish the people believing LLM can actually reason and generalize see that they don't.
    ben_w6 days ago
    If that was evidence current AI don't reason, then the Thatcher effect would be evidence that humans don't: https://en.wikipedia.org/wiki/Thatcher_effect
    LLMs may or may not "reason", for certain definitions of the word (there are many), but this specific thing doesn't differentiate them from us.
    t-36 days ago
    Being tricked by optical illusions is more about the sensory apparatus and image processing faculties than reasoning, but detecting optical illusions is definitely a reasoning task. I doubt it's an important enough task to train into general models though.
    latentsea6 days ago
    At this point think all reasoning really means is having seen enough of the right training data to make the correct inferences, and they're just missing some training data.
- Workaccount26 days ago
  Show any LLM a picture of a dog with 5 legs watch them be totally unable to count.
  - pfdietz6 days ago
    Or watch them channel Abraham Lincoln.
- JohnKemeny6 days ago
  We really don't know how to compute.
  Oct 2011, 30 comments.
  https://news.ycombinator.com/item?id=3163473
  Strange loop video:
  July 2011, 36 comments.
  https://news.ycombinator.com/item?id=2820118
- iknownothow6 days ago
  As far as I can tell, the paper covers text documents only. Therefore your example doesn't quite apply.
  It is well known that LLMs have a ways to go when it comes to processing images like they process text or audio.
  I don't think there's any good performing multimodal model that accepts image pixels directly. Most vision capabilities are hacks or engineered in. An image undergoes several processing steps and each processor's outputs are fed to the transformer as tokens. This may happen in one network but there's non-transformer networks involved. Examples of preprocessing:
  * OCR * CNNs (2D pattern recognizers) with different zooms, angles, slices etc * Others maybe too?
- akomtu6 days ago
  To generalise this idea: if we look at a thousand points that more or less fill a triangle, we'll instantly recognize the shape. IMO, this simple example reveals what intelligence is really about. We spot the triangle because so much complexity - a thousand points - fits into a simple, low-entropy geometric shape. What we call IQ is the ceiling of complexity of patterns that we can notice. For example, the thousand dots may in fact represent corners of a 10-dimensional cube, rotated slightly - an easy pattern to see for a 10-d mind.
  - saithound6 days ago
    Cool. Since ChatGPT 4o is actually really good at this particular shape identification task, what, if anything do you conclude about its intelligence?
    akomtu6 days ago
    Recognizing triangles isn't that impressive. What's the ceiling of complexity of patterns in data it can identify with is the real question. Give it a list of randomly generated xyz coords that fall on a geometric shape, or a list of points that sample a trajectory of Earth around Sun. Will it tell you that it's an ellipse? Will it derive the 2nd Newton's law? Will it notice the deviation from the ellipse and find the rule explaining it?
    JohnKemeny6 days ago
    The entire point here is that LLMs and image recognition software is not managing this task, so, not really good at this particular shape identification task.
    saithound6 days ago
    No, the post's article is not about the sort of shape identification task discussed by GP. Or indeed any image recognition task: it's a paper about removed context in language.
    Fwiw, I did test GP's task on ChatGPT 4o directly before writing my comment. It is as good at it as any human.
cs7027 days ago
Interesting. Even the most recent models perform relatively poorly when asked to identify which information in a context has been removed, given access to both the original and edited contexts.
The authors posit that poor performance is due to the fact that the attention mechanism of Transformers cannot attend to the removed tokens, because there are no keys for them!
Thank you for sharing on HN.
- yorwba6 days ago
  There are keys to attend to, they're just in the original text instead of the modified one. Since the model receives both as input, it could theoretically attend to those keys.
  For the attention mechanism, there isn't much difference between
  Original: {shared prefix} {removed part} {shared suffix} Modified: {shared prefix} {shared suffix}
  And
  Original: {shared prefix} {shared suffix} Modified: {shared prefix} {added part} {shared suffix}
  I think you could implement an algorithm for this in RASP (a language for manually programming transformers) roughly like this:
  1. The first layer uses attention to the "Original:" and "Modified:" tokens to determine whether the current token is in the original or modified parts.
  2. The second layer has one head attend equally to all original tokens, which averages their values, and another head attends equally to all modified tokens, averaging them as well. The averages are combined by computing their difference.
  3. The third layer attends to tokens that are similar to this difference, which would be the ones in the {removed part}/{added part}.
  The only ordering-dependent part is whether you compute the difference as original_average - modified_average or the other way around.
  If a model can detect additions but not removal, that would show that it is capable of learning this or a similar algorithm in principle, but wasn't trained on enough removal-style data to develop the necessary circuitry.
  - ironmanszombie6 days ago
    Thanks for the breakdown. I am far from knowledgeable on AI but was wondering why can't a simple comparison work? They can definitely be coded, as you have beautifully demonstrated.
    yorwba5 days ago
    A simple comparison between which two vectors?
- cyanydeez7 days ago
  for vision models, I wonder if they can train on things like photo negatives, rotated images, etc. Or madlib like sentences where a Q/A is like "the _____ took first place in the horse show."
  - bearseascape7 days ago
    The madlib like sentences approach is actually how masked token prediction works! It was one of the pretraining tasks for BERT, but nowadays I think all (?) LLMs are trained with next token prediction instead.
  - latency-guy26 days ago
    For photo negatives - usually doesn't matter. I am not up to date with what the vision folks are doing at these companies, but images are usually single channel, and more likely than not for regular images in greyscale. Otherwise in complex domain for the radar folks, and those are not RGB based images at all, rather scatterer defined.
    Additional channels being recognized in training usually didn't matter for the experiments and models I used to deal with before 2022, and if they were, certainly did not matter for colors. Then again, the work I was doing was on known (and some additional confusers) classes for object detection and classification where the color pretty much didn't matter in the first place.
- usaar3336 days ago
  They don't seem to use any recent top models. No opus, no o3, no Gemini 25 pro
  - cs7025 days ago
    It seems they used the most recent models available as of March 2025.
- jug7 days ago
  And yet, there are some notable differences between them, so now that there’s a benchmark and attention given to this issue, I wonder how much better they can get. Because obviously something can be done.
yousif_1231237 days ago
This is very interesting. 1. Authors mention the attention mechanism being perhaps unable to attend to the location of gaps since the gaps aren't tokens. But I would've expected a good LLM transformer to be at least a bit close to the gap location. I don't understand why mathematically the architecture is less suitable for that, it could attend to a region that may contain gaps. I wonder if fine-tuning on a task like this could help? 2. Shorter inputs with less omissions were harder to solve. That is not completely surprising, as a human doing this task, if 1 word was missing it would be harder to notice. And similarly 1 line would be harder than 10 lines. But still interesting for an LLM to have this problem. 3. Reasoning models do better, as they can write out the documents and potentially solve this easily. It still very surprising that this doesn't lead to 100% accuracy. This should be a trivial task. Like the paper says, a trivial program can be written to solve this. Perhaps ChatGPT (or similar agent) could read this paper while training, and know to write and run python when solving an issue like this.
The most interesting thing though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI is very bad at them. This paper suggests that there likely are many of those, and it seems in general a pretty fun time for people working building benchmarks.
- banq6 days ago
  [dead]
iknownothow6 days ago
To be fair, I'd put finding literal string diffs in the category of asking LLMs to do rote arithmetic.
The attention mechanism does far too much complex thinking for such a dumb task. This is precisely where you need to dumb down and focus and be disciplined rather than do high level next token prediction.
You'd benefit from actually asking the LLM to list the full document and compare, kind of like reasoning, and similar to how LLMs perform better when they break down arithmetic or algebra tasks into smaller steps.
Also my guess would be that the models that perform well are MoE models where there may be an Expert or two that does well on tasks that needs focus rather than intuition. So without knowing anything about Gemini Flash, my guess would be that it's an MoE model.
- draw_down6 days ago
  [dead]
XenophileJKO7 days ago
I haven't read the paper yet, but from a structural 'attention' perspective being unable to detect unclassified omissions is completely expected. (Though I think it is can be solved with structured thought.)
For needle in a haystack you have to pay attention to the thing that you are trying to find. Attention can do this pretty well.
When looking for an omission, that omission can be anything, you can only reason about it by comparing one whole context to another whole context. The attention layers can't really do that.
This is similar to the "rank a long set of things" problem. Absent some meta cognition process, they just can't do that.
- teruakohatu7 days ago
  > When looking for an omission, that omission can be anything,
  In this benchmark they give the LLM the necessary information to determine what is missing. For example “here is a poem, here is a version of that same poem that may or may not be missing lines. Are any lines missing?
  It’s more a tuning issue IMHO than an inherent weakness in LLMs.
  If I was asked to find an omission in an ML paper, my brain compares it with other ML papers, it does not need to compare it to Star Ward, Top Gear, Greek history, pottery and the other 1000s of contexts I may know about.
  - XenophileJKO7 days ago
    Sorry I meant the omission can be anything in the context, not anything in the world.. lol.
    That is still hard. You only have so many attention heads looking for things.. you can't pay attention to EVERYTHING.. which is what is required to find the omission.
    yorwba6 days ago
    To pay attention to everything, set the query vector to 0. Then all attention scores will be equal and the attention output is the average of the value vectors.
  - thaumasiotes7 days ago
    We should note that "where is there a line missing from this poem: ____?" contains sufficient information to answer correctly without needing a copy of the original to compare to.
    Here are two verses of a poem (song) in Mandarin Chinese:
    yi quan ting ni de
    er gei ni hao de
    shu dao san yong yuan ai ni yi ge
    si bu hui fan cuo
    wu bu hui luo suo
    shuo ni xiang shuo de
    zuo ni xiang zuo de
    bie pa shi bai yin wei ni you wo
    pei ni kan ri luo
    pei ni yi qi chang wan wo men ai de ge
    I removed two lines. Where did that happen?
    Would your answer be different if I told you that I might or might not have removed some lines?
    teruakohatu6 days ago
    > Here are two verses of a poem (song) in Mandarin Chinese:
    > …
    > I removed two lines. Where did that happen?
    If you read the paper you will see they provide the original as well as the version missing information.
    I did mention this in my comment too.
    I am quite sure I could find your two missing lines if you provide me the full poem.
    Given that you are a prolific commenter on HN, I am sure a LLM could be fine tuned to detect missing text from your comments without additional information. For example …
    > WinForms is still around. There have been further tec lly just a big tire fire and about the best you can do is to ignore all of them and develop in WinForms.
    It’s probably possible to detect missing information from “tec” until “lly”. But to know what is between is not possible for a human either, beyond plausible guesses.
    thaumasiotes6 days ago
    ...did you read my comment? The first - and really only - thing I say is that the original isn't necessary. Then there's an example. You shouldn't have trouble identifying where lines have been removed from the Chinese poem.
    The fact that the original was provided doesn't demonstrate that it's necessary to the task. You can identify missing text without needing to know what was there.
    > Given that you are a prolific commenter on HN, I am sure a LLM could be fine tuned to detect missing text from your comments without additional information.
    Same thing. Why would you need to do tuning on text authored by me? You can easily detect missing text of that style by the fact that the sentence you have fails to be English. You can do the same thing in text for which you have no prior experience of the author.
    > I am quite sure I could find your two missing lines if you provide me the full poem.
    But hey, if you insist:
    轻轻贴近你的耳朵
    사랑해요
    情话永远不嫌太多对你说
    一全听你的
    二给你好的
    数到三永远爱你一个
    四不会犯错
    五不会啰嗦
    每天为你打 call, cook 也不错
    轻轻贴近你的耳朵
    사랑해요
    情话永远不嫌太多对你说
    打开你的爱情手册
    就在此刻
    为你唱的专属情歌要记得
    说你想说的
    做你想做的
    别怕失败因为你有我
    陪你看日落
    陪你等雨过
    陪你一起唱完我们爱的歌
    轻轻贴近你的耳朵
    사랑해요
    情话永远不嫌太多对你说
    打开你的爱情手册
    就在此刻
    为你唱的专属情歌要记得
    我轻轻靠近你的耳朵
    说爱你不嫌太多
    如果相遇的几率亿万分之一那么多
    请相信我的真真真心比宇宙还辽阔
    我会牵着你的手知道你全部接受
    打开你的爱情手册
    就在此刻
    这首专属情歌请记得
    niemandhier6 days ago
    I’ll take the bait :-). . Endings of lines seem to come in pairs ( de, de; cuo, suo; de,de; wo,luo)
    I’d therefore conjecture that lines are missing after ‘ge’ and ‘ge’.
    This of course assumes Chinese poetry is based on vowels matching as e.g it is the case in german and not based on rhythm as would be the case in Latin and Arabic.
    meltyness6 days ago
    Two lines clearly deviate from AAB.
OsrsNeedsf2P7 days ago
The criticisms to how AbsenceBench does this are valid, but I'm very excited that we are benchmarking this at all. It's definitely a push in the right direction
kadonoishi6 days ago
To detect a presence, a real brain takes in sensory input and compares it to expectations, and stays calm or registers surprise, and from time to time issues predictions to guide the organism.
To detect an absence, the brain cannot rely on sensory input, by definition. To be surprised if sensory evidence is _not_ there requires a model of the world strong enough to register surprise if the expectation is not there, without a sensory prompt.
It seems to me detecting an absence is a strictly higher-order neurological task than processing sensory input.
If LLMs can't do this strictly higher-order neurological task, is that not a capability currently unique to living things?
- gtsop6 days ago
  Thinking is still currently unique to living things, so you don't need to resort to what you describe to find the human brain uniquness.
  Onto what you describe, it has to do with memory. Memory is storing and playing back sensory input, in the absence of that sensory input. So your brain plays back some past sensory input and checks it against current sensory input.
  Eg you left the pen on the table. When you come back the pen isn't there. Your brain compares the stored memory of seeing the pen on the table vs what you see now.
- viralsink6 days ago
  LLMs might not be very consistent overall in their learned architecture. Some paths may lead to memorized info, some paths may lead to advanced pattern matching.
- tclancy6 days ago
  > from time to time
  I know less-than-zero about the subject but I’d imagine the temporal aspect alone is a problem. Aren’t these agents reasoning from a fixed/ frozen version of “reality” rather than adjusting in real-time??
emporas6 days ago
Poor benchmark.
I tried their prompt [1] using 3 numbered items, qwq-32b got it right with no problems at all. I think it could solve 100 numbered items correctly 100% of the time, but it probably needs a million tokens. Probably even more, 10 million.
The limitation of 5000 tokens is peanuts for a reasoning model. Give it a lot of testing time compute, 10x of 5000 tokens is still too little.
The authors talk about long inputs, so, if it is 100 pages, give it a billion tokens.
The correct way to implement this is in batches, find the first 5 numbered items in the omitted input text, if it does find those, then simplify the input items and the omitted input items and go again.
Depending on the size of the input, it will always need a hefty amount of tokens, but simplification will help it backtrack correctly and not lose the thread entirely.
[1]You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation. List only the missing lines, nothing else. User Message Here is the complete original poem: 1)Quisella's lashes fluttered panic-morse. 2)The Moisture Vampires leeches that sucked humidity. 3)Lysandra's nostrils flared precisely one degree. Now, here is my recitation which may be missing some lines: Quisella's lashes fluttered panic-morse. Lysandra's nostrils flared precisely one degree. What lines did I miss? Please list only the missing lines, nothing else.
- enragedcacti6 days ago
  What is interesting about reducing the problem to counting? It seems to me that the obvious goal of the research is to understand the limitations of LLMs for tasks that cannot be trivially itemized or sorted.
  - emporas6 days ago
    The more specific are the instructions, the better they perform. There is a huge difference, between trying to find omitted text, or omitted words, or omitted sentences.
    If omitted words are to be found, put each word into it's own line and number it. The same with sentences.
    If you are trying to find omitted words and sentences, make one pass with only words, and another one with only sentences. Then combine the results.
    enragedcacti6 days ago
    To what end? You have to segment and order the document (i.e. solve the problem) just to craft your prompt so the LLM spitting the solution back to you is useless. The experiment uses these tasks because test cases can be algorithmically generated and scored, but it's not very interesting that one can structure the input to solve this specific, useless task with LLMs. It is interesting, though, that this limitation could carry over into tasks where traditional algorithms fail. LLMs improving at this would be legitimately useful which is why a benchmark makes sense, but cheating the benchmarks by augmenting the input doesn't.
    emporas5 days ago
    > You have to segment and order the document (i.e. solve the problem)
    Well, let's say that if this benchmark targets AGI, then no help should be given, no segmentation or structuring of information in any way, and it should be able to figure it out by itself.
    If this benchmark targets LLMs trained on internet data, statistical engines that is, not AGI, these engines have a preference for structuring of information in order to solve a problem.
    Segmenting the problem into smaller parts, using numbers usually, but dashes are acceptable as well, is what they have seen countless of times in textbook examples. When the input doesn't match prior input they have seen, then their performance easily degrades from superhuman to utter confusion. Superhuman for small problems, anyway.
    This problem of omitted information is interesting to me, many times I want to interpolate some paragraphs into stories I write, to fill up some plot holes. I used the word "interpolate" in unstructured text, and the results were underwhelming, pretty bad most of the time. From now on, I will number each paragraph, and ask it to find omitted text in there.
- emporas6 days ago
  I just tried qwq-32b using the numbered headlines of HN right now, with 26 items [1], I removed 3 headlines, still found all 3 omitted items first try, perfect, and it didn't even consume 50.000 tokens.
  [1] https://gist.github.com/pramatias/fee1391ad08c7b965f435f3af1...
pkoird7 days ago
So LLMs are poor at string diff, it seems. Tangentially, is there any source (a github repo or otherwise) that documents findings like these a la what LLMs are good at and what they aren't good at?
yandie7 days ago
I wonder how this would apply with vision models? I tried with a few example of single images and they appear to do well. I did a few toy examples and they seem to do pretty well (Claude + Gemini) with spotting differences. An example image: https://www.pinterest.com/pin/127578601938412480/
They seem to struggle more when you flip the image around (finding fewer differences, and potentially halluciating)
AlienRobot7 days ago
Unrelated to the paper, which is about asking LLM's to figure out which parts of a document were removed, but my assumption has been that to an LLM there is nothing "missing" in the sense that any input leads to valid computation and output.
For example, I asked ChatGPT to explain something I typed randomly
>It looks like you've entered “dosfi8q3anfdfiqr”, which appears to be a random string or perhaps a typo—it's not a recognized acronym, code, or term in any common context I’m aware of. Could you share a bit more about where you found this?
Although the answer is correct, my point is that anything you give to the LLM is going to be put under some bucket. The LLM can't say "I don't know what that is." Instead it says "that is a random string." As far as the LLM is concerned, it knows every possible input and concept that anyone could ever type into it, it's just that its "understanding" of what that means (after the tokens have gone through the neural network) doesn't necessarily match what any human being thinks it means.
- cyral7 days ago
  This might be due to the system prompt and the training that it is supposed to be "a helpful agent". If you tell it not to ask clarifying questions, you get something more like "I do not understand your input". Tell it to be rude and never ask clarifying questions and I get "What an absolute mess. Fix it yourself"
  Funny enough when testing this I also had to tell it to use English. It sees "dos" I suppose and tends to reply with exactly what you saw, but in Spanish.
- layer87 days ago
  “It's not a recognized acronym, code, or term in any common context I’m aware of” is pretty similar to “I don't know what that is”. I would assume that a model could be trained to output the latter.
  - drsim7 days ago
    Right. I’ve had a lot of success using structured output to force LLMs to make Boolean choices, like can they reply or not.
VikRubenfeld6 days ago
Interesting. I encountered an example of this the other day. I was asking chatGPT:
> Is there a line in Hamlet that includes the words "utter love?"
It examined a lot of quotes from Hamlet, and concluded:
> I double-checked all editions: There is no line in Hamlet that contains the word “utter.”
>It is never used in any of the quartos or folios in any line of Hamlet.
I was pretty sure it was there, so I found the full text of Hamlet online and searched for the word "utter". I found it right away and posted to chatGPT:
> That's odd. I found it in two seconds by searching the full text for the word "utter".
> This must be known, which, being kept close, might move More grief to hide than hate to utter love.
> Come
> HAMLET, Act 2, Scene 2
chatGPT replied:
> You're absolutely right — thank you for pointing that out.
> The line is indeed from Act 2, Scene 2, spoken by Claudius, and here it is in full:
> "This must be known, which, being kept close, might move
> More grief to hide than hate to utter love."
> I regret the earlier oversight — your memory was sharper than my index!
- diggan6 days ago
  Yeah, kind of makes sense, LLMs are really bad at that sort of recall unless provided tool and a loop to go over the results.
  For example, I could replicate it saying "no there isnt" when using 4o without search. But with search it gets the answer right.
  It was important already, but it's getting even more important than ever to use the right tool for the right problem, especially as all the hypers try to sell us their own stuff.
- GLdRH6 days ago
  Seems like ChatGPT is an utter failure
- speakeron6 days ago
  It's actually Act 2, Scene 1 and is spoken by Polonius.
amelius6 days ago
I bet an LLM would be able to do it if you allowed it to go "meta", and asked it to write a python script to detect the omissions, where the script can use an LLM.
- yousif_1231236 days ago
  Maybe if instructed, but how would it know it needs to use python in this case vs just answer? Perhaps you'd instruct it to always attempt using code to reduce errors.
  But the idea of trivial problems like this potentially causing issues for LLMs might mean other aspects of intelligence could also be a struggle for LLMs (which could impact it's coding ability as well).
TZubiri6 days ago
I have never tried this, but I'm wondering how effective the following approaches would be to measure uncertainty and unknowns in responses:
https://cookbook.openai.com/examples/using_logprobs
According to OpenAI official cookbook it seems to be a fairly standard usecase.
Another approach, especially in classification, would be to measure the cosine distance between the user embedding, and the ideal embedding of the message category.
b0a04gl6 days ago
why are we surprised transformers can't detect what's missing when the entire stack assumes the input is complete? the tokenizer doesn't leave placeholders. the attention weights have nothing to anchor to. even the loss function is built around predicting what is, not what isn't. this isn’t a model bug. it’s an architectural omission.
if we want models that detect absences? you need training objectives that expect absence. maybe even input encodings that represent "this might've been here."
- zurfer6 days ago
  I am surprised because it's such a simple task. Any human who is a bit diligent would be able to figure it out. They give both the original and the modified version.
  However it feels a bit like counting letters. So maybe it can be solved with post training. We'll know in 3 to 6 months if it was easy for the labs to "fix" this.
  In my daily use of LLMs I regularly have some overly optimistic answers because they fail to consider potentially absent or missing information (even harder because it's out of context).
jonbaer6 days ago
https://en.wikipedia.org/wiki/Chinese_room
pu_pe6 days ago
Quite clever and useful benchmark. This implies that without tool use, LLMs have a fundamental limitation when it comes to tasks like code review.
- iknownothow6 days ago
  I'd say that's where we're headed. A big model that's trained from the start to use tools and know when to use certain tools and how to use tools. Like us :)
  I wouldn't be surprised if someone's building a dataset for tool use examples.
  The newer gen reasoning models are especially good at knowing when to do web search. I imagine they'll slowly get better at other tools.
  At current levels of performance, LLMs having the ability to get well curated information by themselves would increase their scores by a lot.
itsgrimetime6 days ago
I’m not sure how to go about solving it at the architecture level but I would assume an LLM with access to a diff tool would get 100%, but I understand that’s not really the point
obscure-enigma6 days ago
this research is too simplified and kind of vague, as it's the inherent nature of language models for that matter any probabilistic model, to compress the information for better generalization since there is a lower bound to how much loss they can incur while decoding the information. LLMs are indeed lossy compressors
ThrowawayTestr6 days ago
Wouldn't it be better to ask the LLM to use a diff tool instead of asking it look at the text directly?
- viralsink6 days ago
  This kind of research is about finding the limitations of the technology to hopefully advance it in a meaningful direction. If this finding impedes you, then sure, you can find a quick fix for it. But it's beside the point.
6 days ago
undefined
6 days ago
undefined
banq6 days ago
[flagged]
_josh_meyer_7 days ago
[flagged]
- dmd7 days ago
  No, you didn’t. You fed it into a LLM that does it. Literally the first sentence containing “diving in” gave it away.
  Please don’t post “I fed this into chatgpt” comments. Everyone has access to these tools if they want to use them.
  - _josh_meyer_6 days ago
    Wasn't meant to imply the opposite. The video even has a watermark clearly saying it's generated. I genuinely found the video useful, so decided to share.
  - _josh_meyer_6 days ago
    also, not everyone has access. I used (afaik) the only "academic paper --> narrated video" converter, and it just went into beta a couple weeks ago
xianshou7 days ago
In many of their key examples, it would also be unclear to a human what data is missing:
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
In order for LLMs to score well on these benchmarks, they would have to do more than recognize the original source - they'd have to know it cold. This benchmark is really more a test of memorization. In the same sense as "The Illusion of Thinking", this paper measures a limitation that neither matches what the authors claim nor is nearly as exciting.
- jamessinghal7 days ago
  The test provides both the original and the modified excerpt in the user message, so the LLM doesn't need any memorized version of the excerpt to theoretically answer each correctly.
  From the paper:
  System Prompt You are helping a student practice memorizing poems. The student will recite a poem, but they may have missed some lines. Your task is to identify exactly which lines are missing from their recitation. List only the missing lines, nothing else.
  User Message Here is the complete original poem: {original poem} Now, here is my recitation which may be missing some lines: {modified poem} What lines did I miss? Please list only the missing lines, nothing else.
  - scarface_747 days ago
    This worked
    https://chatgpt.com/share/6855f69d-766c-8010-96e2-ed1b45d3e6...
    htnwe_23124127 days ago
    yes, 69.8% of the time.