Especially with Gemini Pro, when providing long-form textual references, putting many documents in a single context window gives worse answers than having it summarize the documents first, asking a question about the summaries only, and then providing the full text of the sub-documents on request (RAG-style, or just a simple agent loop).
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself gets worse, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.
Long story short: context engineering is still king, and RAG is not dead.
LLMs will need RAG one way or another; you can hide it from the user, but it still has to be there.
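To make the first point concrete, here is a minimal sketch of that summarize-first loop, assuming a hypothetical complete() chat helper and made-up prompts (none of this is from the paper):

    # Summarize-first loop: the model only sees summaries up front and pulls
    # full documents in on demand. complete() stands in for whatever
    # chat-completion call you actually use.

    def summarize(doc: str) -> str:
        return complete([{"role": "user",
                          "content": f"Summarize this document in ~5 bullets:\n\n{doc}"}])

    def answer(question: str, docs: dict[str, str]) -> str:
        summaries = "\n\n".join(f"[{name}]\n{summarize(text)}" for name, text in docs.items())
        messages = [{"role": "user", "content":
                     f"Document summaries:\n\n{summaries}\n\nQuestion: {question}\n"
                     "If a summary is not enough, reply with FETCH:<name> to get the full text."}]
        for _ in range(5):  # small hop budget for the agent loop
            reply = complete(messages)
            if reply.startswith("FETCH:"):
                name = reply.split(":", 1)[1].strip()
                messages.append({"role": "assistant", "content": reply})
                messages.append({"role": "user", "content": docs.get(name, "Unknown document.")})
            else:
                return reply
        return reply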
The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?
It's actually even more significant than it's possible to benchmark easily (though I'm glad this paper has done so).
Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.
I suspect that the context rot problem gets much worse for these more complex tasks... in fact, exponentially so for each logical "hop" which is required to answer successfully. Each hop compounds the "attention difficulty" which is increased by long/distracting contexts.
The best results seem to be from clear, explicit instructions and plan up front for a discrete change or feature, with the relevant files to edit dragged into the context prompt.
Instead I have a good instance going, but the model fumbles for 20k tokens and then that session is heavily rotted. Let me cut that part out!
LLMs-as-a-service don't offer this because it makes it trivial to bypass their censoring.
I'm sure it's all my poor prompting and context, but it really seems like Claude has lost 30 IQ points over the last few weeks.
Does this not feel like gaslighting we've all now internalized?
One paper that stood out to me a while back was Many-Shot In-Context Learning[1] which showed large positive jumps in performance from filling the context with examples.
As always, it’s important to test one’s problem to know how the LLM changes in behavior for different context contents/lengths — I wouldn’t assume a longer context is always worse.
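For reference, many-shot ICL just means packing a large number of worked examples into the prompt ahead of the real query; here's a rough sketch with invented data (not from the paper):

    # Many-shot in-context learning: hundreds of worked examples in the prompt,
    # instead of the usual handful of shots.
    examples = [  # in practice, hundreds of (input, label) pairs
        ("The movie was dull.", "negative"),
        ("Loved every minute.", "positive"),
    ]

    def build_many_shot_prompt(query: str) -> str:
        shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
        return f"{shots}\n\nInput: {query}\nOutput:"

    # Send build_many_shot_prompt("A forgettable sequel.") to whatever model you use.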
ICL is a phenomenon separate from long-context performance degradation; the two can coexist, similarly to how lost-in-the-middle affects the performance of examples at different positions just the same.
It really depends on the task, but I imagine most real world scenarios have a mixed bag of requirements, such that it's not a needle-in-a-haystack problem, but closer to ICL. Even memory retrieval (an example given in the post) can be tricky because you cannot always trust cosine similarity on short text snippets to cleanly map to relevant memories, and so you may end up omitting good data and including bad data (which heavily skews the LLM the wrong way).
[1]: Coincidentally what the post author is selling
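On the cosine-similarity point, here's a toy sketch of why a hard threshold over short snippets can misbehave (the embeddings and threshold are placeholders, not anything from the post):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query_vec, memories, threshold=0.8):
        # memories: list of (text, embedding) pairs from any embedding model.
        # Short snippets often score high on surface overlap rather than actual
        # relevance, so a fixed threshold can admit distractors and drop the
        # memory you actually wanted.
        return [text for text, vec in memories if cosine(query_vec, vec) >= threshold]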
A built-in reasoning chain certainly helps in long-context tasks, especially when it's largely trained to summarize the context and deconstruct the problem, as in Gemini 2.5 (you can easily jailbreak it to see the native reasoning chain that's normally hidden between system delimiters) and DeepSeek R1-0528, or when you force it to summarize with a custom prompt/prefill. The article seems to agree.
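A minimal sketch of the prefill idea, assuming an Anthropic-style messages API (model name, context, and prompts are placeholders): ending the message list with a partial assistant turn forces the reply to continue from a summary.

    # Force the model to restate the long context before answering by
    # prefilling the start of its reply.
    messages = [
        {"role": "user", "content": long_context + "\n\n" + question},
        # Prefill: the model continues from this text, so it has to begin by
        # summarizing and decomposing the problem before it answers.
        {"role": "assistant", "content": "Summary of the relevant context:\n-"},
    ]
    reply = client.messages.create(model=MODEL, max_tokens=1024, messages=messages)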
Media literacy disclaimer: Chroma is a vectorDB company.
I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with 128k context.
I imagined they performed special recall training for these long-context models, but the results seem... not so great.
My hunch would be that even if we had a lot more annotated examples of reasoning and retrieval over 10,000+ tokens, the architectures we have today would still be limited.
Having an LLM recall something with exact detail from some 100k tokens ago sounds a bit like the ADHD test Cartman got in South Park. We don't recall exactly, but rather a summarized version.
On the other hand, computers recall exactly when asked directly (RAM access), so in that sense it seems natural to want that from an LLM.
One thing we can do which current LLMs can't, at least directly as far as I'm aware, is to go back and re-read a section. Like on-demand RAG, or something.
In the meantime, good to know it's not terribly useful to have the full 128k context, as it usually is too much for my GPU anyway.
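The "re-read a section on demand" idea maps pretty directly onto a tool in an agent loop; a rough sketch (the tool name and chunking are my own invention):

    # Give the model a tool to re-read any span of the source verbatim instead
    # of keeping the whole 128k context resident.
    CHUNK = 2000  # characters per section; purely illustrative

    def split_sections(text: str) -> list[str]:
        return [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)]

    def reread_section(sections: list[str], index: int) -> str:
        """Tool handler: return one section verbatim when the model asks for it."""
        if 0 <= index < len(sections):
            return sections[index]
        return f"No such section; valid range is 0..{len(sections) - 1}."

    # Keep only per-section summaries in the prompt, expose reread_section via
    # your tool-calling schema, and let the model fetch exact wording as needed.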
Encoders can do that. And we can use them with diffusion to generate text [0].
This works because you don't impose masked self-attention for autoregressive decoding in the encoder, so subsequent layers can re-focus their key/query vector space to steer information flow "backwards".
Happy reading. Feel free to get back!
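To see the difference concretely, here's a tiny numpy sketch (toy dimensions, my own illustration) of the single change involved: a decoder applies a causal mask before the softmax, an encoder doesn't, which is what lets later layers route information backwards:

    import numpy as np

    def attention(q, k, v, causal: bool):
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)  # (T, T) token-to-token scores
        if causal:
            # Decoder: token i may only attend to tokens <= i.
            scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    T, d = 6, 8
    x = np.random.randn(T, d)
    dec = attention(x, x, x, causal=True)   # autoregressive decoder behaviour
    enc = attention(x, x, x, causal=False)  # encoder: every token sees every token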
https://www.notion.so/LLM-Context-Engineering-21b814d6a64980...
Some of these are in use in an in-house AI chat application that has a heavy emphasis on tool calls.
It may be that dimension-starved pretrained transformer models rely heavily on context being correctly "tagged" in all relevant aspects the very instant it's inserted into the KV cache, e.g. necessitating negation to be prefixed to a fact instead of allowing post-fix negation. The common LLM chat case is telling the model it just spewed hallucinated/wrong claims, and hoping this will help rather than hurt downstream performance as the chat continues. There the negation is very delayed, and thus not present in most of the tokens that encode the hallucinated claims in the KV cache; and so, for lack of sufficient positional precision due to insufficient dimensionality, the transformer can't retroactively attribute the "that was wrong" claim to the hallucination tokens in a retrievable manner.
The result, of course, is the behavior we experience: hallucinations are corrected by editing the message that triggered them to include discouraging words, as otherwise the thread becomes near-useless from the hallucination context pollution.
I do wonder whether we've figured out how to do this more scalably than just naively raising the query dimension to get (back?) closer to sequence length.
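For what the "edit the triggering message" workaround looks like in practice, a rough sketch with invented message contents (not tied to any particular API):

    # Appending a correction leaves the hallucinated tokens "untagged" in the KV
    # cache for the rest of the thread; editing the triggering message and
    # regenerating from there avoids that.
    history = [
        {"role": "user", "content": "How do I configure the frobnicate flag?"},
        {"role": "assistant", "content": "Use --frobnicate=turbo ..."},  # hallucinated
    ]

    # Weaker: append a correction and continue the polluted thread.
    appended = history + [
        {"role": "user", "content": "That flag doesn't exist. Try again."},
    ]

    # Stronger: rewrite the original user turn to pre-empt the failure mode,
    # drop the bad reply, and regenerate.
    edited = [
        {"role": "user", "content": "How do I configure the frobnicate flag? "
                                    "Only mention flags you are certain exist, "
                                    "and say so if you are not sure."},
    ]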