Whenever RAG answers felt wrong, my instinct was always to tweak the model: embeddings, chunking, prompts, the usual.
At some point I looked closely at what the system was actually retrieving, and at the corpus it was based on: the content was contradictory, incomplete in places, and in some cases out of date.
Most RAG observability today focuses on the model: token counts, latency, answer quality scores, overall performance, and so on. So in my latest RAG experiment I set out to see whether documentation failure modes could be detected deterministically from telemetry, tracking things like:
- version conflicts in retrieved chunks
- vocabulary gaps: terms in the question that don't appear anywhere in the corpus
- knowledge gaps: questions the docs couldn't answer correctly
- unsupported feature questions
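
To make that concrete, here is a minimal sketch (not my actual implementation) of what the first two checks could look like as deterministic functions over retrieved chunks. The `Chunk` dataclass and its `version` field are assumptions about how corpus metadata might be structured, not something from the experiment itself.

```python
# Sketch of deterministic documentation-health checks over retrieved chunks.
# Field names like `version` and `doc_id` are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str   # which page/section the chunk came from
    version: str  # product version the doc section describes
    text: str


def version_conflicts(retrieved: list[Chunk]) -> set[str]:
    """Return the distinct doc versions present in one retrieval batch.

    More than one version in the same batch is a signal that the answer
    may be mixing contradictory documentation.
    """
    return {c.version for c in retrieved}


def vocabulary_gaps(query: str, corpus_vocab: set[str]) -> list[str]:
    """Return query terms that never appear anywhere in the corpus.

    A non-empty result means retrieval can only match loosely related
    text, which usually points to a documentation gap, not a model bug.
    """
    terms = [t.strip(".,?!").lower() for t in query.split()]
    return [t for t in terms if t and t not in corpus_vocab]


# Example: emit one telemetry event per query alongside the usual
# model-centric metrics, then feed it into whatever tracing backend
# you already use.
if __name__ == "__main__":
    retrieved = [
        Chunk("install.md#setup", "v2", "Run `tool init` to set up."),
        Chunk("legacy/setup.md", "v1", "Run `tool configure` to set up."),
    ]
    corpus_vocab = {"run", "tool", "init", "configure", "to", "set", "up"}

    query = "how do I enable autoscaling?"
    event = {
        "query": query,
        "versions_seen": sorted(version_conflicts(retrieved)),
        "missing_terms": vocabulary_gaps(query, corpus_vocab),
    }
    print(event)
```

Because these checks are pure functions of the query and the retrieved chunks, they can run on every request and be aggregated per page or section, which is what makes them useful as documentation telemetry rather than one-off evals.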
So what would it look like if we could actually observe and trace documentation health, and potentially use those signals to improve the documentation itself?
I wrote up the experiment in more detail here on Substack.
I’m actually curious: has anyone else noticed this pattern when working with RAG over real docs, and if so, how did you trace the issue back to specific pages or sections that need updating?