3 points by colon-md 5 hours ago | 1 comment
  • colon-md 5 hours ago
    Last week, I read Karpathy's gist on building a personal wiki LLM (https://gist.github.com/karpathy/442a6bf555914893e9891c11519...) and decided to try it.

    The RAG pitch is: take your own corpus of docs, layer an LLM over it, and get a system that answers questions grounded in your own material. The wiki+RAG hybrid is the interesting architectural variant.

    So I started building the "traditional" retrieval architectures (pure dense, BM25, hybrid RRF, rerank) to pit against the wiki+RAG variant, which layers structure over the chunks.
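
    For anyone unfamiliar with the "hybrid RRF" baseline: reciprocal rank fusion merges the BM25 and dense rankings by rank position alone, no score normalization needed. A minimal sketch (the k=60 constant is the common default from the RRF literature, not something specified in the post):

```python
def rrf(rankings, k=60):
    """Fuse multiple ranked lists of doc ids via reciprocal rank fusion.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    higher is better. `rankings` is a list of ranked id lists.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: BM25 and dense retrieval disagree on ordering.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d4", "d3"]
print(rrf([bm25_ranking, dense_ranking]))  # → ['d1', 'd3', 'd4', 'd7']
```

    The appeal of RRF over weighted score fusion is that BM25 scores and cosine similarities live on incompatible scales, while ranks are always comparable.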

    After a few days of code cleanup I have an eval testbench, and the wiki LLM is only 50% built. I'm releasing the testbench now because I think it's just as valuable as the RAG design itself.

    What the repo does: runs four hosted RAG services against identical inputs (same 81-doc enterprise corpus, same 50 questions stratified across single-hop / multi-hop / contradiction / unanswerable, same retrieve-only scoring of 0.7×recall + 0.3×precision):

      - Azure AI Search: 84.0  (recall 90.9%, precision 67.8%)
      - Vertex AI RAG Engine: 82.6  (94.5%, 54.7%)
      - Bedrock Knowledge Bases: 82.5  (87.9%, 70.1%)
      - OpenAI File Search: 78.5  (89.3%, 53.4%)
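
    The retrieve-only score above is just a weighted blend of chunk-level recall and precision against gold chunks. A sketch of the metric as I understand it from the formula (the set-based hit counting is my assumption about how the testbench scores, not confirmed detail):

```python
def retrieve_score(retrieved_ids, gold_ids):
    """Retrieve-only score: 0.7 * recall + 0.3 * precision, scaled to 0-100.

    `retrieved_ids` are the chunk ids the service returned;
    `gold_ids` are the annotated relevant chunks for the question.
    """
    hits = len(set(retrieved_ids) & set(gold_ids))
    recall = hits / len(gold_ids) if gold_ids else 0.0
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    return 100 * (0.7 * recall + 0.3 * precision)

# Toy example: both gold chunks retrieved, but half the results are noise.
print(retrieve_score(["a", "b", "c", "d"], ["a", "b"]))  # recall 1.0, precision 0.5 → 85.0
```

    Sanity check against the table: 0.7 × 90.9 + 0.3 × 67.8 ≈ 84.0, matching the Azure row. The recall-heavy weighting reflects that a downstream LLM can ignore irrelevant chunks but can't recover a missing one.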
    
    Here's a surprise finding (maybe not a surprise to you): all four major RAG services hallucinate on every unanswerable question: 0/5 abstention correctness across the board. I was sort of expecting enterprise RAG providers like GCP, AWS, Azure, and OpenAI to respond "I don't know" to unanswerable questions.
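
    For reference, the abstention check can be as crude as string matching on refusal phrases. This marker list is purely illustrative, not the testbench's actual heuristic:

```python
# Hypothetical refusal markers; a real grader would likely use an
# LLM judge or a richer phrase set.
ABSTAIN_MARKERS = (
    "i don't know",
    "not enough information",
    "cannot find",
    "no information",
)

def abstained(answer: str) -> bool:
    """Return True if the answer declines rather than fabricating a response."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)

print(abstained("I don't know based on the provided documents."))  # True
print(abstained("The Q3 revenue target was $4.2M."))               # False
```

    On this kind of check, a correct abstention on an unanswerable question scores 1 and any fabricated answer scores 0, which is how you end up with a 0/5 when a service always answers.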