Happy to answer questions. The core idea: generate a golden QA dataset from your docs, then use LLM-as-judge to score your RAG's answers. Works fully offline with Ollama — confirmed end-to-end on Colab free tier yesterday. The 1.5b model is just for the demo pipeline; use llama3.1:8b (or larger) locally for production-quality results.
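For anyone curious what the LLM-as-judge step looks like in practice, here's a minimal sketch of the two offline pieces: building a judge prompt and parsing a score out of the judge's reply. All names and the prompt template are my own illustration, not the project's actual code; in the real pipeline the prompt would be sent to Ollama's local HTTP API (`POST http://localhost:11434/api/generate`).

```python
import json
import re

def build_judge_prompt(question, reference, candidate):
    # Hypothetical grading template -- not the project's actual prompt.
    return (
        "You are a strict grader. Compare the candidate answer to the reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        'Reply with JSON only: {"score": <1-5>, "reason": "<one sentence>"}'
    )

def parse_judge_reply(reply):
    # Judge models often wrap JSON in extra prose; grab the first {...} block.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        score = int(json.loads(match.group(0))["score"])
        return score if 1 <= score <= 5 else None
    except (ValueError, KeyError, json.JSONDecodeError):
        return None

# The prompt below would go to your local model (e.g. llama3.1:8b via Ollama);
# here we just parse a hard-coded example reply to show the scoring path.
prompt = build_judge_prompt("What port does Ollama listen on?", "11434",
                            "Port 11434 by default.")
print(parse_judge_reply('{"score": 5, "reason": "Matches the reference."}'))  # → 5
```

The JSON-with-regex-fallback parsing matters more than it looks: small local models frequently add chatter around the JSON, so a strict `json.loads` on the raw reply will silently drop a lot of scores.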