Show HN: An offline document search engine for my university's messy PDFs (github.com)

2 points by Yigtwx 7 hours ago | 1 comment

Yigtwx 7 hours ago
Hi HN,
I’m a 3rd-year software engineering student. Every semester, I waste hours trying to find specific rules, passing grades, or course details buried inside my university’s endless, poorly formatted PDFs. I just wanted a fast, completely offline way to search through them, so I built this.
Instead of jumping straight into a heavy LLM or RAG setup, I decided to keep it simple and lightweight. The backend is built with FastAPI. I used `pdfplumber` to extract the text from the PDFs (which is a nightmare on its own) and implemented BM25 for the core search engine.
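For anyone who hasn't used BM25 before, here is a minimal sketch of the ranking idea in pure Python. It is illustrative only and not the code from the repo: the names `tokenize` and `BM25Index`, the `k1=1.5` and `b=0.75` defaults, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on runs of word characters.
    return re.findall(r"\w+", text.lower())


class BM25Index:
    """Tiny in-memory Okapi BM25 index (illustrative sketch)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation (assumed default)
        self.b = b    # strength of length normalization (assumed default)
        self.n_docs = len(docs)
        self.doc_freqs = [Counter(tokenize(d)) for d in docs]
        self.doc_lens = [sum(f.values()) for f in self.doc_freqs]
        avg = sum(self.doc_lens) / max(self.n_docs, 1)
        self.avg_len = avg if avg > 0 else 1.0  # avoid division by zero
        # Document frequency: in how many documents each term appears.
        self.df = Counter()
        for freqs in self.doc_freqs:
            self.df.update(freqs.keys())

    def idf(self, term):
        # Smoothed IDF that stays positive even for very common terms.
        n = self.df.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, doc_id):
        freqs = self.doc_freqs[doc_id]
        # Longer-than-average documents are penalized through this term.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=5):
        # Return (doc_id, score) pairs, best first, skipping non-matches.
        scored = [(self.score(query, i), i) for i in range(self.n_docs)]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in scored[:top_k]]


docs = [
    "passing grade for the course is 60",
    "course schedule and exam dates",
    "library opening hours",
]
index = BM25Index(docs)
results = index.search("course exam")
```

In a setup like mine, each "document" would probably be a page or paragraph of extracted PDF text rather than a whole file, so that a hit points at something small enough to read.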
It works completely offline and handles Turkish text surprisingly well for a pure retrieval system. It does exactly what I need without the latency, hardware requirements, or hallucinations of running local AI models.
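One detail that tends to matter for Turkish retrieval is case folding, because the generic Unicode lowercase mapping treats `I` and `İ` differently from how Turkish does. Below is a small sketch of a Turkish-aware lowercasing step that could sit in front of a tokenizer. The function names `turkish_lower` and `tokenize_tr` are hypothetical, and this is not necessarily what the repo does.

```python
import re


def turkish_lower(text):
    # In Turkish, "İ" lowercases to "i" and "I" lowercases to "ı".
    # Python's str.lower() instead maps "I" to "i" and "İ" to "i" plus a
    # combining dot (U+0307), so handle both letters before calling lower().
    return text.replace("İ", "i").replace("I", "ı").lower()


def tokenize_tr(text):
    # Hypothetical tokenizer: Turkish-aware lowercase, then split on word characters.
    return re.findall(r"\w+", turkish_lower(text))


tokens = tokenize_tr("IŞIK ve İSTANBUL")
```

The trade-off is that this mapping is wrong for non-Turkish words in the same document (for example, "INFO" becomes "ınfo"), so it only makes sense when the corpus and the queries are both predominantly Turkish and are normalized the same way.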
It’s my first time properly using BM25 for a real-world problem, so the codebase might still be a bit rough.
If anyone has war stories or tips on extracting clean text from terribly formatted academic PDFs, or ways to improve BM25 search relevance for a specific language without bloating the system, I’d love to hear them!