2 points by Yigtwx 7 hours ago | 1 comment
  • Yigtwx 7 hours ago
    Hi HN,

    I’m a 3rd-year software engineering student. Every semester, I waste hours trying to find specific rules, passing grades, or course details buried inside my university’s endless, poorly formatted PDFs. I just wanted a fast, completely offline way to search through them, so I built this.

    Instead of jumping straight into a heavy LLM or RAG setup, I decided to keep it simple and lightweight. The backend is built with FastAPI. I used `pdfplumber` to extract the text from the PDFs (which is a nightmare on its own) and implemented BM25 for the core search engine.
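
    For anyone who hasn't seen BM25 written out, here is a minimal sketch of the extract-then-rank idea. It is not the code from the project: `extract_pages`, the `BM25` class, the whitespace tokenizer, and the `k1`/`b` defaults are all illustrative assumptions, and each page is simply treated as one document.

```python
import math
from collections import Counter


def extract_pages(pdf_path):
    """Return one text string per page (hypothetical helper, needs pdfplumber)."""
    import pdfplumber  # imported lazily so the ranking code runs without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer
        return [page.extract_text() or "" for page in pdf.pages]


def tokenize(text):
    # Naive whitespace tokenizer; a real system would want something smarter.
    return text.lower().split()


class BM25:
    """Okapi BM25 over an in-memory list of documents (illustrative sketch)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / len(docs) if docs else 0.0
        self.tf = [Counter(t) for t in tokens]  # term frequencies per document
        df = Counter()  # number of documents containing each term
        for t in tokens:
            df.update(set(t))
        n = len(docs)
        # The "+ 1" inside the log keeps IDF positive for very common terms.
        self.idf = {
            term: math.log((n - f + 0.5) / (f + 0.5) + 1) for term, f in df.items()
        }

    def score(self, query, i):
        if self.avgdl == 0:
            return 0.0
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_k=5):
        scored = [(self.score(query, i), i) for i in range(len(self.tf))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(i, s) for s, i in scored[:top_k]]


docs = [
    "passing grade for the course is 60",
    "library opening hours",
    "course registration rules and passing conditions",
]
results = BM25(docs).search("passing grade")
print([i for i, _ in results])
```

    In practice `docs` would come from something like `extract_pages("rules.pdf")` (the file name is made up), and indexing per page or per paragraph instead of per file keeps the hits specific enough to be useful.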

    It works completely offline and handles Turkish text surprisingly well for a pure retrieval system. It does exactly what I need without the latency, hardware requirements, or hallucinations of running local AI models.
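
    One detail that tends to matter for Turkish in a setup like this is case folding: Python's default `"IŞIK".lower()` gives `"işik"` rather than `"ışık"`, so a query and a document can silently fail to match. A minimal sketch of a Turkish-aware tokenizer, assuming hypothetical helpers `tr_lower` and `tokenize_tr` (not taken from the project):

```python
import re
import unicodedata

# Map the two uppercase letters that str.lower() handles wrongly for Turkish.
_TR_UPPER_MAP = str.maketrans({"İ": "i", "I": "ı"})
_TOKEN_RE = re.compile(r"\w+")


def tr_lower(text):
    """Lowercase using Turkish rules for dotted and dotless i."""
    # NFC first, so letters stored as base + combining mark become one character.
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TR_UPPER_MAP).lower()


def tokenize_tr(text):
    """Split into lowercase word tokens, dropping punctuation."""
    return _TOKEN_RE.findall(tr_lower(text))


print(tokenize_tr("Geçme Notu: 60, İKİNCİ SINIF"))
```

    The same function has to be applied to both the indexed text and the query, otherwise the normalization makes matching worse instead of better.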

    It’s my first time properly using BM25 for a real-world problem, so the codebase might still be a bit rough.

    If anyone has war stories or tips on extracting clean text from terribly formatted academic PDFs, or ways to improve BM25 search relevance for a specific language without bloating the system, I’d love to hear them!