The core problem: My friend is an accountant. Every tax season he manually ctrl+F's through dozens of client PDFs (T4s, RRSP receipts, bank statements) to extract numbers. I watched him spend 20 minutes finding one RRSP contribution total across 4 documents.
So I built this. Upload multiple PDFs, ask "what's the total RRSP contribution?" and get the answer with yellow highlights on the exact source text.
Technical stack:
- Backend: FastAPI + pdfplumber for text extraction
- PDF rendering: react-pdf with custom highlight overlay positioning
- Chunking: Layout-aware splitting that respects tables and preserves bbox coordinates
- Retrieval: Hybrid approach (BM25 + semantic embeddings + numeric normalization for currency/percentages)
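For anyone curious about the numeric normalization piece, here's a minimal sketch of the idea, not the app's actual code. It assumes the goal is to map differently formatted amounts ("$50,000.00" vs "50000") to one canonical key so a keyword match can line them up. The names normalize_number and numeric_keys are made up for this example.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional, Set

# Hypothetical pattern: an optional "$", digits with optional thousands
# separators, an optional decimal part, and an optional trailing "%".
_NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")


def normalize_number(token: str) -> Optional[str]:
    """Return a canonical string for a currency/percentage token, or None."""
    cleaned = token.strip()
    is_percent = cleaned.endswith("%")
    cleaned = cleaned.replace("$", "").replace(",", "").replace("%", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    canonical = format(value.normalize(), "f")
    return canonical + "%" if is_percent else canonical


def numeric_keys(text: str) -> Set[str]:
    """Collect the canonical numeric keys found in a chunk or a query."""
    keys = set()
    for match in _NUMBER_RE.finditer(text):
        key = normalize_number(match.group())
        if key is not None:
            keys.add(key)
    return keys
```

Indexing these keys alongside the raw chunk text means a query containing "50000" can still hit a chunk that prints "$50,000.00". Percentages keep their "%" suffix so 7.5% and $7.50 don't collide.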
The hard part was highlight precision. Early versions highlighted entire pages. The current version targets specific values (e.g., "$50,000.00"): it extracts highlight_targets from the LLM response and matches them against chunk bboxes.
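Here's a rough sketch of what the matching step could look like, under some assumptions. It takes words as dicts with "text", "x0", "x1", "top" and "bottom" keys (the shape pdfplumber's extract_words() returns) and looks for a run of consecutive words that spells out the target. The function name find_target_bbox is hypothetical, and the real matching in the app may work differently.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def _canon(text: str) -> str:
    """Lowercase and drop whitespace so targets match across word splits."""
    return "".join(text.lower().split())


def find_target_bbox(words: List[dict], target: str) -> Optional[BBox]:
    """Return the union bbox of the first run of words that spells `target`."""
    wanted = _canon(target)
    if not wanted:
        return None
    for start in range(len(words)):
        joined = ""
        for end in range(start, len(words)):
            joined += _canon(str(words[end]["text"]))
            if joined == wanted:
                span = words[start:end + 1]
                return (
                    min(w["x0"] for w in span),
                    min(w["top"] for w in span),
                    max(w["x1"] for w in span),
                    max(w["bottom"] for w in span),
                )
            if not wanted.startswith(joined):
                break  # this run can no longer spell the target
    return None
```

The returned box is in PDF page coordinates, so the frontend overlay still has to scale it to the rendered page size. One known gap in this sketch: a word with trailing punctuation (like "$50,000.00.") won't match unless the punctuation is stripped first.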
Free tier: 10 queries/month. Would love feedback from anyone who deals with multi-document PDF workflows.