Queries hit the same embedding model, and the system pulls the top-K most similar chunks by cosine similarity. There's a hybrid search mode too: it over-fetches 5x candidates from the vector index in parallel with a keyword search (full-text search via a GIN-indexed tsvector column, falling back to trigram ILIKE when FTS returns too few results). Results are merged with a slot-reservation scheme: 60% of the final top-K comes from vector results ranked by cosine similarity, with up to 40% reserved for keyword-only matches the vector search missed. Retrieved chunks get stuffed into a prompt with source metadata and sent to Claude Sonnet or GPT-4o with instructions to cite sources in bracket notation.
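To make the merge concrete, here's a minimal sketch of the slot-reservation step in Python. The helper names (`vector_search`, `keyword_search`, `merge_hybrid`) are mine for illustration, not the actual code:

```python
def merge_hybrid(vector_hits: list[str], keyword_hits: list[str], k: int = 10) -> list[str]:
    """Slot-reservation merge: ~60% of the top-K from vector results,
    with up to ~40% reserved for keyword-only matches the vector search missed."""
    vector_slots = max(1, int(k * 0.6))   # e.g. 6 of 10 slots for vector hits
    keyword_slots = k - vector_slots      # e.g. 4 of 10 reserved for keyword-only hits

    merged = list(vector_hits[:vector_slots])
    seen = set(merged)

    # Fill the reserved slots with keyword matches the vector side didn't return.
    for chunk_id in keyword_hits:
        if len(merged) >= vector_slots + keyword_slots:
            break
        if chunk_id not in seen:
            merged.append(chunk_id)
            seen.add(chunk_id)

    # If keyword search came up short, backfill from the remaining vector hits.
    for chunk_id in vector_hits[vector_slots:]:
        if len(merged) >= k:
            break
        if chunk_id not in seen:
            merged.append(chunk_id)
            seen.add(chunk_id)

    return merged[:k]


# Usage sketch: over-fetch 5x candidates from each side, then merge down to top-K.
# vector_hits  = vector_search(query_embedding, limit=k * 5)   # pgvector, cosine
# keyword_hits = keyword_search(query_text, limit=k * 5)       # tsvector FTS / ILIKE
# top_chunks   = merge_hybrid(vector_hits, keyword_hits, k=10)
```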
On the backend, pub-sub workers handle the indexing pipeline: text extraction, chunking, batch embedding in groups of 100, and face detection via AWS Rekognition on images pulled from PDFs (very good in some cases, not so much in others). The query endpoint is free with some rate limiting, but a valid x402 micropayment ($0.50) bypasses the rate limits (these queries aren't cheap to run right now). There's also an MCP server so AI agents can query it directly as a tool.
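The batching in the embedding step is the simple part; a rough sketch of it (with a hypothetical `embed_batch` callable standing in for whatever embedding client the worker actually uses) looks like this:

```python
from typing import Callable, Sequence

BATCH_SIZE = 100  # chunks are embedded in groups of 100 per API call

def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
) -> list[list[float]]:
    """Embed all chunks for a document, one embedding request per group of up to 100."""
    vectors: list[list[float]] = []
    for start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[start : start + BATCH_SIZE]
        vectors.extend(embed_batch(batch))  # one API call covers this whole batch
    return vectors
```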
Built with Claude's help, since some of the tech (LLM-driven RAG, pgvector, etc.) is new-ish to me. Was a fun exercise!