Would the 10M documents be searched with a single vector search, or would they be pre-filtered by other columns in your table first? Any pre-filtering naturally shrinks the search space and speeds things up. You will likely also want regular text / tsvector-based search alongside the vector search, and potentially feed its results to the LLM too, since vector search alone isn't perfect.
You would then decide whether to re-rank before handing results to the final LLM context window. Models these days are pretty good, so they will do their own re-ranking to some extent, but it depends on cost, latency, and the result quality you're after.
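If you do run both a vector search and a tsvector search, one cheap way to merge the two ranked lists before (or instead of) a heavier re-ranker is reciprocal rank fusion. A minimal sketch, where the document IDs and the k=60 constant are illustrative:

```python
def rrf(result_lists, k=60):
    """Reciprocal rank fusion: each doc scores 1/(k + rank),
    summed over every result list it appears in."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7", "d2"]   # e.g. from an ANN / pgvector query
keyword_hits = ["d1", "d9", "d3"]        # e.g. from a tsvector full-text query
fused = rrf([vector_hits, keyword_hits])
```

Docs that show up near the top of both lists float to the front, which is often good enough to decide what goes into the context window without paying for a dedicated re-ranking model.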