34 points by 44za1 24 hours ago | 6 comments
  • 2001zhaozhao 19 minutes ago
    Maybe the best "index" will just be markdown files fed into a tiny LLM.

    Is anyone using small, fast, low-latency LLMs to implement things like search as a RAG alternative? This could be the perfect use case for that Llama3 8B ASIC some company showed off a few months ago.
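
    A rough sketch of the idea (hedged: ask_llm below is a hypothetical stand-in for whatever small model or ASIC endpoint you'd actually call):

      from pathlib import Path

      def ask_llm(prompt: str) -> str:
          """Hypothetical stand-in for a small, fast, low-latency LLM."""
          raise NotImplementedError

      def search(query: str, docs_dir: str = "docs") -> list[str]:
          # Feed each markdown file to the tiny model; keep the ones it deems relevant.
          hits = []
          for path in Path(docs_dir).glob("**/*.md"):
              text = path.read_text()[:4000]  # stay within a small context window
              verdict = ask_llm(
                  f"Question: {query}\n\nDocument:\n{text}\n\n"
                  "Answer YES if this document helps answer the question, else NO."
              )
              if verdict.strip().upper().startswith("YES"):
                  hits.append(str(path))
          return hits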

  • andy99 36 minutes ago
    Mirrors my intuition. Semantic search explainability sucks. Keyword search is great because you can understand exactly what it is and isn't finding, and can intelligently iterate on it.

    In a single shot, semantic search may match better, but it can't really improve. With agentic search, keyword search gives great feedback. It's the same reason (or a similar one) people hate Google search now: it tries to do too much and you lose fine-grained control.
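
    A toy version of that feedback loop (propose is a hypothetical stand-in for the agent, which widens or narrows the query based on hit counts):

      import re
      from typing import Callable

      def grep(corpus: dict[str, str], pattern: str) -> list[str]:
          """Names of documents matching the pattern, grep-style."""
          rx = re.compile(pattern, re.IGNORECASE)
          return [name for name, text in corpus.items() if rx.search(text)]

      def agentic_search(corpus: dict[str, str],
                         propose: Callable[[str, int], str],
                         seed: str, rounds: int = 5):
          """Iterate on a keyword query, using hit counts as feedback."""
          query = seed
          for _ in range(rounds):
              hits = grep(corpus, query)
              if 0 < len(hits) <= 10:  # small, inspectable result set: done
                  return query, hits
              # Zero hits -> broaden; too many -> narrow. The agent decides how.
              query = propose(query, len(hits))
          return query, grep(corpus, query)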

  • kgeist 2 hours ago
    When I implemented retrieval in our production system a few months ago, one of the most important benchmarks was cross-language retrieval (query in one language, documents in another), which is a common situation in large enterprises (headquarters + branches). I suspect their idea will perform poorly if the source language and the target language are too different from one another, like English and Hindi (grep often will not return anything).

    Another requirement was keeping latency as low as possible (we managed to get under 5 seconds with 85%+ accuracy). Their approach seems to have very unpredictable latencies, sometimes up to thousands of seconds (which may be fine for background tasks), and it scales poorly with corpus size.

    Interesting research anyway, but I'd still stick with embedding/reranker-based retrieval (+ BM25 for hybrid search), because you don't waste time wandering around blindly each time, trying to find the minimal context to start from, when an index could have surfaced it immediately. Another issue is that research papers often implement subpar baselines for the approaches they compare against: when I was implementing retrieval, the straightforward implementation gave me 40% accuracy, and various tricks/parameter tuning pushed it to 85%+ without changing the overall architecture (about a month of experimentation).
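
    For reference, a minimal sketch of that hybrid baseline, assuming rank_bm25 for the lexical side and a placeholder embed() for whatever (ideally multilingual) embedding model you use, with results fused by reciprocal rank fusion; a cross-encoder reranker would then rescore the fused top-k:

      import numpy as np
      from rank_bm25 import BM25Okapi  # pip install rank-bm25

      def embed(texts: list[str]) -> np.ndarray:
          """Placeholder: your embedding model (multilingual for cross-language retrieval)."""
          raise NotImplementedError

      def hybrid_search(query: str, docs: list[str], k: int = 10, rrf_k: int = 60):
          # Lexical ranking (naive whitespace tokens; production needs a real tokenizer).
          bm25 = BM25Okapi([d.lower().split() for d in docs])
          lex_order = np.argsort(-bm25.get_scores(query.lower().split()))

          # Semantic ranking by cosine similarity.
          doc_vecs = embed(docs)
          q = embed([query])[0]
          sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
          sem_order = np.argsort(-sims)

          # Reciprocal rank fusion: score(d) = sum over rankers of 1 / (rrf_k + rank(d)).
          scores = np.zeros(len(docs))
          for order in (lex_order, sem_order):
              for rank, idx in enumerate(order):
                  scores[idx] += 1.0 / (rrf_k + rank + 1)
          return [docs[i] for i in np.argsort(-scores)[:k]]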

    • aubreypc an hour ago
      Would you mind sharing any lessons learned / which parameters you were experimenting with? I'm working on a Vespa hybrid lexical + HNSW retrieval system at the moment with quite a large corpus (1B+ vectors), so I'd be quite interested to hear what worked well for others.
    • dominotw 12 minutes ago
      > which is a common situation in large enterprises

      How was this done before LLMs and AI? Can you share some examples of these documents?

  • HarHarVeryFunny an hour ago
    It depends on your data, as well as what you are trying to optimize for: speed, cost, precision, etc.

    In many cases, cheap methods like grepping and BM25 just aren't going to work well, so semantic similarity is the best initial retriever/filter, followed by LLM-as-judge as a second filter/reranker if you need the precision.
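
    That two-stage shape is simple to express; a sketch, with retrieve and judge as hypothetical stand-ins for your semantic retriever and LLM judge:

      from typing import Callable

      def two_stage(query: str, docs: list[str],
                    retrieve: Callable[[str, list[str], int], list[str]],
                    judge: Callable[[str, str], float],
                    candidates: int = 50, k: int = 5) -> list[str]:
          # Stage 1: a cheap semantic retriever narrows the corpus to a shortlist.
          shortlist = retrieve(query, docs, candidates)
          # Stage 2: the expensive LLM-as-judge rescores only the shortlist.
          return sorted(shortlist, key=lambda d: judge(query, d), reverse=True)[:k]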

  • efskap 2 hours ago
    Makes sense that the agent can refine its search terms/strategy based on discovered context.

    But it still has to enumerate synonyms to find things.

    I would assume it's very domain-dependent: code or technical docs have more precise terminology that suits fixed-string search, while medical or legal text can have many, many ways to say the same thing.
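
    E.g. a fixed-string searcher in a fuzzy domain ends up carrying a synonym lexicon like this (entries made up for illustration):

      import re

      # Hypothetical domain lexicon: one concept, many surface forms.
      SYNONYMS = {
          "heart attack": ["heart attack", "myocardial infarction", "MI", "acute MI"],
      }

      def expand_and_grep(term: str, corpus: dict[str, str]) -> list[str]:
          """Fixed-string search over every known synonym of the term."""
          variants = SYNONYMS.get(term, [term])
          rx = re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
          return [name for name, text in corpus.items() if rx.search(text)]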

  • nivekney an hour ago
    Map-reduce as a pattern might be on its way back. Hear me out: high localization wins even when coverage is not super great. Just map over shards of the corpus and reduce the learnings, then rinse and repeat, running as many rounds of map and reduce over the corpus as needed until it converges. This can also work well when the cluster combines different agents, since they are all tasked via prompts anyway.
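
    Something like this, where map_shard and reduce_findings are hypothetical agent calls:

      from concurrent.futures import ThreadPoolExecutor

      def map_shard(query: str, shard: list[str], context: str) -> str:
          """Hypothetical agent call: given the findings so far, report what
          this shard adds about the query."""
          raise NotImplementedError

      def reduce_findings(query: str, findings: list[str]) -> str:
          """Hypothetical agent call: merge per-shard findings into one answer."""
          raise NotImplementedError

      def map_reduce_search(query: str, corpus: list[str],
                            n_shards: int = 8, rounds: int = 3) -> str:
          shards = [corpus[i::n_shards] for i in range(n_shards)]
          answer = ""
          for _ in range(rounds):
              # Map: one agent per shard, run in parallel; locality keeps each
              # context small even when total coverage is large.
              with ThreadPoolExecutor(max_workers=n_shards) as pool:
                  findings = list(pool.map(
                      lambda shard: map_shard(query, shard, answer), shards))
              # Reduce: merge the learnings; feed them into the next round.
              new_answer = reduce_findings(query, findings)
              if new_answer == answer:  # crude convergence check
                  return answer
              answer = new_answer
          return answer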