Show HN: Open-source Deep Research across workplace applications(github.com)

125 pointsby yuhongsun4 months ago15 comments

vishnudeva4 months ago
Onyx is as close as to magic you can get in this space. It just. Works.
I can talk for literally hours about how good it is when you connect it to your company's Confluence or Jira or Slack or Google Drive or a ton of other things. At a scale of many tens of thousands of documents too.
Their team is awesome too and completely tuned into exactly what their users need. And that it's open source is the cherry on top. No secret in how your data is being used.
- An incredibly happy user looking forward to more from Onyx
- yuhongsun4 months ago
  Amazing to hear from a happy user! Thanks for the kind words!
yuhongsun4 months ago
Before sharing how it works, I want to highlight some of the challenges of a system like this. Unlike deep research over the internet, LLMs aren’t able to easily leverage the built in searches of these SaaS applications. They each have different ways of searching for things, many do not have strong search capabilities, or they rely on their internal query language. There are also a ton of other signals that web search engines use that aren’t available natively in the tools. Examples include: backlinks, clickthrough rates, etc. Additionally a lot of teams rely on internal terminology that is unique to them and hard for the LLM to search for. There’s also the challenge of unifying the objects across all of the apps into a plaintext representation that works for the LLM.
The best way we’ve found to do this is to build a document index instead of relying on application native searches at query time. The document index is a hybrid index of keyword frequencies and vectors. The keyword component addresses issues like team specific terminology and the vector component allows for natural language queries and non-exact matching. Since all of the documents across the sources are processed prior to query time, inference is fast and all of the documents have already been mapped to an LLM friendly representation.
There are also other signals that we can take into account which are applied across all of the sources. For example, the time that a document was last updated is used to prioritize more recent documents. We also have models that run at indexing time to label documents and models that run at inference time to dynamically change the weights of the search function parameters.
- ai4fun20044 months ago
  Great work!
  When you talked about "document index is a hybrid index of keyword frequencies and vectors", I am a bit curious of how to get them. In pre-processing, do you have to use LLM / models go through documents to get keywords? What about vectors? Are you using embed model to generate them? Does that imply preprocess has to be done whenever there is new doc or any modification in existing doc? Would that be costly in time? Any spicy cook to make the preprocessing more efficient?
janalsncm4 months ago
The demo looked sharp but I am curious if you have done any formal evaluation of the quality of the results? For example, MRR and recall@k, even on a toy dataset? Seems like the quality of the generated responses will be highly dependent on the docs which are retrieved.
- yuhongsun4 months ago
  We have a dataset that we use internally to evaluate our search quality. It's more representative of our use case since it contains Slack messages, call transcripts, very technical design docs, company policies which is pretty different from what embedding models are typically trained on.
  We checked the recall at 4K tokens (which was a pretty typical token limit of the previous generation of LLMs) and we were at over 94% recall for our 10K document set. We also added a lot of noise to it (Slack messages from public Slack workspaces) to get hundreds of thousands of documents but recall remained at over 90%.
- robrenaud4 months ago
  I am also interested in how to do eval on an open source corporate search system. Privacy and information security make this challenging, right?
  - yuhongsun4 months ago
    On privacy and security, we are the only option (as far as I know) that you can connect up to all your company internal docs and have it be all processed locally to the deployment and stored at rest within the deployment.
    So basically you can have it completely airgapped from the outside world, the only tough part is the local LLM but there are lots of options for that these days.
laborcontract4 months ago
Cool product. Few Qs:
- What would you say is the agentic approach's special sauce over a typical RAG pipeline, ie query->multi-query generation->HyDE->vector search->bm25 search->RRF->rerank->evaluate->(retry|refuse|respond) that differentiates the approach?
- If a user has 20 services connected, how does the agent know how to call/search/traverse the information in the right order?
- Do you have any internal evals on how well the different model affect the overall quality of output, esp for a "deep search" type of task? I have model-picker fatigue.
- Do you plan to implement knowledge graphs in the future?
- yuhongsun4 months ago
  Quite a lot to cover here! So in addition to the typical RAG pipeline, we have many other signals like learning from user feedback, time based weighting, metadata handling, weighting between title/content, and different custom deep learning models that run at inference and indexing time all to help the retrieval. But this is all part of the RAG component.
  The agent part is the loop of running the LLM over RAG system and letting it decide which questions it wants to explore more (some similarities to retry|refuse|respond I guess?). We also have the model do CoT over its own results including over the subquestions it generates.
  Essentially it is the deep research paradigm with some more parallelism and a document index backing it.
  How does the agent traverse the information: there are index-free approaches where the LLM has to use the searches of the tools. This gives worse results than approaches that build a coherent index across sources. We use the latter approach. So the search occurs over our index which is a central place for all the knowledge across all connected tools.
  Do you have any internal evals on how well the different model affect the overall quality of output, esp for a "deep search" type of task? I have model-picker fatigue: Yes, we have datasets that we use internally. It comprises of "company type data" rather than "web type" data (like short Slack messages, very technical design documents, etc.) comprising about 10K documents and 500 questions.
  For which model to use: it was developed primarily against gpt-4o but we retuned the prompts to work with all the recent models like Claude 3.5, Gemini, Deepseek, etc.
  Do you plan to implement knowledge graphs in the future? Yes! We're looking into customizing LLM based knowledge graphs like LightGraphRAG (inspired by, but not the same).
  - timhigins4 months ago
    Do you think this indexing architecture would bring benefits to general web research? If implemented like: planner, searches, index webpages in chunks, search in index, response
    Would you ever extend your app to search the web or specialized databases for law, finance, science etc?
- janalsncm4 months ago
  If I understand correctly, they are indexing all of the docs together rather than relying on the agent to retrieve them.
  - lxe4 months ago
    That sounds like RAG though, right?
    yuhongsun4 months ago
    It's like how OpenAI's deep research works by searching the internet, ours works by searching over our "RAG" system that indexes company documents.
murshudoff4 months ago
From the demo it looks great. And the UX is pretty sick! Can it be used via API in the self-hosted version? If yes, what are the limitations over UI? In my use-case I would like to use it as a RAG infra and together with tool calling and generative UI in the client code (already implemented) it would complete the picture.
nikisweeting4 months ago
Very cool! One question: how do you handle permissions?
Different apps have different permissions models, not everyone is allowed to see everything. Do you attempt to model this complexity at all or normalize it to some general permissions model?
- yuhongsun4 months ago
  This is a large challenge in itself actually. Every external tool has it's own framework for permissions (necessarily so).
  For example, Google Drive docs have permissions like "global public", "domain public", "private" where "private" is shared with users and groups and there's also the document owner.
  Slack has public channels, private channels, DMs, group DMs.
  So we need to map these external objects and their external users/groups into a unified representation within Onyx.
  Then there are additional challenges like rate limiting so we cannot poll at subsecond intervals.
  The way that we do it is we have async jobs that check for object permission updates and group/user updates against the external sources at a configurable frequency (with defaults that depend on the external source type).
  Of course, always failing-closed instead of failing-open and defaulting to least permissive.
- thebeardisred4 months ago
  It appears the answer (if you want differentiated permissions, e.g. user vs admin, or role based access control) is "purchase the enterprise edition" - https://docs.onyx.app/enterprise_edition/overview
  edit: added clarification
PhilippGille4 months ago
Based on the Show HN description I was about to ask "How does it differ from Danswer", but apparently they rebranded from Danswer to Onyx.
Danswer Show HN: https://news.ycombinator.com/item?id=36667374
Danswer Launch HN: https://news.ycombinator.com/item?id=39467413
monkeydust4 months ago
I have recently found myself drawing parallels between deep research and knowledge working org structures.
For example,board exec asks senior exec a question about a particular product. The senior exec then has to fire off emails to say 5 managers, who might go down their tree to ICs, all the info is gathered and synthesised into a response.
Normally this response takes into account some angle the senior exec might have.
A lot of knowledge working tasks follow this pattern that I have somewhat simplified.
- yuhongsun4 months ago
  Yup, hopefully with Onyx, the folks who have these questions can just fire off a query with agent mode turned on and the LLM will research the relevant tree of knowledge and come back with an answer in the fraction of the time it would take for people to do it with all of the handoffs in between.
  - monkeydust4 months ago
    Yes. I suspect it won't, right now, be as good a response but where time or cost matters an 80% effective response in 10% of the time it would have taken or say 5% of cost (if you were to 'dollarize' worker effort) would present useful options to those asking.
mentalgear4 months ago
Thank you for open-sourcing this - very impressive and great UI ! This also marks the need for comprehensive benchmarks for the "large knowledge search, retrieval and analysis" category for users to effectively compare solutions and vendors to objectively compete on.
codenote4 months ago
Cool! As others have already commented, I'm curious about how well it can handle data access while considering the permissions of external services.
canadiantim4 months ago
How is the data stored? E.g. for concerns about internal data leaking out?
- yuhongsun4 months ago
  Assuming self-hosting, data is processed within the deployment with local deep learning models for embedding, identifying low information documents, etc. A hybrid keyword/vector index is built locally within the deployment as well.
  At rest, the data is stored in Postgres and Vespa (the hybrid index), both of which are part of the deployment so it's all local.
  The part that typically goes external is the LLM but many teams also host local LLMs to use with Onyx. In either case, the LLM is not being finetuned, the knowledge relevant to the question is passed in as part of the user message.
  We built Onyx with data security in mind so we're very proud of the way the data flows within the system. We made the system work well with models that can run without GPUs as well so our users can get good quality results even if deploying on a laptop.
lxe4 months ago
What's the difference between deep research and RAG?
- yuhongsun4 months ago
  RAG is a tool for the deep research agent to use in finding all of the context it needs. Deep research can call the search many times and reflect on the results of the previous searches and then search for other things as needed. Deep research flows can also generate chain-of-thought type outputs that is not for searching or for directly answering the user.
shrisukhani4 months ago
This is really neat guys - congrats!
sunrabbit4 months ago
Oh, so cool source
DebugDetective4 months ago
[dead]