Not quite. You should clarify a bit more. The README has this about their license.
"Certain features - such as Morphik Console - are not available in the open-source version. Any feature in the ee namespace is not available in the open-source version and carries a different license. Any feature outside that is open source under the MIT expat license."
Does such a project exist?
https://github.com/rmusser01/tldw
I first built a POC in Gradio and am now rebuilding it as a FastAPI app. The media processing endpoints work, but I'm still tweaking media ingestion to allow for syncing to clients (the idea is to allow for a client-first design). The GitHub repo doesn't show any of the recent changes, but if you check back in 2-3 weeks, I think I'll have the API version pushed to the main branch.
Since people will be curious, one lesser thing I used this for is a diary/assistant, and it's nice to have the peace of mind that I can dump my innermost thoughts without any concern for oversharing.
Curious about the suitability of this for PDFs that are conference presentation slides vs. academic papers. Is this sensitive or tunable to such distinctions?
Looking for tests/validation; are they all in the evaluation folder? A Pharma example would be great.
Thank you for documenting the telemetry. I appreciate the ee commercialization dance :)
Creating graphs and entity resolution are both tunable with overrides: you can specify domain-specific prompts and overrides (will add a pharma example!) (https://docs.morphik.ai/python-sdk/create_graph#parameters). I tried to add code here, but it was formatting badly, sorry for the redirect.
Minor nitpick, but the README for your ui-component project under ee says:
"License This project is part of Morphik and is licensed under the MIT License."
However, your ee folder has an "enterprise" license, not the MIT license.
For the metadata extraction, we save these as Column(JSONB) for each document, which allows the metadata to be changed on the fly.
Although I keep wondering if it would have been better to use something like MongoDB for this part, just because it's more natural.
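To make the "changed on the fly" point concrete, here is a minimal sketch of schemaless per-document metadata. It is not Morphik's code: it uses SQLite with a JSON text column as a stand-in for a Postgres JSONB column so it runs with the standard library alone, and the table, column, and function names are all hypothetical.

```python
import json
import sqlite3

# Stand-in for a Postgres JSONB column: SQLite keeps the metadata as JSON text.
# The schema below is an assumption for illustration, not Morphik's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, doc_metadata TEXT)"
)


def add_document(name: str, metadata: dict) -> int:
    """Insert a document with arbitrary metadata and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (name, doc_metadata) VALUES (?, ?)",
        (name, json.dumps(metadata)),
    )
    return cur.lastrowid


def get_metadata(doc_id: int) -> dict:
    """Load the metadata dict for one document."""
    row = conn.execute(
        "SELECT doc_metadata FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0])


def update_metadata(doc_id: int, **changes) -> dict:
    """Merge new keys into a document's metadata; no schema migration needed."""
    metadata = get_metadata(doc_id)
    metadata.update(changes)
    conn.execute(
        "UPDATE documents SET doc_metadata = ? WHERE id = ?",
        (json.dumps(metadata), doc_id),
    )
    return metadata
```

New keys can be added per document without touching the table definition, which is the same flexibility a document store would give; with real JSONB you would additionally get indexing and in-database queries on the keys.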
Please let me know if you have questions and how it works out for you.
If you're using plain .txt files, then plain RAG built on top of any vector database can suffice, depending on your queries (if they directly reference the text, or can be made to, then similarity search is good enough). If the queries are cross-document, setting a high number of chunks to retrieve with plain RAG might also do a good job.
If you have tables, images, etc., then using a better extraction mechanism (maybe unstructured, or other document processors) and then creating the embeddings can also work well.
I'd say if docs are simple, then just building your own pipeline on top of a vector db is good!
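As a rough illustration of the "similarity search over chunks" idea above, here is a toy retriever. It is only a sketch: the bag-of-words `embed` function is a stand-in for a real embedding model, the in-memory list is a stand-in for a vector database, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

For queries that span documents, raising `k` is the "high number of chunks" knob mentioned above; the retrieved chunks would then be passed to the LLM as context.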
I'd be happy to report back after some testing. We are looking to optimize more of this soon, as speed is somewhat of a missing piece at the moment.