With all the licensing complexities around Elastic, more choice is not necessarily bad.
The tradeoff with using S3 is indexing latency (the time between a write being accepted and becoming visible via search) vs. easy scaling. The default refresh interval (the time the search engine waits before committing changes to an index) is 1 second. That means it can take up to 1 second before indices get updated with recently added data. A common performance tweak is to increase this to 5 or more seconds. That reduces the number of writes and can improve write throughput, which is helpful when you are writing lots of data.
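That tweak is a per-index setting. A minimal sketch of applying it, assuming a local Elasticsearch cluster at `localhost:9200` and an index named `logs` (both illustrative):

```python
import json
import urllib.request

# Raise the refresh interval from the 1s default to 5s, trading
# indexing latency for write throughput.
settings = {"index": {"refresh_interval": "5s"}}

req = urllib.request.Request(
    "http://localhost:9200/logs/_settings",
    data=json.dumps(settings).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment to run against a live cluster
```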
If you need low latency (anything where users might want to "read" their own writes), clustered approaches are more flexible. If you can afford to wait a few seconds, using S3 to store stuff becomes more feasible.
Lucene internally stores documents in segments. Segments are append-only, and there are cleanup activities around rewriting and merging segments, e.g. to get rid of deleted documents or to deal with fragmentation. Once segments are written, running some background jobs to merge them isn't that hard. My guess is that with S3, the trick is to batch up some amount of writes, store them as one segment, and put that in S3.
S3 is not a proper file system, and file operations are relatively expensive (compared to a file system) because they are essentially REST API calls. So this favors use cases where you write segments in bulk and never/rarely update or delete individual things that you write, because that would require updating a segment in S3, which means deleting and rewriting it and then somehow notifying other nodes that they need to re-read that segment.
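That write pattern can be sketched as a buffer that accumulates documents and flushes each batch as one immutable object. Here `put_object` stands in for a real S3 client call (e.g. boto3's `put_object`); the class, key scheme, and JSONL format are all illustrative, not how any real engine serializes segments:

```python
import json

class SegmentBuffer:
    """Batch incoming documents in memory and flush them to object
    storage as one immutable "segment" per flush, so each segment
    costs one API call and is never rewritten in place."""

    def __init__(self, put_object, max_docs=1000):
        self.put_object = put_object  # callable(key, bytes)
        self.max_docs = max_docs
        self.docs = []
        self.segment_id = 0

    def add(self, doc):
        self.docs.append(doc)
        if len(self.docs) >= self.max_docs:
            self.flush()

    def flush(self):
        if not self.docs:
            return None
        key = f"segments/segment-{self.segment_id:08d}.jsonl"
        body = "\n".join(json.dumps(d) for d in self.docs).encode()
        self.put_object(key, body)  # one bulk write, never updated
        self.segment_id += 1
        self.docs = []
        return key
```

Deletes and updates then have to be handled out of band (tombstones plus periodic merge), which is exactly why write-once workloads fit this model best.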
For both Elasticsearch and OpenSearch, log data and other time series data fit this very well because you typically don't have to deal with deletes or updates.
Really, nothing is ever new in computing.
Hair-splitting: I don't believe Blob Storage is S3 compatible, so one may want to consider rewording to distinguish between whether it really, no kidding, needs to be "S3 compatible" or whether that's a euphemism for "key-value blob storage".
I'm fully cognizant of the 2017 nature of this, but even they are all "use Minio" https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo... which I guess made a lot more sense before its license change. There's also a more recent question from 2023 (by an alleged Microsoft Employee!) with a very similar "use this shim" answer: https://learn.microsoft.com/en-us/answers/questions/1183760/...
The alternative take is that trying to have a party on your remote island is objectively a bad party because no one else will come
Okay I'm being a bit harsh, but honestly gateway-mode Minio was 1000% better than their current offering.
It is based on Tantivy, a Lucene alternative in Rust. I have extensive hands-on experience with both and I highly recommend Tantivy: it's just superior in every way now, such a pleasure to use, and an ideal example of what Rust was designed for.
Where can I find more information on using it for user-facing search? The repository [0] starts with "Cloud-native search engine for observability (logs, traces, and soon metrics!)" and keeps talking about those.
Their original breakout demo was on Common Crawl: https://common-crawl.quickwit.io/
But thanks for pointing it out, I hadn't looked at it in a few months, it looks like they significantly changed their pitch in the last year. I assume they got VC money and they need to deliver now.
I tried "England is" and a few similar queries. It spends three seconds then shows that nothing is found.
The blog post about the demo is from 2021 and they haven't promoted it much since. I'm surprised that they even kept it online, according to the sidebar it was ~$810/month in AWS at the time.
It's a bot I have had running in some capacity for some years now, through a couple of rewrites. At some point Discord added "auto-complete" for commands, which means I can do a live lookup and give users a list of comics containing some piece of text.
My index is a bit out of date, but comics from before September last year can be searched.
The search index lives fully in memory, as it is not that big: only 17,363 comics. This does mean that it is rebuilt on every startup, but that does not take long compared to the month-long uptime it usually has.
Example of a search for "funny joke": https://imgur.com/a/J4sRhPJ
Hosted bot: https://discord.com/application-directory/404364579645292564
Source code: https://git.sr.ht/~erk/lasagna
https://www.meilisearch.com/docs/learn/resources/comparison_...
All of that said, I still use it because it has sucked less than the other search engines to run.
It lacks tons of features ES and Solr have, most notably geo search, but what it does it does a lot faster.
https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/...
Tantivy, like Lucene, is a low-level library for building your own search engine (Quickwit is one); it's not a search engine in itself. Kind of like how DBs are built on top of key-value stores. But you can definitely build a CRUD abstraction on top of it.
Please do advise if I’ve missed something here. I was really excited about using quickwit for a project but have gone with Meilisearch precisely for these reasons. Otherwise it would be quickwit all the way.
I'm guessing Warp is Warpstream which I have been chomping at the bit to try out: https://hn.algolia.com/?q=warpstream
Once you hook these backends up to real-time streaming updates, transactions, heavy indexing, or immutable backends that cause constant churn (hive/hudi/iceberg/delta lake), you're in for a bad time financially.
As someone who has spent the last decade and a half getting alerts from RDBMSes, I'm basically at the point that if you think your system requires more than object storage for state management, I don't want to be involved.
My last company looked at rolling out elastic/open search to alleviate certain loads from our db, but it became clear it was just going to be a second monstrously complicated system that was going to require a lot of care and feeding, and we were probably better off spending the time trying to squeeze some additional performance out of our DB.
Personally I think you might actually need a DB to do the work of a DB, and you can't as easily build one on top of a blob store as on a block device. But I do think most distributed systems should use blob and/or DB and not the filesystem.
Ops is easier, for the most part. Doing ops on an RDBMS correctly can be a pain: things like replication, failover, performance tuning, etc. can be hard. But this is much less of an issue in practice because services like RDS have solved it for a long time. Not a huge issue there.
Splitting compute from storage makes scaling a lot easier, especially when storage is an object store system where you don't have to worry about RAID, disk backups, etc etc. Especially for clustered systems like elasticsearch, having object store backing would be incredible: if you need to spin up/down a new server, instead of starting it, convincing it to download the portions of the indexes it's supposed to and waiting for everything to transfer, you just start it and let it run immediately. You can also now run 80% spot instances for your compute nodes because if one gets recalled, the replacement doesn't have to sync all its state from the other servers, it can just go to business as usual, and a sudden loss of 60% of your nodes doesn't mean data loss like it does if your nodes are holding all the state.
I think for something like an RDBMS, object-store backing is very likely completely overkill, unless you're hitting some scaling threshold that most of us don't deal with ever. For clustered DB systems (cassandra/scylla, ES, etc etc), splitting out storage makes cluster management, scalability, and resiliency worlds easier.
In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core! More details are provided in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback but it didn't work, I guess!
That often involves recording the original trigram positions, but I think that's necessary to weigh "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".
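A minimal sketch of that idea: record where each trigram occurs, then boost documents whose matched trigrams sit close together. The scoring formula here is made up purely for illustration, not any real engine's ranking:

```python
def trigrams_with_positions(text):
    """Map each character trigram to the list of offsets where it occurs."""
    t = text.lower()
    out = {}
    for i in range(len(t) - 2):
        out.setdefault(t[i:i + 3], []).append(i)
    return out

def proximity_score(doc, query):
    """Count matched trigrams, with a bonus when their first occurrences
    span a short window, so adjacent matches rank higher than scattered ones."""
    index = trigrams_with_positions(doc)
    hits = [index[g][0] for g in trigrams_with_positions(query) if g in index]
    if not hits:
        return 0.0
    spread = max(hits) - min(hits) + 1
    return len(hits) + len(hits) / spread

docs = ["I like happy cats",
        "I like happy dogs but I don't like cats"]
scores = [proximity_score(d, "happy cats") for d in docs]
```

With positions recorded, the first document scores higher because its matches for "happy" and "cats" are contiguous, even though both documents contain all the query terms.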
In many ways Solr has come a long way since, and I'm curious to see how well they can make a similar system perform in the cloud environment.
Answer: it is not.
You can certainly jump over it but I imagine a number of people like myself just skip the article entirely.
Just started a search project, and this one will be on the list for sure.
I think rclone mount can do the same thing with its chunked reads + cache; I wonder what the memory overhead for the process is.
Local caching is a key element of such architectures, otherwise S3 is too slow and expensive to query.
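A sketch of what such a local cache might look like: a small LRU read-through layer in front of the object store, so hot segments are served locally instead of paying an S3 GET (latency plus per-request cost) on every query. `fetch` stands in for the remote read; all names are illustrative:

```python
import collections

class ReadThroughCache:
    """LRU read-through cache: serve repeated reads locally and only
    hit the remote object store on a miss."""

    def __init__(self, fetch, capacity=128):
        self.fetch = fetch          # callable(key) -> bytes, the remote GET
        self.capacity = capacity
        self.cache = collections.OrderedDict()
        self.remote_reads = 0       # how many times we actually hit "S3"

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        self.remote_reads += 1
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value
```

Real systems typically cache on local disk rather than in memory and evict by byte budget rather than entry count, but the read-through shape is the same.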
The universe is a stateful organism in constant flux.
Put another way: brushing-it-under-the-rug as a service.
To put it another way: serverless and stateless don't mean what you think they mean.
Compute availability is lumped into one gigantic pool, and all of the concerns below the execution of your code are managed for you.