121 points by shutty | 2 days ago | 23 comments
  • jillesvangurp2 days ago
    Both Elastic and Opensearch also have S3-based stateless versions of their search engines in the works. The Elastic one is available in early access currently. It would be interesting to see how this one improves on both approaches.

    With all the licensing complexities around Elastic, more choice is not necessarily bad.

    The tradeoff with using S3 is indexing latency (the time between a write being accepted and becoming visible via search) vs. easy scaling. The default refresh interval (the time the search engine waits before committing changes to an index) is 1 second. That means it takes up to 1 second before indices get updated with recently added data. A common performance tweak is to increase this to 5 or more seconds. That reduces the number of writes and can improve write throughput, which is helpful when you are writing lots of data.
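    As a concrete illustration, bumping the refresh interval is a one-call settings change against the standard Elasticsearch/OpenSearch REST API (a minimal sketch; the host and the "logs" index name are placeholders):

        import requests

        # Raise the refresh interval from the default 1s to 5s. Fewer refreshes
        # mean fewer, larger Lucene segments and better write throughput, at the
        # cost of how quickly new documents become searchable.
        resp = requests.put(
            "http://localhost:9200/logs/_settings",
            json={"index": {"refresh_interval": "5s"}},
        )
        resp.raise_for_status()
        print(resp.json())  # -> {'acknowledged': True}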

    If you need low latency (anything where users might want to "read" their own writes), clustered approaches are more flexible. If you can afford to wait a few seconds, using S3 to store stuff becomes more feasible.

    Lucene internally stores documents in segments. Segments are append-only, and there tend to be cleanup activities related to rewriting and merging segments, e.g. to get rid of deleted documents or to deal with fragmentation. Once segments are written, having some jobs merge them in the background isn't that hard. My guess is that with S3, the trick is to gather up some amount of writes and then store them as one segment in S3.
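    A minimal sketch of that gather-and-flush idea (this is speculation, not how Nixiesearch actually serializes segments; the bucket, prefix, and JSONL format are made up, and boto3 is assumed):

        import json
        import uuid

        import boto3

        s3 = boto3.client("s3")
        BUCKET, PREFIX, FLUSH_SIZE = "my-index-bucket", "segments/", 1000

        buffer = []

        def index_doc(doc: dict) -> None:
            """Buffer a document; flush a whole segment once the buffer is full."""
            buffer.append(doc)
            if len(buffer) >= FLUSH_SIZE:
                flush()

        def flush() -> None:
            """Write the buffered docs as one immutable segment object in S3."""
            if not buffer:
                return
            key = f"{PREFIX}{uuid.uuid4()}.jsonl"
            body = "\n".join(json.dumps(d) for d in buffer).encode()
            s3.put_object(Bucket=BUCKET, Key=key, Body=body)
            buffer.clear()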

    S3 is not a proper file system, and file operations are relatively expensive (compared to a file system) because they are essentially REST API calls. So this favors use cases where you write segments in bulk and rarely or never update or delete individual things that you write, because that would require updating a segment in S3, which means deleting and rewriting it and then somehow notifying other nodes that they need to re-read that segment.

    For both Elasticsearch and Opensearch, log data or other time-series data fits this very well because you typically don't have to deal with deletes/updates.

    • rakooa day ago
      I'm wondering if it would be better to have a LevelDB-like approach here: store the recent stuff in DynamoDB, and once it hits a threshold, roll it into a segment in S3. This is also similar to how SQLite works with a WAL.
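      A rough sketch of that write path (assuming boto3; the table, bucket, and JSONL segment format are all hypothetical):

          import json
          import uuid

          import boto3

          dynamodb = boto3.client("dynamodb")
          s3 = boto3.client("s3")
          TABLE, BUCKET, THRESHOLD = "recent-docs", "index-segments", 500

          def write(doc_id: str, doc: dict) -> None:
              """Land fresh writes in DynamoDB so they are durable and readable immediately."""
              dynamodb.put_item(
                  TableName=TABLE,
                  Item={"id": {"S": doc_id}, "body": {"S": json.dumps(doc)}},
              )

          def compact() -> None:
              """Once enough rows pile up, roll them into one immutable S3 segment."""
              rows = dynamodb.scan(TableName=TABLE, Limit=THRESHOLD)["Items"]
              if len(rows) < THRESHOLD:
                  return
              body = "\n".join(r["body"]["S"] for r in rows).encode()
              s3.put_object(Bucket=BUCKET, Key=f"segments/{uuid.uuid4()}.jsonl", Body=body)
              for r in rows:
                  dynamodb.delete_item(TableName=TABLE, Key={"id": r["id"]})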

      Really, nothing is ever new in computing.

  • mdaniel2 days ago
    > Nixiesearch uses an S3-compatible block storage (like AWS S3, Google GCS and Azure Blob Storage)

    Hair-splitting: I don't believe Blob Storage is S3-compatible, so one may want to consider rewording to distinguish whether it really, no kidding, needs "S3 compatible" or whether that's a euphemism for "key-value blob storage".

    I'm fully cognizant of the 2017 nature of this, but even they are all "use Minio" https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo... which I guess made a lot more sense before its license change. There's also a more recent question from 2023 (by an alleged Microsoft Employee!) with a very similar "use this shim" answer: https://learn.microsoft.com/en-us/answers/questions/1183760/...

    • ko_pivot2 days ago
      Azure is the only major (or even minor) cloud provider refusing to build an S3 API. Strange to me, because Azure Cosmos DB supports Mongo and Cassandra at the API level, for example, so idk what is so offensive to them about S3 becoming the standard HTTP API for object storage.
      • ignaloidas2 days ago
        It's because the S3 API is quite a bit worse than what they offer. They define their guarantees for storage products way more clearly than other clouds, and for blob storage, from my understanding, their model is better than S3's.
    • aftbit11 hours ago
      Oh yeah, back when Minio was good, before they decided that POSIX filesystems were "legacy technology" and that nobody would want to use their product in combination with the cloud, only as an equally-locked-in replacement.

      Okay I'm being a bit harsh, but honestly gateway-mode Minio was 1000% better than their current offering.

  • oersted2 days ago
    Check out Quickwit; it is briefly mentioned but, I think, mistakenly dismissed. They have been working on a similar concept for a few years and the results are excellent. It's in no way mainly for logs as they claim; it is a general-purpose cloud-native search engine like the one they suggest, and very well engineered.

    It is based on Tantivy, a Lucene alternative in Rust. I have extensive hands-on experience with both and I highly recommend Tantivy: it's just superior in every way now, such a pleasure to use, an ideal example of what Rust was designed for.

    • Semaphor2 days ago
      > It’s in no way mainly for logs as they claim

      Where can I find more information on using it for user-facing search? The repository [0] starts with "Cloud-native search engine for observability (logs, traces, and soon metrics!)" and keeps talking about those.

      [0]: https://github.com/quickwit-oss/quickwit

      • oersted2 days ago
        That just seems to be the market where search engines have the most obvious business case; Elasticsearch positioned itself in the same way. But both are general-purpose full-text search engines perfectly capable of any serious search use case.

        Their original breakout demo was on Common Crawl: https://common-crawl.quickwit.io/

        But thanks for pointing it out. I hadn't looked at it in a few months, and it looks like they significantly changed their pitch in the last year. I assume they got VC money and need to deliver now.

        • AsianOtter2 days ago
          But the demo does not work.

          I tried "England is" and a few similar queries. It spends three seconds then shows that nothing is found.

          • oersted2 days ago
            I tried it once and it instantly showed no results, but then I tried it again and it returned results in <1s. Just try it with a bunch of queries; I think there's caching too, so it's hard to gauge performance properly.

            The blog post about the demo is from 2021 and they haven't promoted it much since. I'm surprised that they even kept it online, according to the sidebar it was ~$810/month in AWS at the time.

            • fulmicotona day ago
              Yes. We should shut down this demo. We reduced the hardware to cut down our costs. Right now it runs on a ludicrously small amount of hardware.
    • lsowena day ago
      Has anyone tried openobserve (https://github.com/openobserve/openobserve)? How does it compare/contrast to Quickwit as an "Elasticsearch for logs" replacement?
    • erk__2 days ago
      I have been using Tantivy for a Garfield comic search for a few years now, and it has been really nice to use in all that time.
      • jprd2 days ago
        I'm simultaneously intrigued and thinking this is a funny joke. If this isn't a joke, I would love an example.
        • erk__a day ago
          Luckily it is not a joke!

          It's a bot I have had running in some capacity for some years now, through a couple of rewrites. At some point Discord added "auto-complete" for commands, which means I can do a live lookup and give users a list of comics containing some piece of text.

          My index is a bit out of date, but comics from before September last year can be searched.

          The search index lives fully in memory, as it is not that big: only 17,363 comics. This does mean that it is rebuilt on every startup, but that does not take long compared to the month-long uptime it usually has.

          Example of a search for "funny joke": https://imgur.com/a/J4sRhPJ

          Hosted bot: https://discord.com/application-directory/404364579645292564

          Source code: https://git.sr.ht/~erk/lasagna

    • ZeroCool2u2 days ago
      • notamya day ago
        Meilisearch is great when it works, but when it breaks it's a total nightmare. I've hit multiple bugs that destroyed my search index, I've hit multiple undocumented limits, ... that all required rebuilding my index from scratch and doing a lot of work to find what was actually going on to report it. It doesn't help that some of the errors it gives are incredibly non-specific and make it quite difficult to find what's actually breaking it.

        All of that said, I still use it because it has sucked less than the other search engines to run.

      • orthecreedence2 days ago
        Does Meili support object store backends?
    • fiedziaa day ago
      > Tantivy, it’s just superior in every way now

      It lacks tons of features ES and Solr have, most notably geo search, but what it does it does a lot faster.

    • bomewish2 days ago
      The big issue with tantivy I've found is that it only deals with immutable data. So it can't be used for anything you want to do CRUD on. This rules out a LOT of use cases. It's a real shame imo.
      • pentlander2 days ago
        I’m pretty sure that Lucene is exactly the same, the segments it creates are immutable and Elastic is what handles a “mutable” view of the data. Which makes sense because Tantivy is like Lucene, not ES.

        https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/...

      • oersted2 days ago
        It is indeed mostly designed for bulk indexing and static search. But that is not a strict limitation; frequent small inserts and updates are performant too. Deleting can be a bit awkward: you can only delete every document with a given term in a field, but if you use that on a unique id field it's just like a normal delete.

        Tantivy is a low-level library for building your own search engine (as Quickwit does); like Lucene, it's not a search engine in itself. Kind of like how DBs are built on top of key-value stores. But you can definitely build a CRUD abstraction on top of it.
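        A rough sketch of what such a wrapper could look like (the SegmentWriter here is a hypothetical in-memory stand-in for a Tantivy/Lucene-style writer, not the real API):

            class SegmentWriter:
                """Stand-in for an append-only index writer: documents are only
                ever appended, and deletes are recorded as tombstones against a
                term (here, the unique "id" field)."""

                def __init__(self) -> None:
                    self.docs, self.tombstones = [], set()

                def add_document(self, doc: dict) -> None:
                    self.docs.append(doc)

                def delete_term(self, field: str, value: str) -> None:
                    self.tombstones.add((field, value))

                def commit(self) -> list:
                    # A merge drops tombstoned docs, like a Lucene segment merge.
                    return [d for d in self.docs if ("id", d["id"]) not in self.tombstones]

            def upsert(writer: SegmentWriter, doc: dict) -> None:
                """CRUD-style update on an append-only index: delete the old
                version by its unique id term, then append the new version."""
                writer.delete_term("id", doc["id"])
                writer.add_document(doc)

        A real version over Tantivy would of course go through its actual writer API rather than this toy, but the delete-old-id-then-append pattern is the same.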

        • bomewish20 hours ago
          Certainly one could build that on top of quickwit — which also doesn’t allow CRUD — but it’s not trivial. You need to insert a new record with the changes, then delete the record you want to update. The docs note that the latter action is expensive and might take hours(!). Then one would need a separate process to ensure this all went down appropriately (say the server crashes after inserting the updated record but before the delete succeeds). Meanwhile you’ve got almost-identical records showing up in search. Just not very nice for anything involving CRUD.

          Please do advise if I’ve missed something here. I was really excited about using quickwit for a project but have gone with Meilisearch precisely for these reasons. Otherwise it would be quickwit all the way.

    • groundera day ago
      Quickwit indicates it is for immutable data and not to be used for mutable data. Is that the case in your experience?
    • victor1062 days ago
      Thanks for this info.
  • gyre0072 days ago
    It took us almost two decades, but truly cloud-native architectures are finally becoming a reality. Warp and Turbopuffer are some of the many other examples.
    • candiddevmike2 days ago
      Curious what your definition of cloud native is and why you think this is a new innovation. Storing your state in a bunch of files on a shared disk is a tale as old as time.
      • cowsandmilk2 days ago
        Not having to worry about the size of the disk, for one. So much time on on-premise systems was spent managing quotas for systems and users alongside the physical capacity.
    • mdaniel2 days ago
      I didn't recognize Turbopuffer but a quick search coughed up a previous discussion: https://news.ycombinator.com/item?id=40916786

      I'm guessing Warp is Warpstream which I have been chomping at the bit to try out: https://hn.algolia.com/?q=warpstream

    • Sirupsena day ago
      Ya, the world needed S3 to become fully consistent. This didn't happen until the end of 2020!
  • hipadev23a day ago
    I know object storage backends are all the rage, but this is about the most capital-intensive thing you can do on the major cloud providers. Storage and reads are cheap, but writes and list operations are insanely expensive.

    Once you hook these backends up to real-time streaming updates, transactions, heavy indexing, or immutable backends that cause constant churn (hive/hudi/iceberg/delta lake), you're in for a bad time financially.

  • mikeocool2 days ago
    I love all of the software coming out recently backed by simple object storage.

    As someone who spent the last decade and a half getting alerts from RDBMSes, I’m basically at the point that if you think your system requires more than object storage for state management, I don’t want to be involved.

    My last company looked at rolling out elastic/open search to alleviate certain loads from our db, but it became clear it was just going to be a second monstrously complicated system that was going to require a lot of care and feeding, and we were probably better off spending the time trying to squeeze some additional performance out of our DB.

    • spaceribs2 days ago
      This is very much the Unix philosophy, right? Everything is a file? [1]

      [1]https://en.wikipedia.org/wiki/Everything_is_a_file

      • pjc502 days ago
        Not quite - "everything is a blob" has very different concurrency semantics to "everything is a POSIX file". You can't write into the middle of a blob, for example. This makes certain use cases harder but the concurrency of blobs is much easier to reason about and get right.
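        To make the contrast concrete (a sketch assuming boto3; the bucket and key names are made up):

            import boto3

            s3 = boto3.client("s3")

            # POSIX file: patch 4 bytes in place at offset 100.
            with open("data.bin", "r+b") as f:
                f.seek(100)
                f.write(b"\x00\x01\x02\x03")

            # Blob store: there is no partial write. Read the whole object,
            # modify it, and put the whole object back (last writer wins).
            obj = s3.get_object(Bucket="my-bucket", Key="data.bin")["Body"].read()
            patched = obj[:100] + b"\x00\x01\x02\x03" + obj[104:]
            s3.put_object(Bucket="my-bucket", Key="data.bin", Body=patched)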

        Personally I think you might actually need a DB to do the work of a DB, and you can't as easily build one on top of a blob store as on a block device. But I do think most distributed systems should use blob and/or DB and not the filesystem.

    • remram2 days ago
      On the other hand, the S3-compatible server options are quite limited. While you're not locking yourself to one cloud, you are locking yourself to the cloud.
      • mikeocool2 days ago
        At this point in my career, I’ve found that paying to make something hard someone else’s problem is often well worth it.
    • candiddevmike2 days ago
      Why would you prefer state management in object storage vs a relational (or document) database?
      • Two main reasons I can see:

        Ops is easier, for the most part. Doing ops on an RDBMS correctly can be a pain: things like replication, failover, performance tuning, etc. can be hard. That said, services like RDS have solved much of this for a long time, so it's not a huge issue there.

        Splitting compute from storage makes scaling a lot easier, especially when storage is an object store where you don't have to worry about RAID, disk backups, etc. Especially for clustered systems like Elasticsearch, having object-store backing would be incredible: if you need to spin up a new server, instead of starting it, convincing it to download the portions of the indexes it's supposed to hold, and waiting for everything to transfer, you just start it and let it run immediately. You can also now run 80% spot instances for your compute nodes, because if one gets recalled, the replacement doesn't have to sync all its state from the other servers; it can just go about business as usual, and a sudden loss of 60% of your nodes doesn't mean data loss like it does when your nodes hold all the state.

        I think for something like an RDBMS, object-store backing is very likely complete overkill, unless you're hitting some scaling threshold that most of us never deal with. For clustered DB systems (Cassandra/Scylla, ES, etc.), splitting out storage makes cluster management, scalability, and resiliency worlds easier.

      • mikeocool2 days ago
        So many fewer moving parts to manage/break.
  • warangal2 days ago
    I myself have been working on a personal search engine for some time, and one problem I faced was having an effective fuzzy search across all the diverse filenames/directories. All approaches I could find were based on Levenshtein distance, which would have required storing the original strings/text content in the index, and would have been neither practical for comparing larger strings nor generic enough to handle all knowledge domains. This led me to look at locality-sensitive hashing (LSH) approaches, which can measure the difference between any two strings in constant time. After some work I finally managed to complete an experimental fuzzy search engine (keyword search is just a special case!).
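    (My scheme has its own details, but as a generic illustration of the LSH idea, here is a minimal MinHash-over-character-n-grams sketch: each string gets a fixed-size signature, and similarity is estimated from the signatures rather than from the strings themselves.)

        import hashlib

        def ngrams(text: str, n: int = 3) -> set:
            text = text.lower()
            return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

        def minhash(text: str, num_hashes: int = 32) -> list:
            """Fixed-size signature: for each seeded hash function, keep the
            minimum hash value over the string's character n-grams."""
            grams = ngrams(text)
            return [
                min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
                for seed in range(num_hashes)
            ]

        def similarity(a: list, b: list) -> float:
            """Estimated Jaccard similarity: the fraction of signature slots that
            agree. Comparison cost depends on signature length, not string length."""
            return sum(x == y for x, y in zip(a, b)) / len(a)

        print(similarity(minhash("fuzzy search"), minhash("fuzy serch")))     # high
        print(similarity(minhash("fuzzy search"), minhash("segment merge")))  # low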

    In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core! More details are provided in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback, but that didn't work out, I guess!

    • iudqnolq2 days ago
      I'm super new to this so I'm probably missing something simple, but isn't a trigram index one of the canonical solutions for fuzzy search? Eg https://www.postgresql.org/docs/current/pgtrgm.html

      That often involves recording original trigram position, but I think that's necessary to weigh "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".

      • warangal2 days ago
        Yes, mainly trigrams, but also bigrams and/or a combination of both are generally used to implement fuzzy search; zoekt also uses a trigram index. But such indices depend heavily on the content being indexed: for example, if you ever encounter a rare trigram during querying that was not indexed, they fail to return relevant results. LSH implementations, on the other hand, employ a more diverse collection of stats, depending on the number of buckets and the n-gram/window size used, so they compare better against unseen content/bytes during querying. It is not cheap, as each hash is around 30 bytes, often more than the string/text being indexed. But it leads to fixed-size hashes independent of the size of the content indexed, and acts as an "auxiliary" index which can be queried independently of the original index. Comparison of hashes can be optimized, leading to quite fast fuzzy search.
  • mhitza2 days ago
    I used offline indexing with Solr back in 2010-2012, and this was because the latency between the Solr server and the MySQL db (indexing done via the dataimport handler) was causing the indexer to take hours instead of under an hour (same server vs. servers in the same datacenter).

    Solr has come a long way since then, and I'm curious to see how well they can make a similar system perform in a cloud environment.

  • novoreorxa day ago
    It seems that some of the goals and functionality of Nixiesearch overlap with those of Turbopuffer [1], though the latter focuses only on vector search. I also agree that a search engine should be stateless and affordable for everyone to deploy.

    [1]: https://turbopuffer.com/blog/turbopuffer

  • whalesalad2 days ago
    I recently got back into search after not touching ES since like 2012-2013. I forgot how much of a fucking nightmare it is to work with and query. Love to see innovation in this space.
    • staticautomatic2 days ago
      I feel like it’s not that bad to interact with if you do it regularly, but if I go a while without using it I forget how to do everything. I sure as hell wouldn’t want to admin an instance.
  • marginalia_nu2 days ago
    This would have been a lot easier to read without all the memes and attempts to inject humor into the writing. It's frustrating because it's an otherwise interesting topic :-/
    • prmoustache2 days ago
      How hard is it to just jump past them?

      Answer: it is not.

      • infecto2 days ago
        It generally is a major distraction from the content and feels like a pattern from a decade+ ago when technical blog posts became the hot thing to do.

        You can certainly jump over it but I imagine a number of people like myself just skip the article entirely.

      • vundercind2 days ago
        I like the style, but this case felt forced. Like when corporate tries to do memes.
      • Semaphor2 days ago
        It is.
  • mannyv2 days ago
    I forgot that a reindex on Solr/Lucene blows away the index. Now I remember how much of a nightmare that was, because you couldn't find anything until it was done - which usually took a few hours when things were HDD-based.

    Just started a search project, and this one will be on the list for sure.

  • tomhamera day ago
    I might be missing something, but how is this different from Amazon OpenSearch with UltraWarm storage? I think Amazon launched that about 4 years ago, right?
  • manx2 days ago
    I thought about creating a search engine using https://github.com/phiresky/sql.js-httpvfs, commoncrawl and cloudflare R2. But never found the time to start...
    • oersted2 days ago
      You will like this, then: that was the main demo from the Quickwit team.

      https://common-crawl.quickwit.io/

    • mallets2 days ago
      Many things seem feasible with competitive object storage pricing. It still needs a little bit of local caching to reduce read requests and origin abuse.

      I think rclone mount can do the same thing with its chunked reads + cache; I wonder what the memory overhead for the process is.

  • ko_pivot2 days ago
    I’m a fan of all these projects that are leveraging S3 to implement high availability / high scalability for traditionally sensitive stateful workloads.

    Local caching is a key element of such architectures, otherwise S3 is too slow and expensive to query.
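    (As a generic illustration of that read-through pattern, assuming boto3; the cache directory and names are made up:)

        import os

        import boto3

        s3 = boto3.client("s3")
        CACHE_DIR = "/var/cache/segments"  # hypothetical local cache directory

        def read_segment(bucket: str, key: str) -> bytes:
            """Read-through cache: serve hot segments from local disk and only
            hit S3 (slower, billed per request) on a cache miss."""
            path = os.path.join(CACHE_DIR, key.replace("/", "_"))
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return f.read()
            data = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
            return data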

    • candiddevmike2 days ago
      The write speed is going to be horrendous IME, and how do you handle performant indexing...
  • huntaub2 days ago
    This is a super cool project, and I think that we will continue to see more and more applications move towards an "on S3" stateless architecture. That's part of the reason why we are building Regatta [1]. We are trying to enable folks who are running software that needs file system semantics (like Lucene) to get super-fast, NVMe-like latencies on data that's really in S3. While this is awesome, I worry about all of the applications that don't have someone to rewrite a bunch of layers to work on S3. That's where we come in.

    [1] https://regattastorage.com

  • parhamna day ago
    Stateless S3 apps have much more appeal given the existence of Cloudflare R2 -- bandwidth is free and GetObject is $0.36 per million requests.
  • drastic_fred21 hours ago
    In a world where recommendations have outpaced full-text search (95%/5%), cost reduction is essential.
  • ctxcode2 days ago
    Sounds like this is going to cost a lot of money (more than it should).
  • stroupwaffle2 days ago
    There’s no such thing as stateless, and there’s no such thing as serverless.

    The universe is a stateful organism in constant flux.

    Put another way: brushing-it-under-the-rug as a service.

    • zdragnar2 days ago
      There is no spoon.

      Put another way: serverless and stateless don't mean what you think they mean.

      • MeteorMarc2 days ago
        I feel clueless
        • stroupwaffle2 days ago
          It’s not the spoon that bends, it’s the world around it.
        • ctxcode2 days ago
          Serverless just means that a hosting company routes your domain to one or more servers that the hosting company owns and puts your code on, and that the hosting company can spin up more or fewer servers based on traffic. TL;DR: serverless uses many, many servers, just none that you own.
          • zdragnar2 days ago
            More specifically: no instances that you maintain or manage. You don't care which machine your code runs on, or even whether all your code is on the same machine.

            Compute availability is lumped into one gigantic pool, and all of the concerns below the execution of your code are managed for you.

  • cynicalsecurity2 days ago
    This is a great way to waste investors' money.