However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007) and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m))^k = 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
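For reference, the approximation above is easy to check in a few lines (parameters taken from the comment):

```python
# Standard approximation of a Bloom filter's false-positive rate:
# (1 - e^(-k*n/m))^k, with the parameters quoted above.
import math

k, m, n = 4, 10007, 1000   # hash functions, bits per filter, records per filter
fp_rate = (1 - math.exp(-k * n / m)) ** k
print(f"{fp_rate:.4f}")    # 0.0118, i.e. about 1.18%
```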
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like how CPU branch misprediction affects performance but not correctness.
A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of about 5 seconds, or, for larger queries, reducing a 30-minute wait to about 1 minute.
This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Bloom filter indexing seems like a great fit if you ever need to do substring searches in a context like that, and for log searching in general. I haven't dug into which packages support it, but it looks like at least ClickHouse does: https://clickhouse.com/docs/optimize/skipping-indexes#bloom-...
Several of us were working hard to move everything that made any sense to be in Prometheus into Prometheus instead of Splunk.
Notably, any time we had a production issue where it was unclear which team was responsible, Splunk became the bottleneck because we immediately started exceeding quotas.
I'm not sure what you mean by exceed quotas because Splunk is normally licensed on GB ingested per day. This can lead to bitter fights between teams over how this is allocated.
The good thing about this license model is that you can use as much hardware as you want for no extra license cost.
That sounds like self-hosting, which is not the only product they offer. But you still have hardware that can only run so many queries at once and then starts queuing any additional requests, yeah? Once you had a dozen people on a call, it went to shit. We only occasionally ran into problems like this with Graphite, but you need a lot of people looking at a very large dashboard before you start feeling refresh delays.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, etc.). You could of course do 3- or 4-letter combinations too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on them.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw and om.raw.
The index is really hard to update, but adding new records is easy: new records are first treated as false positives until we have enough bits to push a byte to the end of each raw file.
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they contained only unique characters.
Characters (or other bytes) could be merged, like s=z, k=c, x=y or i=e.
Calculating which combination is best for a static data set was too hard for me. One could look at the ratio to see whether a letter combination is worth having.
But it works, and loading a handful of files and doing an AND is amazingly fast.
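The scheme above is easy to sketch. Here is a toy in-memory version (my own illustrative rendition; the original keeps one .raw file per bigram on disk, with bit i set when record i contains that bigram):

```python
# Bigram bitmap index: one bitmask per 2-letter combination, one bit
# per record. A query ANDs the masks of its bigrams; surviving bits
# are candidate records (false positives are possible, so verify).
from collections import defaultdict

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(records):
    index = defaultdict(int)            # bigram -> bitmask over records
    for rec_no, text in enumerate(records):
        for bg in bigrams(text):
            index[bg] |= 1 << rec_no
    return index

def candidates(index, query):
    mask = -1                           # start with all bits set
    for bg in bigrams(query):
        mask &= index.get(bg, 0)
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

records = ["bloom filters", "bloomberg", "gloomy room"]
idx = build_index(records)
print(candidates(idx, "bloom"))         # [0, 1] - record 2 lacks "bl"
```

Note the false-positive behaviour: a record can contain every bigram of the query without containing the query as a contiguous substring, which is why candidates still need verification.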
My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.
I invented a very fast string search algorithm based on Bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
If you want guaranteed linear performance in the worst case too, then LinearHashchain is the one to use. It is slightly more complex to implement as it builds in a KMP style verifier to make it linear (so almost two search algorithms are needed). It is actually about as fast as HashChain generally in the average case, so you don't lose out.
The others are either experimental or quite niche and not suitable for most purposes. SentinelHashchain is actually the fastest, but relies on being able to add a copy of the pattern at the end of the search text. This won't be possible in most search contexts, unless you control all memory allocations.
So I'd start with HashChain, and maybe play with the linear version later - most of it is the same; it just needs a bit more added.
It all started when Stavros's blog was circulated on Hacker News! The way we approached the search part was by using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - a paper by Saar Cohen and Yossi Matias from the early 2000s; it's basically an iteration on counting Bloom filters. We used the minimal selection and minimal increase algorithms from the paper for insertion and ranking of results.
I wrote a blog on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
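The minimal-increase rule is easy to sketch. This is a toy version (the hash scheme and sizes are my own illustrative choices, not from the paper): each item maps to k counters, an insert bumps only the counters currently holding the minimum, and a query reports the minimum of the k counters.

```python
# Minimal-increase insertion for a spectral Bloom filter: because only
# the minimal counters are incremented, an item's estimate can only be
# inflated by collisions (over-estimates possible, under-estimates not).
import hashlib

class SpectralBloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _slots(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        slots = self._slots(item)
        low = min(self.counters[s] for s in slots)
        for s in slots:
            if self.counters[s] == low:
                self.counters[s] += 1

    def estimate(self, item):
        return min(self.counters[s] for s in self._slots(item))

sbf = SpectralBloomFilter()
for word in ["bloom", "bloom", "filter"]:
    sbf.insert(word)
print(sbf.estimate("bloom"), sbf.estimate("filter"))
```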
I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.
Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.
TFA's mistake is using bloom filters -instead- of an inverted index rather than on top of one. The amortization patterns stack; they don't compete.
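A minimal sketch of that two-level layout, under my own assumptions (chunk size, Bloom parameters and the double-hashing scheme are all illustrative): the inverted index is split into chunks of sorted tags, and each chunk carries a small Bloom filter so a lookup can skip chunks that cannot contain the tag.

```python
# Inverted index (tag -> item ids) in chunks, with one Bloom filter
# per chunk for chunk-skipping. A Bloom miss is definitive; a Bloom
# hit is only "maybe", so the chunk itself is still checked.
import hashlib

class Bloom:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._idx(key))

def build_chunks(tag_to_items, chunk_size=2):
    tags = sorted(tag_to_items)
    chunks = []
    for i in range(0, len(tags), chunk_size):
        part = {t: tag_to_items[t] for t in tags[i:i + chunk_size]}
        bloom = Bloom()
        for t in part:
            bloom.add(t)
        chunks.append((bloom, part))
    return chunks

def lookup(chunks, tag):
    for bloom, part in chunks:
        if bloom.might_contain(tag) and tag in part:  # verify: FPs possible
            return part[tag]
    return []

chunks = build_chunks({"db": [1, 3], "rust": [2], "search": [1, 2], "web": [3]})
print(lookup(chunks, "search"))  # [1, 2]
```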
It's a pretty neat algorithm from a 2019 paper, for the application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of Bloom filters built for documents and combine them into a single structure that tells you which docs a query maps to. Like an inverted index meets a Bloom filter.
I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).
There's also code online here: https://github.com/bingmann/cobs
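As I understand the bit-sliced idea, it can be sketched like this (a toy rendition with made-up sizes and hashing, not the actual COBS layout): each document gets a Bloom filter of m bits, and the filters are stored "sideways" as m rows, where row j is a bitvector with one bit per document. A term lookup hashes to a few rows and ANDs them; the set bits name candidate documents.

```python
# Bit-sliced signature index: rows[j] has bit d set if document d's
# Bloom filter has bit j set. Querying a term ANDs its k rows.
import hashlib

M, K = 64, 3  # illustrative filter size and hash count

def positions(term):
    h = hashlib.sha256(term.encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % M for i in range(K)]

def build(docs):
    rows = [0] * M
    for d, terms in enumerate(docs):
        for term in terms:
            for j in positions(term):
                rows[j] |= 1 << d
    return rows

def query(rows, term):
    mask = -1
    for j in positions(term):
        mask &= rows[j]
    return [d for d in range(mask.bit_length()) if mask >> d & 1]

docs = [{"acgt", "tgca"}, {"acgt"}, {"ttaa"}]
rows = build(docs)
print(query(rows, "acgt"))  # candidate docs; false positives possible
```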
You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.
[1] https://en.wikipedia.org/wiki/Cuckoo_filter
[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...
Cuckoo filters claim 70% of the size of Bloom for the same error rate, and the space is logarithmic in the error rate. Looks like about 6.6 bits per record versus 9.56 bits for Bloom at 1%. But at a 0.5% error rate a cuckoo filter is 7.6 bpr. In fact you can get to about a 0.13% error rate for a cuckoo filter only a hair larger than the equivalent Bloom filter (2^9.567 ≈ 758.5).
My fairly niche use case for these kinds of data structures was hardware firewalls running mostly on SRAM, which needed a sub one-in-a-billion false positive rate.
I thought that was going to be a link to Google.com
It is not hard to see how you could start asking the front desk to track every obscure attribute, and expect that to fall over for various reasons.
I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find the practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
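A toy version of that "did we already process this event?" gate (names and sizes are made up): a Bloom filter sits in front of the expensive store. A negative answer is definitive and never touches the database; a positive answer is only "maybe", so we fall through and check for real.

```python
# Bloom filter as a pre-check before the database: definite misses
# skip the store entirely, possible hits are verified for real.
import hashlib

class Bloom:
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._idx(key))

seen = Bloom()
db_hits = 0

def already_processed(event_id, db):
    global db_hits
    if not seen.might_contain(event_id):
        return False          # definite no: database not touched
    db_hits += 1              # maybe: verify against the real store
    return event_id in db

db = set()
for ev in ["e1", "e2", "e1"]:
    if not already_processed(ev, db):
        db.add(ev)
        seen.add(ev)
print(db_hits)                # only the repeated "e1" forced a db check
```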
This asymmetry works great for I/O-bound workloads (skip indexes) but fails for TFA's approach, where every document needs its own filter.
In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.
100% agree when it works in your favor. We use it for exactly that situation where a non-zero false positive rate is fine and you can choose how much memory to devote to getting it closer to zero.
There have been a couple of times, though, where we've needed a "keep everything not on this list" operation, and unfortunately Bloom filters don't work well for that situation. There are alternatives, but none as elegantly compact as Bloom filters.