However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007) and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m))^k = 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
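For reference, the approximation above is easy to check in a few lines (parameters taken from the comment):

```python
# Standard approximation of a Bloom filter's false-positive rate:
# (1 - e^(-k*n/m))^k, with the parameters quoted above.
import math

k, m, n = 4, 10007, 1000   # hash functions, bits per filter, records per filter
fp_rate = (1 - math.exp(-k * n / m)) ** k
print(f"{fp_rate:.4f}")    # 0.0118, i.e. about 1.18%
```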
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like how CPU branch misprediction affects performance but not correctness.
A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of about 5 seconds, or, for larger queries, reducing a 30-minute wait to about 1 minute.
This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Bloom filter indexing seems like a great fit if you ever need to do substring searches in a context like that, and for log searching in general. I haven't dug into which packages support it, but it looks like at least ClickHouse does: https://clickhouse.com/docs/optimize/skipping-indexes#bloom-...
Several of us were working hard to move everything that made any sense to be in Prometheus into Prometheus instead of Splunk.
Notably, any time we had a production issue where it was unclear which team was responsible, Splunk became the bottleneck because we immediately started exceeding quotas.
I'm not sure what you mean by exceed quotas because Splunk is normally licensed on GB ingested per day. This can lead to bitter fights between teams over how this is allocated.
The good thing about this license model is that you can use as much hardware as you want for no extra license cost.
That sounds like self-hosting, which is not the only product they offer. But you still have hardware that can only run so many queries at once and then starts queuing any additional requests, yeah? Once you had a dozen people on a call, it went to shit. We only occasionally ran into problems like this with Graphite, but you need a lot of people looking at a very large dashboard before you start feeling refresh delays.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, etc.). You could of course do 3- or 4-letter combinations too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on them.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw and om.raw.
The index is really hard to update, but adding new records is easy: new records are first treated as false positives until we have enough bits to push a byte to the end of each raw file.
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they contained only unique characters.
Characters (or other bytes) could be merged, like s=z, k=c, x=y or i=e.
Calculating which combination is best for a static data set was too hard for me. One could look at the ratio to see whether a letter combination is worth having.
But it works, and loading a handful of files and doing an AND is amazingly fast.
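The scheme above is easy to sketch. Here is a toy in-memory version (my own illustrative rendition; the original keeps one .raw file per bigram on disk, with bit i set when record i contains that bigram):

```python
# Bigram bitmap index: one bitmask per 2-letter combination, one bit
# per record. A query ANDs the masks of its bigrams; surviving bits
# are candidate records (false positives are possible, so verify).
from collections import defaultdict

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(records):
    index = defaultdict(int)            # bigram -> bitmask over records
    for rec_no, text in enumerate(records):
        for bg in bigrams(text):
            index[bg] |= 1 << rec_no
    return index

def candidates(index, query):
    mask = -1                           # start with all bits set
    for bg in bigrams(query):
        mask &= index.get(bg, 0)
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

records = ["bloom filters", "bloomberg", "gloomy room"]
idx = build_index(records)
print(candidates(idx, "bloom"))         # [0, 1] - record 2 lacks "bl"
```

Note the false-positive behaviour: a record can contain every bigram of the query without containing the query as a contiguous substring, which is why candidates still need verification.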
My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.
I invented a very fast string search algorithm based on Bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
If you want guaranteed linear performance in the worst case too, then LinearHashchain is the one to use. It is slightly more complex to implement as it builds in a KMP style verifier to make it linear (so almost two search algorithms are needed). It is actually about as fast as HashChain generally in the average case, so you don't lose out.
The others are either experimental or quite niche and not suitable for most purposes. SentinelHashchain is actually the fastest, but relies on being able to add a copy of the pattern at the end of the search text. This won't be possible in most search contexts, unless you control all memory allocations.
So I'd start with HashChain, and maybe play with the linear version later - most of it is the same; it just needs a bit more added.
It all started when Stavros's blog was circulated on Hacker News! The way we approached the search part was by using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - a paper by Saar Cohen and Yossi Matias from the early 2000s; it's basically an iteration on counting Bloom filters. We used the minimal selection and minimal increase algorithms from the paper for insertion and ranking of results.
I wrote a blog on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
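The minimal-increase rule is easy to sketch. This is a toy version (the hash scheme and sizes are my own illustrative choices, not from the paper): each item maps to k counters, an insert bumps only the counters currently holding the minimum, and a query reports the minimum of the k counters.

```python
# Minimal-increase insertion for a spectral Bloom filter: because only
# the minimal counters are incremented, an item's estimate can only be
# inflated by collisions (over-estimates possible, under-estimates not).
import hashlib

class SpectralBloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _slots(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        slots = self._slots(item)
        low = min(self.counters[s] for s in slots)
        for s in slots:
            if self.counters[s] == low:
                self.counters[s] += 1

    def estimate(self, item):
        return min(self.counters[s] for s in self._slots(item))

sbf = SpectralBloomFilter()
for word in ["bloom", "bloom", "filter"]:
    sbf.insert(word)
print(sbf.estimate("bloom"), sbf.estimate("filter"))
```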
I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.
Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.
TFA's mistake is using bloom filters -instead- of an inverted index rather than on top of one. The amortization patterns stack; they don't compete.
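A minimal sketch of that two-level layout, under my own assumptions (chunk size, Bloom parameters and the double-hashing scheme are all illustrative): the inverted index is split into chunks of sorted tags, and each chunk carries a small Bloom filter so a lookup can skip chunks that cannot contain the tag.

```python
# Inverted index (tag -> item ids) in chunks, with one Bloom filter
# per chunk for chunk-skipping. A Bloom miss is definitive; a Bloom
# hit is only "maybe", so the chunk itself is still checked.
import hashlib

class Bloom:
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._idx(key))

def build_chunks(tag_to_items, chunk_size=2):
    tags = sorted(tag_to_items)
    chunks = []
    for i in range(0, len(tags), chunk_size):
        part = {t: tag_to_items[t] for t in tags[i:i + chunk_size]}
        bloom = Bloom()
        for t in part:
            bloom.add(t)
        chunks.append((bloom, part))
    return chunks

def lookup(chunks, tag):
    for bloom, part in chunks:
        if bloom.might_contain(tag) and tag in part:  # verify: FPs possible
            return part[tag]
    return []

chunks = build_chunks({"db": [1, 3], "rust": [2], "search": [1, 2], "web": [3]})
print(lookup(chunks, "search"))  # [1, 2]
```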
It's a pretty neat algorithm from a 2019 paper, for the application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of Bloom filters built for documents and combine them into a single structure that tells you which docs a query maps to. Like an inverted index meets a Bloom filter.
I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).
There's also code online here: https://github.com/bingmann/cobs
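As I understand the bit-sliced idea, it can be sketched like this (a toy rendition with made-up sizes and hashing, not the actual COBS layout): each document gets a Bloom filter of m bits, and the filters are stored "sideways" as m rows, where row j is a bitvector with one bit per document. A term lookup hashes to a few rows and ANDs them; the set bits name candidate documents.

```python
# Bit-sliced signature index: rows[j] has bit d set if document d's
# Bloom filter has bit j set. Querying a term ANDs its k rows.
import hashlib

M, K = 64, 3  # illustrative filter size and hash count

def positions(term):
    h = hashlib.sha256(term.encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % M for i in range(K)]

def build(docs):
    rows = [0] * M
    for d, terms in enumerate(docs):
        for term in terms:
            for j in positions(term):
                rows[j] |= 1 << d
    return rows

def query(rows, term):
    mask = -1
    for j in positions(term):
        mask &= rows[j]
    return [d for d in range(mask.bit_length()) if mask >> d & 1]

docs = [{"acgt", "tgca"}, {"acgt"}, {"ttaa"}]
rows = build(docs)
print(query(rows, "acgt"))  # candidate docs; false positives possible
```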
You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.
[1] https://en.wikipedia.org/wiki/Cuckoo_filter
[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...
Cuckoo filters claim 70% of the size of Bloom for the same error rate, and the space is logarithmic in the error rate. Looks like about 6.6 bits per record versus 9.56 bits for Bloom at 1%. But at a 0.5% error rate a cuckoo filter is 7.6 bpr. In fact you can get to about a 0.13% error rate for a cuckoo filter only a hair larger than the equivalent Bloom filter (2^9.567 ≈ 758.5).
My fairly niche use case for these kinds of data structures was hardware firewalls running mostly on SRAM, which needed a sub one-in-a-billion false positive rate.
I thought that was going to be a link to Google.com
It is not hard to see how you could start asking the front desk to track every obscure attribute, and expect that to fall over for various reasons.
I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find the practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
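A toy version of that "did we already process this event?" gate (names and sizes are made up): a Bloom filter sits in front of the expensive store. A negative answer is definitive and never touches the database; a positive answer is only "maybe", so we fall through and check for real.

```python
# Bloom filter as a pre-check before the database: definite misses
# skip the store entirely, possible hits are verified for real.
import hashlib

class Bloom:
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _idx(self, key):
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for i in self._idx(key):
            self.bits |= 1 << i

    def might_contain(self, key):
        return all(self.bits >> i & 1 for i in self._idx(key))

seen = Bloom()
db_hits = 0

def already_processed(event_id, db):
    global db_hits
    if not seen.might_contain(event_id):
        return False          # definite no: database not touched
    db_hits += 1              # maybe: verify against the real store
    return event_id in db

db = set()
for ev in ["e1", "e2", "e1"]:
    if not already_processed(ev, db):
        db.add(ev)
        seen.add(ev)
print(db_hits)                # only the repeated "e1" forced a db check
```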
This asymmetry works great for I/O-bound workloads (skip indexes) but fails for TFA's approach, where every document needs its own filter.
In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.
100% agree when it works in your favor. We use it for exactly that situation where a non-zero false positive rate is fine and you can choose how much memory to devote to getting it closer to zero.
There have been a couple of times, though, where we've needed a "keep everything not on this list" operation, and unfortunately Bloom filters don't work well for that situation. There are alternatives, but none as elegantly compact as Bloom filters.