176 points by burntsushi 2 days ago | 8 comments
  • eterm a day ago
    I'd love to see this put through its paces with the following:

    I have a strategy for benchmarking such data-structures, I call it the "birthday benchmark".

    Unfortunately I'm not proficient enough in Rust to have put it into practice there, but the strategy is this:

    Generate a stream of random bytes; pre-generating it is advised so the RNG doesn't dominate the benchmark.

    Consume the stream in blocks of n bytes.

    Try to find a duplicated block.

    Given the "birthday paradox / birthday attack", this is actually much quicker than it might first appear. Simple single-thread basic hashset can find duplicates at a 6-byte length match in around 3 seconds on my hardware, and a specialised data-structure takes this to less than a second.

    A good concurrent hashtable should improve that even further, because this ought to be a very parallelizable problem - especially if you aren't constrained to finding the first duplicate in the stream but allow yourself to find any duplicate, don't need to keep track of both sides of the pair, and are content with simply knowing you have found one.
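
    A minimal single-threaded sketch of the scan in Rust (the function name and the pre-generated stream are illustrative):

    ```rust
    use std::collections::HashSet;

    /// Scan a pre-generated random byte stream in fixed-size blocks and
    /// return the index of the first duplicated block, if any.
    fn first_duplicate(stream: &[u8], block_len: usize) -> Option<usize> {
        let mut seen = HashSet::new();
        for (i, block) in stream.chunks_exact(block_len).enumerate() {
            // `insert` returns false when the block was already present.
            if !seen.insert(block) {
                return Some(i);
            }
        }
        None
    }
    ```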

    • vlmutolo a day ago
      > Generate a stream of random bytes, pre-generation is advised to avoid dominating the benchmark.

      Current PRNGs are pretty fast. The Xoroshiro RNG "shootout" benchmark [0] lists some self-reported speeds. They claim 8+ GB/s for even their slowest, highest-quality PRNG. The general-purpose one they recommend is 10GB/s, and 32GB/s when vectorized.

      The vectorized versions get close to current DRAM speeds. I think I'd prefer that over reading from a giant table, given that the table reads will have significant implications for caching and disrupt that aspect of the benchmark.

      [0]: https://prng.di.unimi.it/#shootout
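
      As a rough sketch in Rust (assuming the rand_core and rand_xoshiro crates), generating the stream on the fly rather than pre-generating it:

      ```rust
      use rand_core::{RngCore, SeedableRng};
      use rand_xoshiro::Xoshiro256PlusPlus;

      // Fill the benchmark's input buffer with a fast xoshiro PRNG
      // instead of reading from a pre-generated table.
      fn generate_stream(len: usize, seed: u64) -> Vec<u8> {
          let mut rng = Xoshiro256PlusPlus::seed_from_u64(seed);
          let mut buf = vec![0u8; len];
          rng.fill_bytes(&mut buf);
          buf
      }
      ```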

      • eterm a day ago
        Thanks for that suggestion, I shall explore faster PRNGs and vectorization too then.

        My domain is C#, where it's perhaps unsurprising that it's a lot faster to have an array in memory than to go via the default Random, which I believe is Xoroshiro (at least in .NET 6+). It can certainly generate data quickly with Random.NextBytes(), but repeated calls to Random.NextInt64() are much slower.

        Another issue I found with generating random numbers while multi-threading is thread safety / synchronisation. If each thread does its own random generation then it's difficult to compare results between using 1 thread and 2 threads, because it's no longer the same single stream of random data measured across each benchmark, so the meta-randomness of how early you "should" hit a dupe becomes a factor.

        Having pre-generated numbers and a single thread responsible for handing them out makes it easier: you can either use a thread-safe counter, or have each thread step through the buffer by a fixed stride plus a per-thread offset. The first "real" dupe is still in the same place in the original pre-generated data.
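
        A minimal sketch of the thread-safe-counter variant in Rust (names are illustrative):

        ```rust
        use std::sync::atomic::{AtomicUsize, Ordering};

        // Hand out blocks from one shared pre-generated stream, so every
        // thread count consumes the same random data in the same order.
        fn next_block<'a>(
            stream: &'a [u8],
            cursor: &AtomicUsize,
            block_len: usize,
        ) -> Option<&'a [u8]> {
            // fetch_add gives each caller a distinct block index.
            let i = cursor.fetch_add(1, Ordering::Relaxed);
            stream.get(i * block_len..(i + 1) * block_len)
        }
        ```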

        You can then compare the incremental gains. For example, each thread having its own hashtable would get you only logarithmic gains (I think that's the right way to express it; essentially it's gaining just by having First Dupe = Min(First Dupe across N threads)), versus an actual thread-safe concurrent hash table, where you should[1] still find the first dupe, but sped up by a greater factor than naively sharing the work but not the state.

        I recognise there are potential memory caching issues at play with the pregeneration approach, but for larger bit counts the actual work should hopefully dominate, particularly since look-up values aren't revisited by design.

        [1] "Should", because there's a small chance in many algorithms that due to concurrency and race conditions you actually miss the duplicate. Either multiple runs should be tried looking for bi-modal data, or accept that the next dupe shouldn't be so long after the first, and be happy the speed-up is greater than the slow-down caused by very occassional race condition misses. The chance of a such a race condition is absolutely minuscule at the 48bit level, if we assume we are using say, 8 threads, and assume for a race condition to occur, the same 48-number would have to be concurrently being handled/generated by 2 different threads.

        • o11c a day ago
          > If each thread is doing it's own random generation then it's difficult to compare results between using 1 thread and 2 threads, because it's no longer the same single stream of random data

          Most modern RNGs support a "jump" operation that takes care of this; PCG has the best standard API I've ever seen. Alternatively, with any RNG you can interleave (call `next` numthreads times before selecting a number) but that's slower.
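
          For example, with the rand_xoshiro crate (an assumption; any generator with a jump function works the same way), each thread can get a non-overlapping substream:

          ```rust
          use rand_core::SeedableRng;
          use rand_xoshiro::Xoshiro256PlusPlus;

          // Derive one independent substream per thread from a single seed;
          // jump() advances the state by 2^128 steps, so streams don't overlap.
          fn per_thread_rngs(seed: u64, threads: usize) -> Vec<Xoshiro256PlusPlus> {
              let mut master = Xoshiro256PlusPlus::seed_from_u64(seed);
              (0..threads)
                  .map(|_| {
                      let rng = master.clone();
                      master.jump();
                      rng
                  })
                  .collect()
          }
          ```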

        • In my experience, vectorization is very simple: just running multiple instances of Xoroshiro (or one of the weaker variants) inside a vector works quite well. Fitting the resulting peg into the hole of a preexisting API is difficult, but that's always a problem.
        • Iwan-Zotow a day ago
          > which is I believe is Xoroshiro ( at least in .net 6+ )

          Xoshiro, I believe, is in .NET 6 - a close cousin.

          https://blogs.siliconorchid.com/post/coding-inspiration/rand...

          • eterm a day ago
            Thanks for that correction, I hadn't appreciated the subtle difference there.
    • bloppe a day ago
      This is a write-heavy workload. Papaya is optimized for read-heavy workloads.

      It's an interesting benchmark, but the kind of people who would want to use Papaya probably wouldn't be very interested in it.

    • winwang a day ago
      Do you have other examples of interesting workloads? Benchmarking is difficult, lol.
  • _nalply a day ago
    Also have a look at concurrent-map. It is a concurrent BTreeMap, which means it maintains key order (not insertion order; the keys implement Ord).

    https://lib.rs/crates/concurrent-map

    I am using it for a transactional in-memory key-value store experiment.

    • ibraheemdev a day ago
      Looks very interesting, but seems to serve a pretty different use case:

      > This is an ordered data structure, and supports very high throughput iteration over lexicographically sorted ranges of values. If you are looking for simple point operation performance, you may find a better option among one of the many concurrent hashmap implementations that are floating around. Pay for what you actually use :)

    • If you just need a concurrent key-value data structure, a hash map should be much faster and scale better than something that keeps a sorted order.
      • _nalply a day ago
        Different use cases. An ordered map shines if you need some ordering of keys. One use case is an ordered index, for example if you have an age index and need to list only adult people. A hashmap won't work here, I think.

        So just use a DBMS - but if you don't want to, perhaps this ordered concurrent map.
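
        The age-index idea, sketched in Rust with std's BTreeMap for brevity (concurrent-map advertises the same kind of sorted range iteration):

        ```rust
        use std::collections::BTreeMap;

        // An ordered index keyed by age supports range scans a hash map can't:
        // list everyone with age >= 18 without visiting the whole map.
        fn adults(index: &BTreeMap<u32, Vec<String>>) -> impl Iterator<Item = &String> {
            index.range(18..).flat_map(|(_age, names)| names.iter())
        }
        ```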

        • CyberDildonics 15 hours ago
          Did you think my comment said that there is no use for an ordered map? You seem to be listing basic uses for a fundamental data structure.

          You said you were using it for a "key-value store". If all you need is a key value lookup, all you need is a hash map.

          When it comes to concurrency the difference is even more pronounced. Hash maps allow lots of concurrent access because threads work on different parts of memory that are mostly independent of each other. A sorted data structure has to move things around, which means more atomic swapping, and therefore more chances for collisions or race conditions where other threads have to retry their compare-and-swaps, or back off and retry parts of their operation.
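
          A small Rust sketch of that retry pattern:

          ```rust
          use std::sync::atomic::{AtomicU64, Ordering};

          // Lock-free update: if the CAS fails, another thread won the race,
          // so reload the current value and try again.
          fn fetch_update(cell: &AtomicU64, f: impl Fn(u64) -> u64) -> u64 {
              let mut current = cell.load(Ordering::Relaxed);
              loop {
                  match cell.compare_exchange_weak(
                      current,
                      f(current),
                      Ordering::AcqRel,
                      Ordering::Relaxed,
                  ) {
                      Ok(previous) => return previous,
                      Err(actual) => current = actual, // lost the race; retry
                  }
              }
          }
          ```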

  • rurban 2 days ago
    This looks pretty good in terms of its tradeoffs and the tricks used. It makes good use of metadata, as in https://greg7mdp.github.io/parallel-hashmap/
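
    A sketch of the SwissTable-style metadata trick parallel-hashmap uses (the exact bit split here is illustrative): each hash is split into a probe position and a small tag kept in a dense side array, so probing scans tags (often with SIMD) before touching full keys.

    ```rust
    // Split a 64-bit hash into a probe position and a 7-bit tag.
    // Lookups compare tags in the metadata array and only inspect
    // the full key on a tag match.
    fn split_hash(hash: u64, num_buckets: usize) -> (usize, u8) {
        let index = (hash >> 7) as usize % num_buckets; // position bits
        let tag = (hash & 0x7f) as u8; // tag stored in metadata
        (index, tag)
    }
    ```
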
  • mareko a day ago
    What a fascinating and detailed post. Thanks so much for sharing!

    I wonder whether it's possible to achieve similar performance but with a simpler design that uses scalable reader-writer locks. They can be made a lot more efficient than what's commonly out there using the C-SNZI lock-free data structure:

    https://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.p...

  • winwang a day ago
    Are there "batch"-esque workloads where we want extreme throughput, but can tolerate a large latency (10-100 micros)?
  • mrkeen 15 hours ago
    What makes it "concurrent" and "feature-complete" ?

    Can I reason about it in a concurrent scenario the same way I'd reason about it in a single-threaded scenario?

    • tialaramex 15 hours ago
      For "feature complete" I assume they mean that they've implemented all the stable features of the similar container in the standard library, I did not verify this but it seems like a reasonable understanding.

      You can reason about its safety/correctness the same way as the stdlib container types, and this will be true for all of (safe) Rust.

      The performance considerations have a new dimension for multi-threading though. In single threading you don't need to care whether doing similar operations "at the same time" could affect the performance of what you're doing, but in a concurrent system that's a consideration. Accordingly you would need to benchmark this carefully for your specific usage - just because it's correct doesn't mean it's fast or even acceptable.

    • dragontamer 13 hours ago
      Concurrent data structures IMO are those that take advantage of their concurrency and (subtly) alter their behavior for +performance above and beyond what is possible with simple locks.

      For example: count/size() can be imprecise in the realm of concurrency. There's no need to count the hash table exactly.

      It's impossible to get the size of this table exactly, because each shard has at best its own atomic counter and no shared lock exists between them. Adding up the individual counts (even if implemented as seq-cst atomics) is incorrect, as an atomic is only globally consistent within its own atomic operation, and there's no way to gather all the shards' sizes atomically without a global lock.

      Feature complete would similarly be: all the features you'd expect, except those annoying low performance ones.
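
      A sketch of why a sharded size can only be approximate:

      ```rust
      use std::sync::atomic::{AtomicUsize, Ordering};

      // Per-shard counters: summing them reads each shard at a different
      // instant, so the total is only approximate while writers are active.
      struct ShardedLen {
          shards: Vec<AtomicUsize>,
      }

      impl ShardedLen {
          fn approx_len(&self) -> usize {
              self.shards.iter().map(|s| s.load(Ordering::Relaxed)).sum()
          }
      }
      ```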

    • ibraheemdev 11 hours ago
      It's quite common for concurrent algorithms to only implement a subset of operations, for example forgoing removal or iteration. It's also common to put limitations on the data structure, such as limiting keys and values to 64 bits. Papaya being feature-complete means it has none of these limitations compared to std::collections::HashMap.
  • C++ STL/Boost and/or Python probably already have such a beast that could be adapted/cannibalized for some use-cases.
  • gleenn a day ago
    "In some ways, they are the holy grail of concurrent data structures. On the other hand, a concurrent hash table is an inelegant blob of shared mutable data, often a marker of a poorly architectured program."

    I find the last bit particularly objectionable. If you're in some language slinging objects around all day, then sticking in an untyped, bespoke substitute for an object probably isn't the right move. But on the long path down my programming career, I've come to recognize that objects make so many things harder and more confusing than necessary. I definitely subscribe to Rich Hickey's view that 1 data structure with 100 functions that operate on it is far easier to work with and understand than 10 data structures with 10 functions each. Clojure is an absolute joy to work with and reason about, and it only gets better the faster you get out of typed-object land. Hashmaps are the purest abstraction over an associative data structure, and I will take one over a pile of classes with brittle, snowflake interfaces any day.

    • magicalhippo a day ago
      > classes with brittle, snowflake interfaces

      That is exactly the opposite of my experience, to the point where it feels like it's from a different universe.

      Perhaps it's a difference in domain, but every time I have to use Python or JS I hate the lack of explicit types with defined interfaces, as discoverability during coding is zero. And you end up with code where, after dependency updates or similar, you never know if it'll crash until you run it.

      • froh a day ago
        in my experience there's always so much more to data structures than you can shoehorn into member functions.

        I agree type hints are a godsend for discoverability, but please, show me all operations this type can be a parameter for.

        I want "discoverability" on the argument lists of all libraries.

        and I find LSP with type hints useful for that.

        and multimethods! can I have multiple dispatch? but first-parameter-based dispatch ("oop") only? meh.