I was looking at a blog post talking about the early days of Algolia’s search engine when I decided I wanted to try my own implementation of their algorithm.
https://github.com/iantbutler01/dart
dart = disk backed adaptive radix tree
The idea is that hot paths stay in memory and cold paths get shunted to disk, and since text search tends to have a pretty regular set of queries people make, you get a really nice trade-off between speed and storage.
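Roughly the shape of the hot/cold split, as a hypothetical sketch (not the actual dart code; every name here is made up):

    /* Hypothetical sketch of a hot/cold node reference; not taken from dart. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct art_node {
        /* ... key bytes, children, etc. ... */
        int dummy;
    } art_node;

    typedef struct {
        art_node *mem;       /* non-NULL while the node is resident ("hot") */
        uint64_t  disk_off;  /* where to find it once evicted ("cold") */
    } node_ref;

    /* Stand-in for reading a serialized node back from the index file. */
    static art_node *load_node_from_disk(uint64_t disk_off)
    {
        (void)disk_off;
        return calloc(1, sizeof(art_node));
    }

    /* Hot path: a pointer chase in RAM. Cold path: fault the node in from
       disk and keep it resident until some eviction pass demotes it again. */
    static art_node *resolve(node_ref *ref)
    {
        if (!ref->mem)
            ref->mem = load_node_from_disk(ref->disk_off);
        return ref->mem;
    }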
The way delegation locks work is by having threads delegate their critical sections instead of getting exclusive access to shared memory. Think of a client/server model, where the server executes other cores' critical sections. For machines with very high core counts, the result is that the shared memory remains valid in the delegate's cache, instead of each new lock holder getting cache misses and fetching memory from other cores (or worse, other sockets).
https://people.csail.mit.edu/shanir/publications/Flat%20Comb...
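The flat combining idea from the paper above, as a toy userspace sketch (assumptions: a fixed thread count, a shared counter standing in for the data structure, and made-up names throughout): each thread publishes its operation in a per-thread slot, and whichever thread grabs the combiner role applies everybody's pending operations, so the shared state stays in one core's cache instead of bouncing between every lock holder.

    /* Toy flat-combining counter; a sketch of the idea, not production code. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define OPS_PER_THREAD 100000

    struct slot {
        _Atomic int pending;   /* 1 = request published, 0 = served */
        int arg;               /* amount to add */
        long result;           /* counter value after our add */
    };

    static struct slot slots[NTHREADS];
    static atomic_flag combiner_lock = ATOMIC_FLAG_INIT;
    static long shared_counter;    /* only ever touched by the current combiner */

    /* The combiner serves every published request in one pass. */
    static void combine(void)
    {
        for (int i = 0; i < NTHREADS; i++) {
            if (atomic_load_explicit(&slots[i].pending, memory_order_acquire)) {
                shared_counter += slots[i].arg;
                slots[i].result = shared_counter;
                atomic_store_explicit(&slots[i].pending, 0, memory_order_release);
            }
        }
    }

    static long fc_add(int me, int amount)
    {
        slots[me].arg = amount;
        atomic_store_explicit(&slots[me].pending, 1, memory_order_release);

        /* Spin until some combiner (possibly us) has served our request. */
        while (atomic_load_explicit(&slots[me].pending, memory_order_acquire)) {
            if (!atomic_flag_test_and_set_explicit(&combiner_lock, memory_order_acquire)) {
                combine();
                atomic_flag_clear_explicit(&combiner_lock, memory_order_release);
            }
        }
        return slots[me].result;
    }

    static void *worker(void *p)
    {
        int me = (int)(long)p;
        for (int i = 0; i < OPS_PER_THREAD; i++)
            fc_add(me, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", shared_counter,
               NTHREADS * OPS_PER_THREAD);
        return 0;
    }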
> B-Trees won’t get obsolete
>
> <reasons>
Among the many reasons that b-trees aren't going away is that they are prefix indices. This means that an index on multiple columns (or on one column whose type's values can have prefixes, like strings) can be searched with just a prefix of the key, which enables skip-scan type optimizations and makes the index applicable to many more queries than a hash index. It also means that the actual table can be a covering index in which the primary key columns are the prefix for the rest -- look ma, no rowids!
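A toy illustration of the prefix-search property: with keys kept in sorted order (as in a b-tree), every key sharing a prefix forms one contiguous range you can binary-search to, which is exactly what a hash index can't give you. Plain C over a sorted array here, standing in for an index range scan.

    /* Prefix scan over sorted keys, the way an ordered index range scan works. */
    #include <stdio.h>
    #include <string.h>

    /* Pretend this is the key column of an index, kept in sorted order. */
    static const char *keys[] = {
        "alder", "apple", "apricot", "banana", "blueberry", "cherry",
    };
    static const int nkeys = sizeof(keys) / sizeof(keys[0]);

    /* First position whose key is >= the prefix (a lower_bound). */
    static int lower_bound_prefix(const char *prefix)
    {
        int lo = 0, hi = nkeys;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (strcmp(keys[mid], prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    int main(void)
    {
        const char *prefix = "ap";
        size_t plen = strlen(prefix);

        /* Matches form one contiguous range starting at the lower bound. */
        for (int i = lower_bound_prefix(prefix);
             i < nkeys && strncmp(keys[i], prefix, plen) == 0; i++)
            printf("%s\n", keys[i]);
        return 0;
    }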
When they get out of whack we try some really crazy algorithms that make sense when the inequalities invert, and then disappear when order is restored.

One of the first times networking got faster than local storage, John Ousterhout (of Raft fame) and his group at Berkeley introduced a distributed operating system called Sprite (which featured heavily in the Distributed Computing class I took) that could not only run processes on different machines but could also do live process migration - in the 1980s. NICs getting faster than disk also brought us in-memory KV stores (and their disk-backed variants), which haven't really gone away even with the introduction of NVMe storage.
B-trees fit well with the dominant equation, so they never really go away.
If we're talking caches and Redis, one thought about why they haven't gone away--
In practice I think sharing is part of what keeps them around as network services, not just access to a cluster's RAM. As a dev you reach for something shared and 'shaped' like a cache, and the in-RAM aspect may be vestigial but still comes with the package.
We use Redis when e.g. tracking stats on active client IPs to see which are up to shenanigans, maybe a common use case where you want cluster-wide stats. When caching, we may still want one DB query per expiry period app-wide, not one per webserver.
We would probably be fine 'caching' some stuff in the DB; Django's cache package has supported it for a while! I'm sure one of the disk-using successors of an in-RAM store could hit perf targets (DragonflyDB looks cool!). Just not a priority to change stuff that's working. (Corollary: I bet some folks unhappy with their Redis RAM needs are more than happy to try Dragonfly.)
I'm not sure if that's why, but in practice, Postgres doesn't care too much for hash indexes. It technically has them, but they're non-default, not recommended, previously poorly supported, etc. The default is btree (and serial pk).
Many (though not all) databases allow the base data to be a heap for those situations where a pure heap, or a heap with small supporting indexes, is more efficient: very wide data for instance, particularly in intermediate structures for ETL processes (a longer process acronym might better describe this, say, ELTEL).
Elaborate? How is it different from, say, MySQL?
EDIT: Oops, yes, I meant Height-Optimized Trie! Sorry about that.
Ideally I'd like to find a way to store a non-overlapping ordered list of intervals while being able to atomically (wrt other insertions) avoid inserting overlaps. This is part of a data structure tracking locked ranges, so being able to update them in parallel would be nice, since today it takes a lock over the whole tree while inserting new locked ranges.
Probably this would require something with next-value pointers. Maybe I could do some kind of scheme where to insert you have to first optimistically lock the predecessor along with the to-be-inserted node, and then insert the new node where it belongs?
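Something along those lines could look like the sketch below: a sorted linked list of half-open ranges with a mutex per node, hand-over-hand locking down to the predecessor, and the overlap check done while the predecessor is held. All names are hypothetical and removal is left out; a tree or skip list would scale better, but the neighbour check is the same.

    /* Hypothetical sketch: concurrent insert of non-overlapping [start,end). */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    struct range {
        uint64_t start, end;          /* half-open [start, end) */
        struct range *next;
        pthread_mutex_t lock;
    };

    /* Sentinel head simplifies "lock the predecessor": it always exists. */
    static struct range head = { 0, 0, NULL, PTHREAD_MUTEX_INITIALIZER };

    /* Insert [start,end); returns false if it would overlap an existing range. */
    static bool insert_range(uint64_t start, uint64_t end)
    {
        pthread_mutex_lock(&head.lock);
        struct range *pred = &head;

        /* Walk with lock coupling until pred is the last range starting before us. */
        while (pred->next && pred->next->start < start) {
            pthread_mutex_lock(&pred->next->lock);
            struct range *old = pred;
            pred = pred->next;
            pthread_mutex_unlock(&old->lock);
        }

        /* Overlap check only needs pred (locked) and its current successor,
           which cannot change while we hold pred's lock. */
        if ((pred != &head && pred->end > start) ||
            (pred->next && pred->next->start < end)) {
            pthread_mutex_unlock(&pred->lock);
            return false;
        }

        struct range *r = malloc(sizeof(*r));   /* error handling omitted */
        r->start = start;
        r->end = end;
        pthread_mutex_init(&r->lock, NULL);
        r->next = pred->next;
        pred->next = r;                 /* publish while pred is still locked */
        pthread_mutex_unlock(&pred->lock);
        return true;
    }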
It's the data structure used to track non-overlapping intervals in the Linux kernel's virtual memory subsystem.
If you don't mind sharing, what's your use case for such a data structure?
Having said that, both PostgreSQL and Microsoft SQL Server have an 8 KB page size [1], while the original SQLite had 1 KB pages before switching to 4 KB, while still supporting page sizes up to 64 KB. On top of that, Microsoft SQL Server operates internally on extents of 8 pages (64 KB) instead of single pages, but still pretends to use 8 KB pages [2].
In other words - no idea.
[1] https://www.postgresql.org/docs/current/storage-page-layout....
[2] https://learn.microsoft.com/en-us/sql/relational-databases/p...
[0] There's even some cool research about B-trees that have the layout depend on the probabilistic distribution of the lookups: https://dl.acm.org/doi/10.1145/3592980.3595316
Maybe that’s why?
That said, there is no universal page size one can choose. Some workloads will benefit from a smaller page size while others will benefit from a larger one, but 512K is not the size you will want to choose: read, write, and space amplification, CPU cache thrashing, etc. all come into play.
bcachefs uses shared/intent/exclusive locks on nodes, so we only have to take write locks during transaction commit when we're doing the final update, and percpu read locks for interior nodes (and others).
this means that there's basically zero lock contention on interior nodes, or even cacheline bouncing. lock contention only comes up when you've got different keys in the same leaf nodes.
I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?
Updates to the b-tree in memory require locks too :)
> I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/...
This is the code for btree node locks, as you can see there's a mode where the reader count is stored in a percpu counter. It makes it more expensive to take a write lock, but it means there's zero cacheline contention when just taking read locks (which is by far the common case for interior nodes!).
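For anyone curious, here's a userspace approximation of the per-CPU reader count idea (this is not the six-lock code, just a sketch; "per CPU" is faked by having the caller pass an index): readers only ever touch their own counter, and the writer pays for that by having to check all of them.

    /* Sketch of a "big reader" rwlock with per-CPU reader counts. */
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    #define NCPUS 64

    struct pcpu_rwlock {
        /* One reader count per CPU, padded so they don't share cachelines. */
        struct { _Atomic long readers; char pad[56]; } cpu[NCPUS];
        _Atomic bool writer;
    };

    static void read_lock(struct pcpu_rwlock *l, int cpu)
    {
        for (;;) {
            atomic_fetch_add(&l->cpu[cpu].readers, 1);
            if (!atomic_load(&l->writer))
                return;                 /* fast path: bumped our own counter only */
            atomic_fetch_sub(&l->cpu[cpu].readers, 1);
            while (atomic_load(&l->writer))
                sched_yield();          /* writer active, back off and wait */
        }
    }

    static void read_unlock(struct pcpu_rwlock *l, int cpu)
    {
        atomic_fetch_sub(&l->cpu[cpu].readers, 1);
    }

    static void write_lock(struct pcpu_rwlock *l)
    {
        bool expected = false;
        while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
            expected = false;           /* one writer at a time */

        /* The expensive part: wait for every per-CPU reader count to drain. */
        for (int i = 0; i < NCPUS; i++)
            while (atomic_load(&l->cpu[i].readers) > 0)
                sched_yield();
    }

    static void write_unlock(struct pcpu_rwlock *l)
    {
        atomic_store(&l->writer, false);
    }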
(kidding of course, i do)
edit: just realized this is _the_ Kent Overstreet!
Although my original post was in jest, thank you so much for all your hard work on bcachefs.
bcachefs has personally saved me so much time with its copy-on-write functionality, and for that i am very grateful
I never ask myself why we still use hashmaps or heaps or whatnot, so it makes me wonder if this is really an article about why Cedar does not use something else? (LSM-trees being the elephant in the room?)
Splay trees are good if you're not accessing them concurrently and you access items in order: the next item is always at the root.
Wraparounds are only a theoretical issue. There would have to be an exact multiple of 2^64 writes between a reader first checking the version and verifying it.
In the well-known access method described here, the writers access the shared data with mutual exclusion, i.e. they use locking.
The readers use no locking, but they access the shared data concurrently and optimistically, hoping that the access will succeed at the first attempt.
When the readers are unlucky, they must retry the access.
So there is locking used by writers for mutual exclusion and there is optimistic access used by the readers.
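A minimal sketch of that combination (essentially a seqlock): the writer takes a mutex and bumps a version counter around its update; readers take no lock, they just re-check the version and retry if a writer interfered. Everything is seq_cst here for simplicity; real implementations use weaker, carefully placed barriers.

    /* Seqlock-style sketch: locked writers, optimistic readers with retries. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic uint64_t version;        /* odd = write in progress */
    static _Atomic int shared_a, shared_b;  /* the protected data */

    static void write_pair(int a, int b)
    {
        pthread_mutex_lock(&write_mutex);   /* writers use mutual exclusion */
        atomic_fetch_add(&version, 1);      /* version becomes odd */
        atomic_store(&shared_a, a);
        atomic_store(&shared_b, b);
        atomic_fetch_add(&version, 1);      /* even again: update complete */
        pthread_mutex_unlock(&write_mutex);
    }

    static void read_pair(int *a, int *b)
    {
        uint64_t v;
        do {
            do {
                v = atomic_load(&version);
            } while (v & 1);                /* a writer is mid-update, wait */
            *a = atomic_load(&shared_a);
            *b = atomic_load(&shared_b);
            /* If the version moved, a writer interfered: retry the whole read. */
        } while (atomic_load(&version) != v);
    }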
There is no "optimistic locking", which is a contradiction in terms (locking is pessimistic).
In general, there are only 3 methods for accessing shared data: mutual exclusion (a.k.a. pessimistic access), where locking forces the accesses to be sequential, and 2 methods where accesses may be concurrent, optimistic access (a.k.a. lock-free), where retries may be necessary, and dynamic partitioning of the shared data (typically used for shared arrays or for shared buffers/queues), where neither locking nor retries are needed.
The method described here for accessing B-trees employs a combination of all 3 methods, because the release of the locks at higher levels is a consequence of restricting future accesses to only a part of the shared data: the writers that access the shared B-tree start by accessing the root sequentially, but then they partition the tree between themselves, so subsequent accesses that fall in distinct subtrees may proceed concurrently.
Understanding the difference between pessimistic access (mutual exclusion implemented by locking) and optimistic access (concurrent accesses with retries when necessary) is absolutely critical for the correct and efficient implementation of algorithms that use shared data structures.
It is frequent to combine both methods in an algorithm, like here, but in English that is not "optimistic locking", but at most "locking and optimistic access" or "optimism and locking".
Pessimistic access means that you expect that another process will attempt to access concurrently the shared data, so you must use a lock to prevent this. Optimistic access means that you expect that no other process will attempt to access concurrently the shared data, so you may proceed to access it immediately, but then you must have some means to detect that your assumption has been wrong and another process has interfered, when the transaction must be retried.
Depending on the application, either pessimistic access or optimistic access results in better performance; neither is always better than the other. Optimistic access (lock-free access) improves the best case, but it makes the worst case much worse. Depending on the frequency distribution of such cases, optimistic access increases or decreases the performance.
Pessimistic access and optimistic access have the advantage of being applicable to any kind of shared data structure, but dynamic data partitioning, where applicable, like for this shared B-tree example, which can be partitioned in sub-trees accessed concurrently, normally results in better performance than both pessimistic access and optimistic access, by being deterministic and avoiding both locking and retries. Dynamic data partitioning may require locking for a very short time in order to partition the shared resource before accessing the allocated part, though the mutual exclusion provided by atomic instructions may be sufficient for this purpose. It is frequent that an atomic fetch-and-add instruction is enough to partition a shared data structure, like a shared array or a shared message queue, between concurrent processes attempting to access it.
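As a concrete example of that last point, here is a toy sketch of partitioning a shared array with a single fetch-and-add: each worker claims a disjoint chunk, then works on it with no locks and no retries.

    /* Dynamic partitioning of a shared array via one atomic fetch-and-add. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 1000000
    #define CHUNK 4096
    #define NTHREADS 4

    static int data[N];
    static _Atomic long next_index;     /* the only shared, contended word */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            /* Claim the next chunk; no other thread can get the same range. */
            long start = atomic_fetch_add(&next_index, CHUNK);
            if (start >= N)
                break;
            long end = start + CHUNK < N ? start + CHUNK : N;
            for (long i = start; i < end; i++)
                data[i] = 1;            /* exclusive access, no locking needed */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        long sum = 0;
        for (long i = 0; i < N; i++)
            sum += data[i];
        printf("touched %ld of %d elements\n", sum, N);
        return 0;
    }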
Other than that, version checks + retries have been a thing since forever (they are the most bog-standard way to do lock-free data structures, and database updates are done the same way). They do need backoff, though.