171 points by uds5501 2 days ago | 11 comments
  • ianbutler 2 days ago
    Ah optimistic locking. I implemented that on top of a radix tree to make a concurrent disk backed adaptive radix tree for text search in Rust.

    I was looking at a blog post talking about the early days of Algolia’s search engine when I decided I wanted to try my own implementation of their algorithm.

    https://github.com/iantbutler01/dart

    dart = disk backed adaptive radix tree

    The idea being that hot paths stay in memory and cold paths get shunted to disk; since text search tends to have a pretty regular set of queries people make, you get a really nice trade-off between speed and storage.
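
    Roughly, the shape of that hot/cold split in code. This is a hypothetical sketch, not DART's actual types: a child pointer is either resident in memory or a location on disk that gets paged in (and stays hot) the first time a lookup touches it.

    ```rust
    // Hypothetical sketch of the hot/cold split, not DART's real layout.
    enum Child {
        Hot(Box<Node>),                      // resident: stays on the fast path
        Cold { file_offset: u64, len: u32 }, // spilled: paged in on first touch
    }

    struct Node {
        // A real ART adapts the child container to the fanout (Node4/16/48/256);
        // a plain vector stands in for that here.
        children: Vec<(u8, Child)>,
    }

    impl Child {
        /// Fault a cold child in from disk, leaving it hot for later lookups.
        fn ensure_hot(&mut self, load: impl FnOnce(u64, u32) -> Node) -> &mut Node {
            if let Child::Cold { file_offset, len } = *self {
                *self = Child::Hot(Box::new(load(file_offset, len)));
            }
            match self {
                Child::Hot(node) => &mut **node,
                Child::Cold { .. } => unreachable!(),
            }
        }
    }
    ```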

  • msanlop 2 days ago
    Nice! I actually worked on this recently for my bachelor's project. We got some promising preliminary results that showed performance gains on B-trees by using "delegation" on top of optimistic lock coupling.

    The way delegation locks work is by having threads delegate their critical sections instead of getting exclusive access to shared memory. Think of a client/server model, where the server executes other cores' critical sections. For machines with very high core counts, the result is that the shared memory remains valid in the server's cache, instead of each new lock holder taking cache misses and fetching memory from other cores (or worse, other sockets).
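
    A minimal sketch of the client/server delegation idea, assuming a dedicated server thread and plain std::sync::mpsc channels (illustrative only, not the project's implementation): the server is the only thread that ever touches the shared structure, so it stays hot in its cache, and clients submit their critical sections as messages.

    ```rust
    use std::collections::BTreeMap;
    use std::sync::mpsc;
    use std::thread;

    // Requests that clients delegate to the server thread instead of
    // locking the shared map themselves.
    enum Request {
        Insert(u64, u64),
        Get(u64, mpsc::Sender<Option<u64>>),
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        // The "server": the only thread that ever touches the shared structure,
        // so the structure stays warm in this core's cache.
        let server = thread::spawn(move || {
            let mut map = BTreeMap::new();
            for req in rx {
                match req {
                    Request::Insert(k, v) => { map.insert(k, v); }
                    Request::Get(k, reply) => { let _ = reply.send(map.get(&k).copied()); }
                }
            }
        });

        // "Clients" delegate their critical sections by message instead of locking.
        let handles: Vec<_> = (0..4u64)
            .map(|i| {
                let tx = tx.clone();
                thread::spawn(move || {
                    tx.send(Request::Insert(i, i * 10)).unwrap();
                    let (rtx, rrx) = mpsc::channel();
                    tx.send(Request::Get(i, rtx)).unwrap();
                    // Per-sender ordering guarantees the insert is applied first.
                    assert_eq!(rrx.recv().unwrap(), Some(i * 10));
                })
            })
            .collect();

        for h in handles { h.join().unwrap(); }
        drop(tx); // close the channel so the server loop ends
        server.join().unwrap();
    }
    ```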

    • senderista a day ago
      • msanlop a day ago
        Indeed, delegation has generally come to mean that there is a dedicated server thread that executes critical sections, whereas combining indicates that any thread can become a combiner while acquiring a lock, for a certain period of time. For the project, we used queue-based combining, which seems to be similar to the lock in the paper.
  • cryptonector 2 days ago

      > B-Trees won’t get obsolete
      > 
      > <reasons>
    
    Among the many reasons that b-trees aren't going away is that they are prefix indices. This means that an index on multiple columns (or on one column whose values can have prefixes, like strings) can be searched with just a prefix of the key, which enables skip-scan type optimizations and makes the index applicable to many more queries than a hash index. It also means that you can have covering indices that serve as the actual table, where the primary key columns are the prefix for the rest -- look ma'! no rowids!
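
    A toy illustration of the prefix property, with Rust's BTreeMap standing in for a composite-key index (not any particular database's code): a sorted index on (last_name, first_name) can answer a query on last_name alone with a range scan, which a hash index on the full key cannot.

    ```rust
    use std::collections::BTreeMap;

    fn main() {
        // "Index" on (last_name, first_name) -> row id, kept as a sorted structure.
        let mut idx: BTreeMap<(String, String), u64> = BTreeMap::new();
        idx.insert(("smith".into(), "alice".into()), 1);
        idx.insert(("smith".into(), "bob".into()), 2);
        idx.insert(("jones".into(), "carol".into()), 3);

        // Query on a prefix of the composite key (last_name only): an ordered
        // index answers this with a range scan; a hash index on the full key can't.
        let start = ("smith".to_string(), String::new());
        for (key, row) in idx.range(start..).take_while(|(key, _)| key.0 == "smith") {
            println!("{}, {} -> row {}", key.0, key.1, row);
        }
    }
    ```
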
    • hinkley 15 hours ago
      Software architecture is dominated by a few inequalities that tend to remain in the same order most of the time, get closer or farther away from each other, and only occasionally reorder themselves because some big breakthrough in one kind of hardware has no equivalent in another and things get out of whack.

      When they get out of whack we try some really crazy algorithms that make sense when the inequalities invert, and then disappear when order is restored.

      One of the first times networking got faster than local storage, John Ousterhout (of Raft fame) and his group at Berkeley introduced a distributed operating system called Sprite (which featured heavily in the Distributed Computing class I took) that could not only run processes on different machines but could also do live process migration - in the 1980's. NICs getting faster than disk also brought us in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage.

      B-trees fit well with the dominant equation, so they never really go away.

      • twotwotwo 7 hours ago
        > in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage

        If we're talking caches and Redis, one thought about why they haven't gone away--

        In practice I think sharing is part of what keeps them network services, not only access to a cluster's RAM. As a dev you reach for something shared and 'shaped' like a cache, and the in-RAM aspect may be vestigial but still comes with the package.

        We use Redis when e.g. tracking stats on active client IPs to see which are up to shenanigans, a maybe common use where you want cluster-wide stats. When caching, we may still want one DB query per expiry period app-wide, not one per webserver.

        We would probably be fine 'caching' some stuff in the DB; Django's cache package has supported it for a while! I'm sure one of the disk-using successors of an in-RAM store could hit perf targets (DragonflyDB looks cool!). Just not a priority to change stuff that's working. (Corollary: I bet some folks unhappy with their Redis RAM needs are more than happy to try Dragonfly.)

    • hot_gril 2 days ago
      Also if you have roughly increasing row IDs, theoretically a btree is faster to insert into at large scale than a hash index, even if you don't care about prefix lookups or inequalities. And I think faster to read from.

      I'm not sure if that's why, but in practice, Postgres doesn't care too much for hash indexes. It technically has them, but they're non-default, not recommended, previously poorly supported, etc. The default is btree (and serial pk).

      • cryptonector a day ago
        But tables in PG are heaps, so appending to them is even faster than in b-trees.
        • dspillett a day ago
          This is specifically talking about indexed structures though, i.e. inserting data in a way that means it can be efficiently searched afterwards. Comparing heaps to b-trees, hash tables, and other such structures in this context isn't really a Granny Smith vs Golden Delicious style comparison.

          Many (though not all) databases allow the base data to be a heap for those situations where a pure heap or a heap with small supporting indexes is more efficient (very wide data for instance, particularly in intermediate structures for ETL processes (a longer process acronym might better describe this, say, ELTEL)).

        • avinassh a day ago
          > But tables in PG are heaps, so appending to them is even faster than in b-trees.

          elaborate? how is it different from, say, MySQL?

  • foota 2 days ago
    Funny timing, I've been looking at this recently. Some other well performing concurrent ordered data structures are ART (adaptive radix trees) and masstree. Skiplists are common since they're easier to implement lock free, but they don't seem to perform well relative to these alternatives.
    • judofyr 2 days ago
      You should check out Height-Oriented Tries as well. Their benchmarks show that it outperforms both ART and masstree.

      EDIT: Oops, yes, I meant Height-Optimized Trie! Sorry about that.

      • foota 2 days ago
        Will take a look, thanks! I think you meant height-optimized tries, since I had a tough time finding it at first :)

        Ideally I'd like to find a way to store a non-overlapping ordered list of intervals while being able to atomically (wrt other insertions) avoid inserting overlaps. This is part of a data structure tracking locked ranges, so being able to update them in parallel would be nice, since today it takes a lock over the whole tree while inserting new locked ranges.

        Probably, this would require something with next-value pointers. Maybe I could do some kind of scheme where to insert you have to first optimistically lock the predecessor along with the to-be-inserted node, and then insert the new node where it belongs?
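
        For what it's worth, the overlap check itself only needs the new interval's two neighbours, so even the coarse-locked version stays small. A sketch assuming half-open intervals keyed by their start and a single mutex standing in for today's whole-tree lock (names are made up):

        ```rust
        use std::collections::BTreeMap;
        use std::ops::Bound::{Excluded, Unbounded};
        use std::sync::Mutex;

        /// Locked ranges keyed by start; values are exclusive ends.
        struct LockTable {
            ranges: Mutex<BTreeMap<u64, u64>>,
        }

        impl LockTable {
            /// Insert [start, end) only if it overlaps nothing already present.
            fn try_insert(&self, start: u64, end: u64) -> bool {
                assert!(start < end);
                let mut ranges = self.ranges.lock().unwrap();

                // Predecessor: the last range beginning at or before `start`.
                if let Some((_, &prev_end)) = ranges.range(..=start).next_back() {
                    if prev_end > start {
                        return false; // overlaps the range to our left
                    }
                }
                // Successor: the first range beginning after `start`.
                if let Some((&next_start, _)) = ranges.range((Excluded(start), Unbounded)).next() {
                    if next_start < end {
                        return false; // overlaps the range to our right
                    }
                }
                ranges.insert(start, end);
                true
            }
        }
        ```

        The hard part is exactly the one you describe: doing that check-and-insert atomically without the big lock.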

        • mad a day ago
          You should also check out the Maple Tree (https://docs.kernel.org/core-api/maple_tree.html).

          It's the data structure used to track non-overlapping intervals in the Linux kernel's virtual memory subsystem.

          If you don't mind sharing, what's your use case for such a data structure?

  • jrockway 2 days ago
    Something I always wonder about is figuring out the optimal page size. The author seems to use 64KB. This is neither the size of a CPU cache line (often 64 bytes) nor the size of a block on an SSD (often 512KB). So why 64KB?
    • magnat 2 days ago
      If your data/leaf/heap pages hold indexed cells PostgreSQL-style [1], a 16-bit in-page offset limits your page size to 64KB (a rough sketch of that limit follows below). On the other hand, the physical memory frame size is usually 4KB or larger, so going below that is not really productive.

      Having said that, both PostgreSQL and Microsoft SQL Server have an 8KB page size, while the original SQLite had 1KB pages before switching to 4KB, while still supporting 64KB pages. On top of that, Microsoft SQL Server operates internally on extents of 8 pages (64 KB) instead of single pages, but still pretends to use 8 KB pages [2].

      In other words - no idea.

      [1] https://www.postgresql.org/docs/current/storage-page-layout....

      [2] https://learn.microsoft.com/en-us/sql/relational-databases/p...
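
      A rough sketch of the 64KB slot-offset limit mentioned above (a hypothetical layout, not PostgreSQL's or SQL Server's): cells are packed at the tail of the page and addressed through an array of u16 offsets, so the page itself can't exceed 64KiB.

      ```rust
      const PAGE_SIZE: usize = 64 * 1024;

      /// Hypothetical slotted page: cell bodies are packed at the tail of the
      /// page and a slot array of 16-bit offsets points at them. Because each
      /// slot is a u16 byte offset, the page can't be larger than 64 KiB.
      struct Page {
          bytes: Box<[u8; PAGE_SIZE]>,
          slots: Vec<u16>, // in a real engine this array lives inside `bytes` too
          free_end: usize, // cells occupy bytes[free_end..]
      }

      impl Page {
          fn new() -> Self {
              Page { bytes: Box::new([0u8; PAGE_SIZE]), slots: Vec::new(), free_end: PAGE_SIZE }
          }

          /// Copy a cell into the tail of the page and record its 16-bit offset.
          fn insert_cell(&mut self, cell: &[u8]) -> Option<u16> {
              if cell.is_empty() {
                  return None;
              }
              let start = self.free_end.checked_sub(cell.len())?;
              self.bytes[start..self.free_end].copy_from_slice(cell);
              self.free_end = start;
              let slot = start as u16; // only fits because PAGE_SIZE <= 65536
              self.slots.push(slot);
              Some(slot)
          }
      }
      ```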

      • msanlop 2 days ago
        I tried looking for papers on the subject but couldn't find much. There is recent research on in-memory B-trees, and it has been shown that smaller pages (sizes as low as 256/512B) are more performant due to better cache behavior [0]. The general wisdom seems to be that for disk-based databases IO performance is the main bottleneck, but again I couldn't find any concrete data/benchmarks about why those larger sizes were chosen.

        [0] There's even some cool research about B-trees that have the layout depend on the probabilistic distribution of the lookups: https://dl.acm.org/doi/10.1145/3592980.3595316

      • sroussey 2 days ago
        The default macOS page size is currently 16KB, and various (Apple) stuff is optimized for it, and I think SQLite is among them (not 100% sure). I think their server-optimized builds for internal use are 64KB.
      • o11c a day ago
        16-bit offsets don't limit you to 64K unless you require byte alignment. 128K or 256K is usually trivial and even 512K or 1M is often viable.
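
        Concretely (a hypothetical encoding, just to illustrate): if cells are 8-byte aligned you can store the offset divided by the alignment, and a u16 slot then addresses up to 512KiB.

        ```rust
        // Sketch: with 8-byte aligned cells, the stored slot value is the byte
        // offset divided by the alignment, so a u16 reaches ~512 KiB.
        const ALIGN: usize = 8;

        fn encode(byte_offset: usize) -> u16 {
            debug_assert!(byte_offset % ALIGN == 0 && byte_offset / ALIGN <= u16::MAX as usize);
            (byte_offset / ALIGN) as u16
        }

        fn decode(slot: u16) -> usize {
            slot as usize * ALIGN
        }

        fn main() {
            assert_eq!(decode(u16::MAX), 524_280); // just under 512 KiB
        }
        ```
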
    • shinycode 2 days ago
      From the article : « The 64KB root node of our B-Tree fits nicely into the L1/L2 cache of a modern CPU »

      Maybe that’s why?

      • jrockway a day ago
        I'm just looking for more depth. Like, is that all loaded in 1 load operation when the page is touched and is therefore optimal? Does disk access / writing dirty pages not matter?
        • menaerus a day ago
          No, 64K is definitely not loaded in one load operation. The maximum modern CPUs can do is 2x512-bit loads, i.e. 128 bytes per cycle, or around ~600 GB/s of L1 bandwidth; at that rate a 64KB node takes at least ~512 cycles of pure load throughput.

          That said, there is no universal page size one can choose. Some workloads will benefit from a smaller page size while others will benefit from a larger one, but 512K is not the size you will want to choose: read-, write-, and space-amplification, CPU cache thrashing, etc. all come into play.

    • remexre 2 days ago
      64KiB also happens to be the largest minimum page size you can configure reasonably common hardware to have (not that this is a common configuration).
    • tomnipotent 2 days ago
      Cedar uses a PAX storage layout (hybrid row/column) so I imagine 64KB optimizes well for OLAP workloads while keeping excess disk I/O for point look-ups manageable. Depending on your OLAP load that "excess" I/O could either be a rounding error or a bottleneck.
    • a u16 can address it, which is nice. it's also the minimum virtual address allocation on windows. not sure if either of those are the reason, in this case
  • koverstreet 2 days ago
    kids stuff :)

    bcachefs uses shared/intent/exclusive locks on nodes, so we only have to take write locks during transaction commit when we're doing the final update, and percpu read locks for interior nodes (and others).

    this means that there's basically zero lock contention on interior nodes, or even cacheline bouncing. lock contention only comes up when you've got different keys in the same leaf nodes.

    • foota a day ago
      I'm not familiar with bcachefs, so maybe I'm misunderstanding, but the b-tree discussed here also doesn't need to take locks except for when data is written?

      I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?

      • koverstreet a day ago
        > I'm not familiar with bcachefs, so maybe I'm misunderstanding, but the b-tree discussed here also doesn't need to take locks except for when data is written?

        Updates to the b-tree in memory require locks too :)

        > I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?

        https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/...

        This is the code for btree node locks, as you can see there's a mode where the reader count is stored in a percpu counter. It makes it more expensive to take a write lock, but it means there's zero cacheline contention when just taking read locks (which is by far the common case for interior nodes!).
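
        For readers not digging through the kernel source, here is a toy version of the distributed-reader-count idea (a simplified "big reader" lock, not bcachefs's actual SIX lock): each CPU gets its own cache-line-padded reader counter, so read lock/unlock never touch a shared line, while a writer pays by scanning every counter.

        ```rust
        use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering::SeqCst};

        const SLOTS: usize = 64; // ideally one per CPU

        // Pad each reader counter to its own cache line so read lock/unlock on
        // different CPUs never bounce the same line around.
        #[repr(align(64))]
        struct PaddedCounter(AtomicUsize);

        pub struct BigReaderLock {
            readers: [PaddedCounter; SLOTS],
            writer: AtomicBool,
        }

        impl BigReaderLock {
            pub fn new() -> Self {
                BigReaderLock {
                    readers: std::array::from_fn(|_| PaddedCounter(AtomicUsize::new(0))),
                    writer: AtomicBool::new(false),
                }
            }

            pub fn read_lock(&self, cpu: usize) {
                let slot = &self.readers[cpu % SLOTS].0;
                loop {
                    slot.fetch_add(1, SeqCst); // cheap: touches only our own line
                    if !self.writer.load(SeqCst) {
                        return;
                    }
                    slot.fetch_sub(1, SeqCst); // writer active: back off and retry
                    while self.writer.load(SeqCst) {
                        std::hint::spin_loop();
                    }
                }
            }

            pub fn read_unlock(&self, cpu: usize) {
                self.readers[cpu % SLOTS].0.fetch_sub(1, SeqCst);
            }

            pub fn write_lock(&self) {
                // Expensive side: announce the writer, then wait for every slot to drain.
                while self.writer.swap(true, SeqCst) {
                    std::hint::spin_loop();
                }
                for r in &self.readers {
                    while r.0.load(SeqCst) != 0 {
                        std::hint::spin_loop();
                    }
                }
            }

            pub fn write_unlock(&self) {
                self.writer.store(false, SeqCst);
            }
        }
        ```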

    • nullpoint420 2 days ago
      bcachefs may use zero lock contention, but does anyone use bcachefs?

      (kidding of course, i do)

      edit: just realized this is _the_ Kent Overstreet!

      Although my original post was in jest, thank you so much for all your hard work on bcachefs.

      bcachefs has personally saved me so much time with its copy-on-write functionality, and for that i am very grateful

      • foota a day ago
        My new hobby, labelling my unused code as zero contention, because if it's not running there's zero contention :-)
  • emmanueloga_ 2 days ago
    I wonder about the angle of the article, starting with "B-Trees stand the test of time" and ending with "everything else has seriously diminishing returns".

    I never ask myself why we still use hashmaps or heaps or whatnot, so it makes me wonder whether this is really an article about why Cedar does not use something else (LSM-trees being the elephant in the room?).

    • foota a day ago
      This is primarily for in-memory data. You need something for the base layer of an LSM-tree no matter how you do it.
  • NightMKoder a day ago
    Talking of trees and caches, back in school I remember learning about splay trees. I’ve never actually seen one used in a production system though, I assume because of the poor concurrency. Has anyone heard of any systems with tree rebalancing based on the workload (ie reads too not just writes)?
    • selamtux a day ago
      maybe weighted trees? hot paths get a lower weight so they sit closer to the root.

      splay trees are good if you are not accessing them concurrently and you access items in order; the next item always ends up at the root

  • vander_elst a day ago
    A bit of a tangential question. Does anyone have any experience with cedars?
  • uds5501 2 days ago
    [not-author] but found it really fascinating. My open questions:

    - is the version monotonically increasing though?
    - how would one handle wraparounds once the versions don't fit the datatype?
    - who decides the version to be assigned to a thread's attempt?
    • lvogel 2 days ago
      The version is just a monotonically increasing atomic integer assigned to that btree node. Each writer increments the version when it releases the lock IF it has modified the node.

      Wraparounds are only a theoretical issue. There would have to be exactly 2^64 modifying writers (or a multiple of that) between a reader first checking the version and verifying it again.
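
      For anyone who wants the protocol spelled out, here is a sketch of the usual single-word variant, where the low bit doubles as the write-lock flag and the remaining bits are the version. This is the general optimistic-lock-coupling scheme, not necessarily the article's exact layout, and the memory orderings are simplified.

      ```rust
      use std::sync::atomic::{AtomicU64, Ordering};

      // A node's lock word: bit 0 = write-locked, upper bits = version counter.
      pub struct OptLock {
          word: AtomicU64,
      }

      impl OptLock {
          const LOCKED: u64 = 1;

          pub fn new() -> Self {
              OptLock { word: AtomicU64::new(0) }
          }

          /// Start an optimistic read: returns the version to validate against,
          /// or None if a writer currently holds the node.
          pub fn read_begin(&self) -> Option<u64> {
              let v = self.word.load(Ordering::Acquire);
              if v & Self::LOCKED != 0 { None } else { Some(v) }
          }

          /// Validate after reading: true if no writer committed in between.
          /// (Production code needs careful fences around the data reads.)
          pub fn read_validate(&self, start: u64) -> bool {
              self.word.load(Ordering::Acquire) == start
          }

          /// Acquire the write lock by setting the low bit.
          pub fn write_lock(&self) -> u64 {
              loop {
                  let v = self.word.load(Ordering::Relaxed);
                  if v & Self::LOCKED == 0
                      && self
                          .word
                          .compare_exchange_weak(v, v | Self::LOCKED, Ordering::Acquire, Ordering::Relaxed)
                          .is_ok()
                  {
                      return v;
                  }
                  std::hint::spin_loop();
              }
          }

          /// Release, publishing a new version only if the node was modified.
          pub fn write_unlock(&self, prev: u64, modified: bool) {
              let next = if modified { prev.wrapping_add(2) } else { prev };
              self.word.store(next, Ordering::Release);
          }
      }
      ```

      A reader copies whatever fields it needs between read_begin and read_validate, and retries the whole read if validation fails.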

      • adrian_b 2 days ago
        "Optimistic locking" is a bad choice of words because no locking is optimistic.

        In the well-known access method described here, the writers access the shared data with mutual exclusion, i.e. they use locking.

        The readers use no locking, but they access the shared data concurrently and optimistically, hoping that the access will succeed at the first attempt.

        When the readers are unlucky, they must retry the access.

        So there is locking used by writers for mutual exclusion and there is optimistic access used by the readers.

        There is no "optimistic locking", which is a contradiction in terms (locking is pessimistic).

        In general, there are only 3 methods for accessing shared data: mutual exclusion (a.k.a. pessimistic access), where locking forces the accesses to be sequential; optimistic access (a.k.a. lock-free), where accesses may proceed concurrently but retries may be necessary; and dynamic partitioning of the shared data (typically used for shared arrays or for shared buffers/queues), where neither locking nor retries are needed.

        The method described here for accessing B-trees employs a combination of all 3, because releasing the locks at the higher levels is a consequence of restricting future accesses to only a part of the shared data: the writers that access the shared B-tree start by accessing the root sequentially, but then they partition the tree between themselves, so subsequent accesses that fall in distinct subtrees may proceed concurrently.

        • layer8 2 days ago
          “Optimistic locking” is a well-established and widespread terminology though, not the least due to its catchiness. The more factual, but unwieldy term is “optimistic concurrency control”.
          • adrian_b a day ago
            I agree that it is widespread, but whenever you see such illogical terms you have to wonder whether the authors who use them do not understand what they are really doing, or whether they do understand but succumb to conforming with a widespread inappropriate usage in the hope of being better understood by naive readers.

            Understanding the difference between pessimistic access (mutual exclusion implemented by locking) and optimistic access (concurrent accesses with retries when necessary) is absolutely critical for the correct and efficient implementation of algorithms that use shared data structures.

            It is frequent to combine both methods in an algorithm, like here, but in English that is not "optimistic locking", but at most "locking and optimistic access" or "optimism and locking".

            Pessimistic access means that you expect that another process will attempt to access concurrently the shared data, so you must use a lock to prevent this. Optimistic access means that you expect that no other process will attempt to access concurrently the shared data, so you may proceed to access it immediately, but then you must have some means to detect that your assumption has been wrong and another process has interfered, when the transaction must be retried.

            Depending on the application, either pessimistic or optimistic access results in better performance; neither is always better than the other. Optimistic (lock-free) access improves the best case, but it makes the worst case much worse. Depending on the frequency distribution of those cases, optimistic access either increases or decreases performance.

            Pessimistic access and optimistic access have the advantage of being applicable to any kind of shared data structure, but dynamic data partitioning, where applicable, like for this shared B-tree example, which can be partitioned in sub-trees accessed concurrently, normally results in better performance than both pessimistic access and optimistic access, by being deterministic and avoiding both locking and retries. Dynamic data partitioning may require locking for a very short time in order to partition the shared resource before accessing the allocated part, though the mutual exclusion provided by atomic instructions may be sufficient for this purpose. It is frequent that an atomic fetch-and-add instruction is enough to partition a shared data structure, like a shared array or a shared message queue, between concurrent processes attempting to access it.
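
            To make the fetch-and-add point concrete, a small sketch (illustrative, not from any particular system): threads claim disjoint chunks of a shared array with one atomic instruction each, so the data itself is accessed with neither locks nor retries.

            ```rust
            use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
            use std::thread;

            fn main() {
                const N: usize = 1_000;
                const CHUNK: usize = 64;

                let data: Vec<AtomicU64> = (0..N).map(|_| AtomicU64::new(0)).collect();
                let next = AtomicUsize::new(0);

                thread::scope(|s| {
                    for _ in 0..4 {
                        s.spawn(|| loop {
                            // One atomic instruction hands this thread an exclusive range.
                            let start = next.fetch_add(CHUNK, Ordering::Relaxed);
                            if start >= N {
                                break;
                            }
                            for i in start..(start + CHUNK).min(N) {
                                data[i].store((i as u64) * 2, Ordering::Relaxed);
                            }
                        });
                    }
                });

                assert!(data
                    .iter()
                    .enumerate()
                    .all(|(i, x)| x.load(Ordering::Relaxed) == (i as u64) * 2));
            }
            ```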

      • xxs 2 days ago
        I chuckle every time anyone has tried to handle long (64-bit) overflows while incrementing by one.

        Other than that, version checks + retries have been a thing since forever (they are the most bog-standard way to do any lock-free data structure, along with database updates done in the same manner). They do need a backoff, though.

      • pfent 2 days ago
        When incrementing the version at 4 GHz, it takes over 100 years of non-stop increments for a 64-bit wraparound (2^64 / 4x10^9 per second is roughly 146 years).
        • xxs 2 days ago
          likely even more, as it has to be an atomic operation, which adds extra latency and coherency traffic
  • kikimora a day ago
    How do you implement a concurrency-safe copy operation? Especially if the data structure you copy is 64KB.
    • mrkeen a day ago
      The first question I always ask about concurrency is: can I exploit immutability?

      Once something doesn't change, your copy is a no-op, so size doesn't really matter.
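
      A minimal illustration in Rust (assuming the 64KB structure really is immutable once built): the "copy" is just a reference-count bump, and an update would build a new version and swap which pointer readers see.

      ```rust
      use std::sync::Arc;
      use std::thread;

      fn main() {
          // A 64 KiB node that is never mutated after construction.
          let node: Arc<[u8]> = Arc::from(vec![0u8; 64 * 1024]);

          // "Copying" it for another thread is just an atomic refcount bump;
          // both threads read the same immutable bytes.
          let shared = Arc::clone(&node);
          let t = thread::spawn(move || shared.len());
          assert_eq!(t.join().unwrap(), node.len());
      }
      ```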