I remember some strange code (such as pushing return values 4k above the stack, with a comment like "this works as long as the caller doesn't use more than 4k of stack space before accessing the return value"), and the author also shared some unconventional opinions about undefined behavior (like "Compilers are deterministic, if I know what platform I'm compiling to then no behavior is undefined. And if compiler authors disagree, they are morons.")
But presumably it's thoroughly tested, so those aren't problems in practice? Would be really interested to hear from people who've actually used it. I've mainly stuck to SQLite instead.
Edit: I also tried using it for larger blobs of data (like audio) but ended up only storing a reference to shared memory for larger blocks, anything larger than IIRC 4k that can't be stored in a single node kills performance, but for small values it seems pretty great.
And yes, I ensured there were no outstanding long lived readers, verified with mdb_stat -r. My workload used one transaction per read/write anyway (never needed larger atomicity). Once the db got into the bad state, running my program on it would almost immediately run into the issue again, so I really think the db is in a bad state such that most writes would cause it to hang, not related to how I do transactions. This workload would pretty consistently hit the issue once the db got to several hundred gb.
Issue #10236 on the OpenLDAP bug tracker might be the root cause, who knows. It's been marked CONFIRMED for years without a fix, while other similar issues are created.
This is extremely annoying. It seems workload dependent (other workloads I've run create absolutely massive lmdb dbs without this issue) and once it happens your only recourse is to make a new db and copy the contents over (thankfully reads still work fine on these borked dbs).
Other than that, though, it's great. Never in any case had actual data corruption, and reads and writes are extremely fast (until this issue happens)
Edit: fun fact, since shopify may have created Bolt in response to this bug, and then Bolt was the root cause of the 73-hour Roblox downtime in 2021, this bug may indirectly have caused one of the worst outages ever!
I had a situation where the web service's writes were slowing down to an unbearable crawl because the number of entries in the database were reaching tens of billions of entries. Thankfully, the users never experienced the slowness. The website stayed nice and fast, even though the background updates were extraordinarily slow. The issue was fixed by sharding the databases.
Then what’s the point of memory mapping in the fist place? Or do they suggest manual flush/sync actions for persistence.
It is really reliable except write performance in my experience.
Author of it writes very spicy stuff and sounds pretty rude.
I would recommend doing a prototype with real data scale and testing if it meets your requirements. The write performance can be really atrocious and It doesn't have a high performance potential because it is based on memmap.
I generally do think read-write mode would offer higher write performance than read only as well :)
> The memory map can be used as a read-only or read-write map.
So presumably lmdb writes to the database using the `pwrite` syscall by default, but can optionally write via the mmap instead - if you are willing to accept the increased risk of accidental data corruption.
- support for incremental backup
- support for page-level checksums and encryption
- support for DB on raw block devices
- support for 2-phase commit
- support for page sizes up to 64KB
plus other minor additions to the API.
Anyway, of course in case you feel the website is a risk, you should refrain from using it. Safety comes first.
You can seal memfds too, which means that the "read-only" mode is easy to implement: just map your memfd for write, apply F_SEAL_FUTURE_WRITE, and share the memfd to anyone you want to have read-only access.
By doing your own O_DIRECT IO instead of relying on the kernel's defaults, you get a lot more control. You choose how much readahead to do; you choose your read-cluster size. You choose your cache eviction strategy. You choose when to write back.
BTW: O_DIRECT can also be done asynchronously using aio or io_uring. There's no such thing as an asynchronous page fault. And IO errors? Would you rather deal with EIO or SIGBUS?
Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.
And it's not any faster either. O_DIRECT is DMA. A page cache fill is also DMA. It's the same operation, spelled differently.
Uncommonly used system calls give user-space programmers the sensation of learning something.
> Why would you want the kernel to do these things for you? It'll do a worse job: it has less information than you do and has to use blunt heuristics that work sort-of-good-enough for the whole world, not just your program.
Yes, you're opting into non-determinism you don't control. When resources get constrained and everything can't be in memory and someone asks you why the database sucks, all you'll be able to do is shrug. Anyone who builds critical systems would never rely on the kernel making decisions like this. Don't use LMDB for anything that matters.
Shrinker-BPF attached to a memfd perhaps? A BPF shrinker could not only choose which pages to evict in a non-stupid way, but could notify userspace in some sane manner (e.g. setting a bitmask somewhere) that it's done so.
(Zero-fill as "notification" is insane and doesn't actually work because zero is a perfectly valid value in a lot of contexts.)