Another paper of his that I really like combines a few elegant ideas into a simple implementation of bitvector rank/select: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
During this time I got really excited about succinct data structures and wrote a Rust library implementing many bitvector types and a wavelet matrix. [2]
My interest came from a data visualization perspective -- I was curious if space-efficient data structures could fundamentally improve the interactive exploration of large datasets on the client side. Happy to chat about that if anyone's curious.
[1] Paper: https://archive.yuri.is/pdfing/weighted_range_quantile_queri... though it's pretty hard to understand without some background context. I've been meaning to write a blog post explaining the core contribution, which is a simple tweak to one of Navarro's textbook data structures.
[2] The rust version is here: https://github.com/yurivish/made-of-bits/tree/main/rust-play... and an earlier pure-JS implementation is here: https://github.com/yurivish/made-of-bits/tree/main
I suspect some people are literally having conniption fits about this…
You can’t take a descriptive dictionary and then claim it is prescriptive.
IIRC both the OED and CED list figurative uses for the word, do you know any publications considered more authoritative than those for English? Webster too, for those who prefer simplified English.
The Académie does publish a dictionary every few decades, though; there was a new edition recently. So there is a prescriptive dictionary for French, even if it carries little weight in reality.
French is the only living language that attempts this to such an extent, though the existence of even one is enough to make my “there are none for living languages” point incorrect. It is difficult to pin a language down until no one really speaks it day-to-day (so it no longer evolves at the rate commonly used languages do).
I'm not sure what would even make a dictionary prescriptive other than an explicit declaration that it is so or, ridiculously, a law declaring the same.
There are no 100% correct descriptive dictionaries. Any prescriptive dictionary is automatically correct.
… in the view of their compilers.
I could write a prescriptive dictionary far more easily than I could get others to accept it as correct.
Yes, you would have to be involved with a regulatory institution first.
The "literally" ship sailed centuries ago. Sorry, but that battle has been lost. Even so-called "prescriptive" dictionaries would be categorically incorrect if they ignore nearly three centuries of common vernacular.
Having said that, I will join you in this fight.
See also: exponentially.
Is that important? Well, maybe. As I started digging in and looking for libraries that implement some of this stuff, I found that a popular graph processing library (JGraphT) uses a succinct data structure library (Sux4J) for working with large graphs. And working with graphs is something I've been doing a deep dive into lately. So yeah, if these data structures represent something that has practical applications in graph processing, then maybe finding this is important.
Certainly I'm glad I stumbled across this in either case; the topic strikes me as fascinating.
Once you have an entire system written, the benefits of having four interacting features that each fit into a third of memory may be bigger than you think. And I don't know how well Intel's hardware-level profiling information reveals that, but I know for sure that the profiling tools that ship with most commercially viable programming languages never do. They blame issues on whoever touched the system last, and if that is an epicycle or a metastable situation, then the apparent guilty party may be a frame-up while the real culprit goes unpunished.
As an example: if your workflow has a step that eventually needs 40% of available memory to solve a problem, then GC will almost always trigger within that step rather than in the code that systematically wastes the other 60% of memory in the lead-up to it, because the top of the memory sawtooth will almost always occur within that part of the call graph. But because the 40% is your intrinsic complexity, people's brains shut off the moment a cost is attributed to unavoidable work, instead of to the avoidable work that really caused the majority of the problem.
Succinct data structures are used in genomics (e.g. BWA, MEGAHIT) because N is so large that you're actually hitting asymptotic behavior.
For memory latency it can help, by making your memory footprint fit in the LLC or on a single node; cross-node NUMA latencies are typically enough to absolutely tank performance.
It can theoretically also help in massively parallel access situations where bandwidth becomes a limiting concern, although I suspect we would need near-memory hardware to perform the rank+select; otherwise the latency of the multiple dependent accesses will kill your performance again, cf. the previous point.
With a lot of parallel accesses, bandwidth could also be an issue in conventional structures.
I have used various bit-packing schemes in order to keep data in memory. If you can keep your entire dataset in memory, it opens up different ways to tackle it. Succinct data structures look like a way to enable this.
It's not storage access times that kill you. You need to rearrange all your processing to work in batches or windows or use some kind of query API to do anything. Everything becomes much more painful.
I can see it being useful for datasets of a few dozen MB as well.
There are also HaskellWorks packages like https://hackage.haskell.org/package/hw-xml
Thank you for sharing this link!
"Intuitively, a succinct data structure is one whose space usage equals the space needed to write out the data, plus something that grows more slowly than that. If you're familiar with little-o notation, a succinct data structure is one whose space usage is X + o(X), where X is the number of bits needed to write out the data itself."
Brings to mind COBS encoding, which does this for streams of bytes containing arbitrary-length "packets" or similar.
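For the curious, here's a minimal sketch of a COBS encoder in Go (not tied to any particular library; the function name and the trailing-delimiter handling are my own choices). Zero bytes are replaced with "distance to the next zero" code bytes, so the encoded packet contains no zeros and a bare 0x00 can frame packets in the stream, at a cost of at most one extra byte per 254 bytes of input, plus one:

    package main

    import "fmt"

    // cobsEncode is a minimal COBS encoder (hypothetical helper, no framing
    // delimiter appended). Each zero byte in src is replaced by a code byte
    // giving the distance to the next zero, so the output contains no zeros.
    func cobsEncode(src []byte) []byte {
        dst := make([]byte, 1, len(src)+len(src)/254+2) // byte 0 is the first code byte
        codeIdx, code := 0, byte(1)
        finish := func() {
            dst[codeIdx] = code
            codeIdx = len(dst)
            dst = append(dst, 0) // placeholder for the next code byte
            code = 1
        }
        for _, b := range src {
            if b == 0 {
                finish()
                continue
            }
            dst = append(dst, b)
            if code++; code == 0xFF { // a full 254-byte run with no zero in it
                finish()
            }
        }
        dst[codeIdx] = code
        return dst
    }

    func main() {
        fmt.Printf("% x\n", cobsEncode([]byte{0x11, 0x22, 0x00, 0x33})) // 03 11 22 02 33
    }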
Theoretically, this approach uses O(n log n) bits as opposed to o(n) for the theoretical approach, but in practice, for <2^64 bools, the actual storage ends up being n/32 + n/1024, which is pretty hard to beat. The theoretical approach gets its wins from making extremely clever use of the difference between O(log log n) and O(1), but unfortunately, for the foreseeable future, log n < 64 and log log n < 6, so all the subtlety gets swallowed up into the base case of a single popcnt instruction.
It feels kinda magic implementing these algorithms because everything becomes so tiny!
Together with his student (I also helped a bit), he wrote a paper about RecSplit, a minimal perfect hash function (MPHF) algorithm I invented: https://arxiv.org/abs/1910.06416 - that algorithm uses around 1.56 bits per key, but it is quite slow. I presented the paper at a conference in January 2020, right before the pandemic.
The algorithm with the lowest memory usage (and much faster as well) is now Consensus-RecSplit: https://arxiv.org/abs/2502.05613 - it can do 1.44 bits per key, which is right at the theoretical minimum (meaning there is no good way to shrink it further). At least a small part of my original invention is still used there, nice. The fastest current MPHF is probably PtrHash: https://arxiv.org/abs/2502.15539 - both papers were published last month (February 2025), by the way.
The wavelet matrix concept seems particularly promising for our text-heavy workloads. I'm curious if the constant-time operations actually hold up under real-world conditions or if there are hidden performance cliffs.
This feels like one of those CS concepts that should be more widely known but somehow got overlooked by mainstream programming. Kind of like how bloom filters were obscure until suddenly every system was using them.
Does your language have the concept of streaming files?
But when you get to hierarchical data structures like JSON/protobuf there very often simply isn't a streaming library available. There's a library function to decode the whole thing into an object in memory, and that's all.
Nothing prevents streaming in theory, it's just far more complicated to write that library.
e.g. the standard Go json library: https://pkg.go.dev/encoding/json#example-Decoder.Decode-Stre...
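The pattern there looks roughly like this (a condensed sketch in the spirit of that example; the Message type and the input literal are just for illustration):

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "strings"
    )

    func main() {
        // A large JSON array arriving over any io.Reader; elements are decoded
        // one at a time instead of materializing the whole array in memory.
        const stream = `[
            {"Name": "Ed", "Text": "Knock knock."},
            {"Name": "Sam", "Text": "Who's there?"},
            {"Name": "Ed", "Text": "Go fmt."}
        ]`
        type Message struct{ Name, Text string }

        dec := json.NewDecoder(strings.NewReader(stream))
        if _, err := dec.Token(); err != nil { // consume the opening '['
            log.Fatal(err)
        }
        for dec.More() { // loop over array elements without reading them all
            var m Message
            if err := dec.Decode(&m); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s: %s\n", m.Name, m.Text)
        }
        if _, err := dec.Token(); err != nil { // consume the closing ']'
            log.Fatal(err)
        }
    }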
Lesson for technologists: if you want to make the world a better place arrange your tech such that the lowest-friction path is also the correct path.
(Another example: disastrous multi-byte UTF encodings [correct solution was more friction] vs. basically successful UTF-8 [correct solution was less friction].)
I don't know if you're still dealing w/ this particular problem for protobufs, but based on my experience with thrift, a very similar library, there are probably some not too terrible ways you can kinda hack up the client-side parsing to be more streaming-ish...
[1] https://jpa.kapsi.fi/nanopb/docs/#features-and-limitations
It's very common that other programming languages have basic SAX parsers.
Which languages have you encountered that don't?
In Java, which I consider to have the best tooling for advanced XML applications, you'd look into JAXB on streams, StAX or SAX. On complicated and heavily nested XML it might take some effort and profiling to figure out the optimal state machines for exhaustive traversal, if that's what you do.
I'd also like to mention that XSLT is an often underappreciated approach.
I'm pretty sure these were used to store the built-in dictionaries on early mobile phones, especially in implementations of T9 and similar predictive-text programs.
Erik Demaine has a few great lectures on succinct data structures too: L17 and L18 on https://courses.csail.mit.edu/6.851/spring12/lectures/
The trick is that masking & popcount (both low-cycle CPU instructions in most modern CPUs¹) make this quite fast to access and thus suitable for in-memory representation.
The intended use is when presence of optional fields is known at allocation time and doesn't change afterwards, i.e. some object is built up into a dynamically allocated buffer whose size is shrunk down by omitting fields. Changing which fields are present afterwards requires reallocating the thing, which tends to make the entire pattern pointless.
¹ the real annoyance here is that almost all x86_64 CPUs have POPCNT, except the absolute earliest ones, but if you build e.g. some package for a Linux distribution without any CPU optimization flags it'll use a library fallback popcount routine rather than the instruction :(
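To make the pattern concrete, here's a tiny sketch of that access path, assuming uniform fixed-size slots packed after a 64-bit presence mask (the type and field layout here are made up for illustration):

    package main

    import (
        "fmt"
        "math/bits"
    )

    // packedRecord is a hypothetical record whose optional fields all occupy
    // one slot and are stored back to back after a presence bitmask.
    type packedRecord struct {
        present uint64   // bit i set => field i was stored
        slots   []uint64 // one slot per present field, in field-ID order
    }

    // fieldOffset returns the slot index of field id, or false if absent.
    // Masking off the bits at and above id and popcounting the rest tells us
    // how many present fields precede it: one AND plus one POPCNT.
    func (r *packedRecord) fieldOffset(id uint) (int, bool) {
        if r.present&(1<<id) == 0 {
            return 0, false
        }
        return bits.OnesCount64(r.present & ((1 << id) - 1)), true
    }

    func main() {
        r := &packedRecord{
            present: 1<<1 | 1<<4 | 1<<9,   // fields 1, 4, 9 are present
            slots:   []uint64{10, 40, 90}, // stored densely in that order
        }
        if i, ok := r.fieldOffset(4); ok {
            fmt.Println("field 4 lives in slot", i, "=", r.slots[i]) // slot 1 = 40
        }
    }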
Am I paying 30% overhead for this particular index or that wavelet matrix? Is it double the memory use? Or is it O(log N)? No idea! "doesn't use much more space" could mean a lot of different things!
> Am I paying 30% overhead for this particular index or that wavelet matrix?
Nope! That would not fit the definition. That would be a "compact data structure" according to this definition.
It's also very common in algorithm design that improving the asymptotic bounds and making the algorithm faster (or the data structure smaller) are orthogonal (or even opposite) goals. Real computers have complex performance characteristics and fixed word lengths, and it's rarely a good idea to implement a theoretical algorithm exactly as described.
Succinct data structures often have a number of internal parameters. In a theoretical parameterization, those parameters may be described as being O(log n), O(log^2 n), or O(log log n). In a concrete implementation, it may be a good idea to use constant approximations for some nontrivial (such as 2x) performance gains. O(log n) could become 32 or 64, O(log^2 n) could become 1024 (or a power-of-2 multiple), and O(log log n) could become 4 or 8.
And then, if you parameterize the succinct data structure with these constants, the space overhead becomes a constant fraction.
Consider the rank data structure for bitvectors that is supposed to guarantee O(1) time queries with o(n) bits of space overhead. The bitvector is divided into superblocks of b1 bits and blocks of b2 bits. The top-level index, which stores the rank up to each superblock, uses O(n log n / b1) bits of space. The second-level index, which stores the rank within the superblock up to each block, uses O(n log b1 / b2) bits of space. And then you need to do linear search within the block, which is typically assumed to take O(b2 / log n) or O(b2 / w) time, where w is word size. If you choose b1 = log^2 n and b2 = log n, you get O(1) time with O(n log log n / log n) bits of space overhead. Which is technically sublinear but effectively indistinguishable from linear with realistic values of n.
Real-world implementations use constant values for the parameters. Typically the goal is to get the blocks and indexes to align with cache lines and to replace arbitrary divisions with divisions by compile-time power-of-2 constants. Some values I've seen are (b1 = 512, b2 = 64) and (b1 = 65536, b2 = 512). In both cases, the overhead is linear.
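As a concrete illustration, here's a sketch of that two-level rank scheme with the constants b1 = 512 and b2 = 64 (types, field names, and counter widths are my own choices; select and bounds handling are left out):

    package main

    import (
        "fmt"
        "math/bits"
    )

    // rankIndex: superblock counters every 512 bits (8 words), block counters
    // every 64 bits (1 word). The overhead is a constant fraction of the bits.
    type rankIndex struct {
        words  []uint64 // the bit vector, 64 bits per word
        super  []uint64 // rank up to the start of each 512-bit superblock
        blocks []uint16 // rank within the superblock up to each 64-bit block
    }

    func newRankIndex(words []uint64) *rankIndex {
        r := &rankIndex{
            words:  words,
            super:  make([]uint64, (len(words)+7)/8),
            blocks: make([]uint16, len(words)),
        }
        var total uint64
        var within uint16
        for i, w := range words {
            if i%8 == 0 { // a new superblock starts every 8 words
                r.super[i/8] = total
                within = 0
            }
            r.blocks[i] = within
            c := bits.OnesCount64(w)
            total += uint64(c)
            within += uint16(c)
        }
        return r
    }

    // rank1 returns the number of 1-bits strictly before position pos
    // (pos must be < 64*len(words)): one superblock lookup, one block lookup,
    // one masked popcount.
    func (r *rankIndex) rank1(pos int) uint64 {
        word, bit := pos/64, uint(pos%64)
        return r.super[word/8] +
            uint64(r.blocks[word]) +
            uint64(bits.OnesCount64(r.words[word]&((1<<bit)-1)))
    }

    func main() {
        words := []uint64{0xF0F0F0F0F0F0F0F0, 0xFFFF, 0} // 32 + 16 + 0 one-bits
        r := newRankIndex(words)
        fmt.Println(r.rank1(64), r.rank1(80)) // 32, 48
    }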
And sometimes the implemented data structure is a simplification, because the nominally succinct data structure is too large and slow. For example, it's rare to see actual O(1)-time implementations of select on bitvectors. That would require three levels of indexes with many special cases. It's more common to use two levels of indexes, with queries that are almost always constant-time but have (poly)logarithmic worst cases with adversarial inputs.
Honestly, I never actually "trust" the complexity analysis. Whenever I find a paper I immediately look for their specific results on an experiment, and if I can't find that I will assume the paper is only of theoretical interest. There's of course still a lot to learn from a purely theoretical paper, but I've always ended up being disappointed when I've implemented something that only had "good" asymptotic bounds.
[1] https://pkg.go.dev/golang.org/x/tools/internal/astutil/curso...
[1] https://www.cambridge.org/core/books/compact-data-structures...
[2] https://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pd...
Summary for storing a list of 3M Russian words:
- ~60x less memory
- ~5-10x slower compared to a hashmap
Since information is rather scarce, here's my shameless plug for a related blog post I made, which includes a few links at the end to more resources: https://denvaar.dev/posts/jacobsons_rank.html
See also: the undergrad who recently found a breakthrough in hash tables and wasn't put off by a long-standing conjecture about the bound on their performance, because he simply wasn't aware of it.
But you can also simply store the tree structure as a single array of ints (or longs) ... each node in the tree corresponds to a unique index position, and the value at that position is the index position of the node's parent (or its own position if it's a top-level node) ... good stuff.
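For example, a minimal sketch of that parent-array representation (the depth helper is hypothetical, just to show the upward traversal):

    package main

    import "fmt"

    // depth walks parent pointers up to the root; parent[i] is the index of
    // node i's parent, and a root points at itself.
    func depth(parent []int, node int) int {
        d := 0
        for parent[node] != node {
            node = parent[node]
            d++
        }
        return d
    }

    func main() {
        // A small tree: 0 is the root, 1 and 2 are children of 0, 3 is a child of 1.
        parent := []int{0, 0, 0, 1}
        fmt.Println(depth(parent, 3)) // 2
        fmt.Println(depth(parent, 2)) // 1
    }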
Not sure if they are succinct, but the Apache Arrow format encodes data in several ways that are compact in memory but still allow operations on these structures.
The dense implementation is basically built around partitioning the bits into "blocks" of a given size, so that you can find the right block by doing `block[idx / block_size]`. It then also splits each block into sub-blocks, which narrows things down even further. All of these are additional data structures (very much like an index) stored next to the regular bitset. You use the blocks/sub-blocks to find roughly where in the bitset you are and then use an algorithm for finding the value within a given machine word.
The sparse implementation treats the bitset as an ordered list of numbers (e.g. 100001001 is interpreted as 0, 5, 8) and then stores those numbers using Elias-Fano encoding. The higher bits of the Elias-Fano encoding happen to be dense, so we can use the previous dense implementation on those higher bits and then combine it with the lower bits.
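Roughly, the Elias-Fano split looks like this (a construction-only sketch with made-up names; recovering values needs select over the high bits, which is where the dense machinery above comes back in):

    package main

    import (
        "fmt"
        "math/bits"
    )

    // eliasFano stores a sorted sequence of n values below some universe bound.
    // Each value contributes l low bits, packed contiguously, and a high part,
    // coded in unary: bit (v>>l)+i is set in `high` for the i-th value.
    type eliasFano struct {
        n, l int
        low  []uint64 // n fields of l bits each, packed
        high []uint64 // unary-coded high parts
    }

    func newEliasFano(values []uint64, universe uint64) *eliasFano {
        n := len(values)
        l := 0
        if n > 0 && universe/uint64(n) > 1 {
            l = bits.Len64(universe/uint64(n)) - 1 // roughly log2(universe/n)
        }
        ef := &eliasFano{
            n:    n,
            l:    l,
            low:  make([]uint64, (n*l+63)/64),
            high: make([]uint64, (int(universe>>uint(l))+n)/64+1),
        }
        for i, v := range values {
            if l > 0 { // pack the l low bits of v at bit offset i*l
                lo := v & (1<<uint(l) - 1)
                word, off := (i*l)/64, uint((i*l)%64)
                ef.low[word] |= lo << off
                if int(off)+l > 64 { // the field straddles a word boundary
                    ef.low[word+1] |= lo >> (64 - off)
                }
            }
            h := int(v>>uint(l)) + i // position of this value's unary bit
            ef.high[h/64] |= 1 << uint(h%64)
        }
        return ef
    }

    func main() {
        // The bitset 100001001 from above, read as the sorted values 0, 5, 8.
        ef := newEliasFano([]uint64{0, 5, 8}, 9)
        fmt.Printf("l=%d low=%b high=%b\n", ef.l, ef.low, ef.high)
    }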
I'm also aware of https://arxiv.org/abs/1706.00990 which is more about how to do this most efficiently at a machine-word level.
"Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors" is another quite recent paper which I haven't fully digested yet.
https://ics.uci.edu/~eppstein/261/BenFar-LCA-00.pdf
For those who can't read, I recommend this talk about it:
I'm imagining a meeting about some project design, and thinking about how it'd go if someone suggested using parentheses to represent the nodes of a tree. I imagine it'd get written off quickly. Not because it wouldn't work, but because of the complexity and learning curve involved.
This is crazy!