272 points by todsacerdoti 6 days ago | 27 comments
  • c-linkage6 days ago
    Interning strings saves a ton of space. I wish more programmers would use it.

    Back in the 32-bit days I was working on a large (multi GB) distributed Oracle database system but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.

    Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

    The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.

    I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.

    Since each string was only stored once, and every field was a pointer to a string, if two database records had the same values then they must by definition point to the same strings. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other being the record from database B.

    Not only was the system incredibly fast, but it never took more than 150MB of memory to run.
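
    A rough sketch of the same idea, in Rust rather than the original C (the Interner and records_equal names here are made up for illustration): every field is interned once, a record is just an array of pointers to the canonical strings, and record comparison becomes a pointer-by-pointer check.

        use std::collections::HashSet;
        use std::sync::Arc;

        // Toy interner: stores each distinct string once and hands out
        // shared pointers to the canonical copy.
        #[derive(Default)]
        struct Interner {
            strings: HashSet<Arc<str>>,
        }

        impl Interner {
            fn intern(&mut self, s: &str) -> Arc<str> {
                if let Some(existing) = self.strings.get(s) {
                    return Arc::clone(existing);
                }
                let owned: Arc<str> = Arc::from(s);
                self.strings.insert(Arc::clone(&owned));
                owned
            }
        }

        // A database record is just its interned fields, in column order.
        type Record = Vec<Arc<str>>;

        // Equal fields point at the same allocation, so comparing two records is a
        // pointer-by-pointer comparison -- the memcmp() of the story above.
        fn records_equal(a: &Record, b: &Record) -> bool {
            a.len() == b.len() && a.iter().zip(b).all(|(x, y)| Arc::ptr_eq(x, y))
        }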

    • gopalv6 days ago
      This is usually the real reason interning gets used: to avoid long string comparisons, rather than to save memory as such.

      Interned strings tend not to have a good cleanup mechanism in systems that churn through a lot of them, so they often end up actually using more memory as data patterns evolve.

      I use the same trick when parsing JSON, where a large set of rows tends to have the same keys repeated, and the conversion to columnar is easier if the keys are interned.

      • o11c6 days ago
        If your language supports good strong/weak references and containers thereof, cleaning up dead interned strings isn't hard. I'm not aware of any language that provides this out-of-the-box, unfortunately.

        Why do so many languages make weak references such second-class citizens? Why do all containers suck so much that you have to implement your own (and that's hoping the language is actually efficient enough to let you)?

        • rscho6 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          Many lisps. Racket, for example.

        • ayuhito6 days ago
          Go recently added new weak pointers and a string interning package to its standard library, which is an interesting read.

          [0] https://go.dev/blog/unique

          • ignoramous6 days ago
            > string interning package to its standard library

            TFA literally says interning isn't there in Go yet.

              While the unique package is useful, Make is admittedly not quite like Intern for strings, since the Handle[T] is required to keep a string from being deleted from the internal map. This means you need to modify your code to retain handles as well as strings.
            
            > an interesting read

            Tailscale's attempt at implementing "unique" was quite... something: https://tailscale.com/blog/netaddr-new-ip-type-for-go

            • ayuhito5 days ago
              The Tailscale article taught me a lot! Thanks for sharing.
        • igouy5 days ago
          WeakArray? Ephemerons?

          1997 "Ephemerons: A New Finalization Mechanism"

          https://dl.acm.org/doi/pdf/10.1145/263698.263733

          "Linked weak reference arrays: A hybrid approach to efficient bulk finalization"

          https://www.sciencedirect.com/science/article/pii/S016764232...

        • wongarsu6 days ago
          > I'm not aware of any language that provides this out-of-the-box, unfortunately

          The currently most prominent example would be Rust. Rc<T> is a simple generic container that implements reference counting, and any instance of it can be downgraded to a Weak<T>. Or Arc<T> and std::sync::Weak<T> if it needs to be thread safe.
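
          A minimal single-threaded sketch of an intern table built from those pieces (WeakInterner and purge are made-up names, not a standard API): the table only holds Weak references, so an interned string can be freed once the last outside Rc is dropped.

              use std::collections::HashMap;
              use std::rc::{Rc, Weak};

              #[derive(Default)]
              struct WeakInterner {
                  map: HashMap<String, Weak<str>>,
              }

              impl WeakInterner {
                  fn intern(&mut self, s: &str) -> Rc<str> {
                      if let Some(alive) = self.map.get(s).and_then(Weak::upgrade) {
                          return alive; // still referenced somewhere: reuse it
                      }
                      let strong: Rc<str> = Rc::from(s);
                      self.map.insert(s.to_owned(), Rc::downgrade(&strong));
                      strong
                  }

                  // The map holds only weak references, so entries die when the last
                  // outside Rc is dropped; this sweeps the dead keys out of the map.
                  fn purge(&mut self) {
                      self.map.retain(|_, w| w.strong_count() > 0);
                  }
              }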

          • o11c6 days ago
            I've done it in C++, so Rust is probably capable of it if you add enough layers of Rc, RefCell, and whatever else it requires to fit into its restricted worldview.

            Does Rust actually have container implementations that do all of the following:

            * When walking the container (either iterating or looking up, even through a non-mutable reference), call a user-provided predicate (not just builtin "weak" like many languages have via weakset/weakkeymap/weakvaluemap) to detect if a node should be considered "dead", and if so transparently remove the node. [In my experience this is relatively easy to add when you're implementing the container algorithms yourself, though I've never done it for bulk algorithms yet.]

            * When looking up a key (which may have different type or identity), the lookup returns the actual key the container had. [This may be impossible for container implementations that split the key.]

            • whytevuhuni6 days ago
              Mutating a container through a shared reference means the container either has to be single-threaded (not marked as Sync/Send), or be thread-safe.

              The single-threaded ones are easy to make, but Rust will prevent you from sending them to another thread, which is probably something you want.

              For thread-safe things, look into the crossbeam crate, it has really good collections.

              One I worked with was dashmap, which has a .retain() method [1] that works over a shared map reference, but runs a closure which gets each key and mutable access to each value, and decides whether to keep the pair or not.

              Its .get() [2] uses equality (so you can use a different object), but returns a reference to the original key-value pair. The .get_mut() will return it as mutable, but inside a guard that keeps the item locked until it goes out of scope.

              [1] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...

              [2] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...
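
              A rough sketch of interning on top of dashmap along those lines (just an illustration of the API described above, not code from the article or from dashmap's docs): intern() hands back the canonical Arc<str>, and a periodic retain() sweep drops entries that nobody outside the map still holds.

                  use dashmap::DashMap;
                  use std::sync::Arc;

                  struct ConcurrentInterner {
                      map: DashMap<Arc<str>, ()>,
                  }

                  impl ConcurrentInterner {
                      fn new() -> Self {
                          Self { map: DashMap::new() }
                      }

                      fn intern(&self, s: &str) -> Arc<str> {
                          if let Some(entry) = self.map.get(s) {
                              return Arc::clone(entry.key()); // hand back the canonical copy
                          }
                          // Note: get-then-insert isn't atomic; a real version would use entry().
                          let owned: Arc<str> = Arc::from(s);
                          self.map.insert(Arc::clone(&owned), ());
                          owned
                      }

                      // Batched O(n) sweep: keep only entries that still have an owner
                      // besides the map itself.
                      fn purge(&self) {
                          self.map.retain(|k, _| Arc::strong_count(k) > 1);
                      }
                  }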

              • o11c6 days ago
                `.retain()` unfortunately isn't what I'm talking about, since it's at least O(n) per call. It might be better if you can arrange to call it just once after a batch of operations, but that isn't always the case.

                "Delete as you go" usually† adds no space/time complexity to the operations you're already doing (though it does affect the constant factor). It does mean you're giving up on predictable destructor calls, but generally the "heavy" destructor was called by whoever made it expire (e.g. the death of the last remaining shared owner if that's what expiry is based on).

                † If many expirations happen at once, this can easily drop to merely amortized time. It's also possible for a container to offer, say, O(1) iteration via dedicated pointers but have a delete operation that's more complicated.

            • duped6 days ago
              To the first question, not really, and if it did it would be pretty fragile because of mutability requirements. It's fragile in C++ too because of iterator invalidation, Rust mostly turns that into a compiler error.

              To the second question, yes, it's super common.

              • o11c6 days ago
                I didn't have any problem with iterator/pointer invalidation (easy to merge into the same thing), since holding an active iterator inhibited expiration. It's possible to imagine an expiration system that doesn't automatically guarantee this but I didn't have one; the only non-weak-based expiry I had was based on timers, and it's sanest to only count time as elapsing at the heart of the event loop.
        • jdougan6 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          Smalltalk has this. Class "Symbol".

        • crabbone5 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          ActionScript 3 had it :D

      • kazinator6 days ago
        In Lisp, interning is not only used for saving on string comparisons. It's the basis of the symbol abstraction. Or perhaps not the basis, but interning is the only way symbols can correspond to a printed representation. Without interning, two occurrences of A in source text would not read as the same object.

        A symbol being an object with a durable identity is important because it can have properties other than just the string which gives it a name.

    • kccqzy6 days ago
      Here's another perspective: I was once asked to improve a custom data caching system that took too much memory even though it had string interning. (At the time, eviction had to be based on factors other than memory used.) String interning certainly helped with memory use but it wasn't enough. Eventually my solution was to compress each record of the dataset in memory. I found that this saved more memory than interning individual strings within each record. At the time I picked Google's snappy compression algorithm since it was already a dependency of the project, but these days I might have picked zstd with a negative compression level or lz4.

      This just goes to show that if your primary goal is to save memory, you should consider storing data compressed and decompressing on the fly when it's used. Modern compression algorithms are good and fast on moderately sized textual data. You might be surprised to learn how fast decompression is. There are of course other benefits to interning, like using pointer equality as string equality, but these weren't a factor in my project.
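
      A minimal sketch of that compress-each-record idea, assuming the zstd crate and records that are already serialized to bytes (CompressedRecord is a made-up name; the original system used snappy):

          use std::io;

          struct CompressedRecord {
              bytes: Vec<u8>, // zstd-compressed serialized record
          }

          impl CompressedRecord {
              fn new(serialized: &[u8]) -> io::Result<Self> {
                  // Negative levels are zstd's "fast" levels: worse ratio, higher speed.
                  Ok(Self { bytes: zstd::encode_all(serialized, -1)? })
              }

              // Decompress on the fly whenever the record is actually used.
              fn get(&self) -> io::Result<Vec<u8>> {
                  zstd::decode_all(self.bytes.as_slice())
              }
          }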

      • zerd6 days ago
        We did a very similar thing in a previous project. Used protostuff to serialize each object, and snappy at the time to compress it. And a small proxy class to access it that decompressed on demand. Costs a few microseconds on read, but saved 4x the space (20GiB -> 5GiB IIRC). This was after interning etc. I'm sure you can save a lot more space if you compress batches of objects and store it column oriented, but will take a bit longer to decompress individual objects.
      • nostrademons6 days ago
        Many compression algorithms are effectively string interning generalized to arbitrary binary data: they adaptively pick the most-repeated substrings and assign them the smallest bit representations. That's why formats like XML and JSON compress so well: all those repeated string keys get stored once and then become sub-byte entries in a lookup table.
        • kccqzy6 days ago
          Good point! And since compression can use sub-byte entries in its lookup table, it's no wonder that a simplistic string interning scheme using pointer-sized entries might not reduce memory as effectively.
    • viraptor6 days ago
      So, a fun story about how interning numbers can go wrong: when compiling Apple's UI designs from the XML-based xib format to a binary nib, their compiler uses lots of interning. It's pretty cool for avoiding 20 copies of an empty string, for example. But then they messed up and did the same thing with numbers while ignoring the type... which means that if you have a value 5.0 as a size somewhere, then your sequentially assigned ids will be: 1, 2, 3, 4, 5.0, 6, ...
    • MrLeap6 days ago
      > Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

      Floating point weirdsies between systems are a commonly encountered quirk in multiplayer gamedev. It's the source of many recommendations that you do physics in fixed point.

      • nyrikki6 days ago
        In Oracle, a FLOAT by default has binary precision 126 and is stored in up to 22 bytes.

        Probably something to do with ANSI history.

        It is even more painful than IEEE 754 ambiguity.

    • hobs5 days ago
      That's funny because I just have the database hash the rows and move on with my life, but that might have been a PITA back then.
    • kazinator6 days ago
      > I wish more programmers would use it.

      For instance, people writing blogs to explain Lisp, who then present some Javascript crud in which symbols are just character strings.

    • galkk5 days ago
      String.Intern has existed since .NET 1.1. If that was the only problem, then one could've, you know, just read the docs?

      https://learn.microsoft.com/en-us/dotnet/api/system.string.i...

  • trevor-e6 days ago
    Great write-up!

    Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.

    • SnowflakeOnIce6 days ago
      "String interning" is discussed in the Java Language Spec, going back to the 1990s, and wasn't novel there. It goes back at least to Lisp implementations from the 1970s. Probably even earlier than that.
    • umanwizard6 days ago
      Interning is a very common name. "String lookup table" would be too broad, it would mean anything that associates a string with a key.
    • mannyv6 days ago
      So interning is really just normalization/dictionarying your strings.
      • cb3216 days ago
        "pointerizing" might be another name. It would be utterly unsurprising if the database/sql world had more than one other name as well. Usually, ideas that go back to the 1950s at the dawn of data representation have many names.

        Maybe worth noting, you only need the dictionary in one direction (converting an expanded string to a pointer by looking it up). In the other direction, you just need to indirect the pointer and most people don't call an address space indexed by integers a "dictionary" (though, I guess the OG Javascript did merge these two ideas).

        Also notable, how wide a pointer you need (8-bit..64-bit) and how often you do each direction just depends upon workload. This one-direction kind of matters since sometimes your data set is so static that you can just convert to pointers up-front. Then your data analysis doesn't even need the expanded->pointer dictionary. With larger pointers (like 16..64-bit offsets into a file), you just need the file (and whatever data &| files reference it via the pointers).

        Personally, I've run into many "enum-scale" scenarios where 8-bits was enough address space. E.g., there are <256 countries in the world, last I checked. You might need another layer of indirection if you cannot bound the length of per-country data, though.
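
        A minimal sketch of that enum-scale case (SmallDict is a made-up name): the dictionary is only needed in the expanded-string-to-id direction, while the other direction is just indexing a vector with the 8-bit id.

            use std::collections::HashMap;

            #[derive(Default)]
            struct SmallDict {
                by_value: HashMap<String, u8>, // expanded string -> id (the only real dictionary)
                by_id: Vec<String>,            // id -> expanded string (plain indexed lookup)
            }

            impl SmallDict {
                fn intern(&mut self, s: &str) -> u8 {
                    if let Some(&id) = self.by_value.get(s) {
                        return id;
                    }
                    let id = u8::try_from(self.by_id.len()).expect("more than 256 distinct values");
                    self.by_value.insert(s.to_owned(), id);
                    self.by_id.push(s.to_owned());
                    id
                }

                fn resolve(&self, id: u8) -> &str {
                    &self.by_id[id as usize]
                }
            }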

        As mentioned elsewhere in the thread (at least here by nostrademons: https://news.ycombinator.com/item?id=43245421 but probably elsewhere), data compression algorithms just generalize this from 8..64 bits to "'best' bits per token", but they need to measure up front what 'best' might mean. That domain of research/impl may also have its own terminology. I don't know what to do about "nobody talking to each other" either in this eentsy microcosm or more generally. Society can be tough. :-)

  • Diggsey6 days ago
    I have a Rust crate (ijson) which implements string interning and a generally much more efficient in-memory JSON representation. It probably doesn't compete with the author's specialized implementation, but if you just want a plug and play solution it's pretty good.
  • jcgrillo6 days ago
    It's really shocking to see how wasteful a big blob of JSON is. This is such a great illustration of how much room there is for improvement.
    • throitallaway6 days ago
      The thing JSON is best at is human readability. I would say human composability as well, but YAML and others are better for that. As a machine data exchange and storage format, JSON is extremely inefficient.
      • travisjungroth6 days ago
        I really dislike writing YAML. It might be a skill issue (as the kids say), but it just never landed. I'd rather write JSON, even with all its verbosity.
      • jcgrillo6 days ago
        It never ceases to amaze me how good computer people are at solving problems which are not problems (readability of serialized representation) at the expense of multiple orders of magnitude more compute and storage.
        • williamdclt5 days ago
          > problems which are not problems (readability of serialized representation)

          I don't know that I fully agree. A format that is both human- and machine-readable is very useful, for configuration and debuggability for example. It's indeed more expensive in compute and storage, but that's additional work _for machines_ rather than _for humans_ (having to compile/encode/decode data at the human/machine interface).

          Certainly doesn't mean it's always the right trade-off, but it's an appealing one, as the popularity of these formats shows, in particular JSON which isn't even that good for human-readability (eg no comments in standard JSON).

          • vacuity5 days ago
            But that's the thing: if we want to optimize both machine and human experience, we should create a 1:1 dual format, one which is machine-readable and one which is human-readable. A standard tool easily converts one to the other. This should not still be a problem. Yes, we have computers and should offload work onto them. But exactly what that work is is still important. Two approaches aren't equal because they both put the burden on the machine.
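
            As a small illustration of that 1:1 idea (a sketch only; the choice of MessagePack via the rmp-serde crate is mine, not something proposed above): the same value round-trips between a human-readable form and a machine-oriented binary form, with a trivial converter at the boundary.

                use serde_json::Value;

                fn main() -> Result<(), Box<dyn std::error::Error>> {
                    // Human-readable side: what people edit, diff and debug.
                    let human = r#"{"name": "example", "count": 3}"#;
                    let value: Value = serde_json::from_str(human)?;

                    // Machine side: compact binary for storage and exchange.
                    let binary = rmp_serde::to_vec(&value)?;
                    let back: Value = rmp_serde::from_slice(&binary)?;

                    assert_eq!(value, back); // lossless 1:1 round-trip
                    Ok(())
                }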
            • addaon5 days ago
              > But that's the thing: if we want to optimize both machine and human experience, we should create a 1:1 dual format, one which is machine-readable and one which is human-readable. A standard tool easily converts one to the other.

              So… plists and plutil?

  • Someone6 days ago
    A problem with explicit interning is that libraries, when creating objects, cannot make an informed decision about whether to give up some runtime performance in exchange for an expected decrease in memory usage.

    And it is even less than an expectation. Many interning libraries never free the objects they create, so they can keep around lots of objects that are never needed again, and thus increase memory usage.

    I think the ideal API would somehow a) be simple and b) have the program communicate to the system what it desires from it. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long the program plans to retain the data it is creating now.

    Simple examples:

    - if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.

    - in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.

    That’s probably at least a dozen PhDs' worth of research, and likely not attainable, though.
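
    Purely as a thought experiment, the kind of hint-driven API being described might look something like the following; every name here is hypothetical and nothing like it exists in a standard library:

        use std::sync::Arc;

        // Hints the program passes along; all hypothetical.
        struct InternHints {
            memory_budget_bytes: usize,           // how much memory the program is willing to use
            expected_remaining_runtime_secs: u64, // e.g. near zero while shutting down
            expected_retention_secs: u64,         // how long the caller plans to keep this value
        }

        trait HintedInterner {
            /// Interns `s` only if it looks worthwhile under the hints;
            /// otherwise just returns a plain owned copy.
            fn maybe_intern(&mut self, s: &str, hints: &InternHints) -> Arc<str>;
        }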

  • saghm6 days ago
    I did something similar to reduce the memory overhead of our Rust backend at a recent job; I noticed that when dealing with certain requests, we were using quite a lot of memory on duplicates of the same strings when processing things, so I made something I called an "alias cache" type that could be used to obtain an alias of an existing string if needed, and then introduced those to several places in the codebase where it was most helpful. This strategy also allowed some flexibility on how long strings were "cached" for, which let me tweak which layers of processing should share a cache versus creating their own (and letting the old one drop). Our use case definitely didn't have room for interning to use 2000x less memory, but I think in some of the pathological cases it reduced the max memory usage to around 50% of what it was before without significantly affecting runtime, which I was pretty pleased with!
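
    A rough self-contained sketch of what such an "alias cache" could look like (AliasCache is a made-up name, not the actual type from that codebase); because the cache is an ordinary value, each processing layer can share one or build its own and let it drop when the layer finishes:

        use std::collections::HashSet;
        use std::sync::Arc;

        // Hands out a shared alias of a string, allocating each distinct
        // string only the first time it is seen.
        #[derive(Default)]
        struct AliasCache {
            seen: HashSet<Arc<str>>,
        }

        impl AliasCache {
            fn alias(&mut self, s: &str) -> Arc<str> {
                if let Some(existing) = self.seen.get(s) {
                    return Arc::clone(existing);
                }
                let owned: Arc<str> = Arc::from(s);
                self.seen.insert(Arc::clone(&owned));
                owned
            }
        }

        // One processing layer shares a cache; dropping it at the end of the scope
        // releases the cache's own references to the strings.
        fn process_layer(rows: &[String]) -> Vec<Arc<str>> {
            let mut cache = AliasCache::default();
            rows.iter().map(|row| cache.alias(row)).collect()
        }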
  • uhgrippa6 days ago
    Very interesting, I particularly enjoyed the handling of memory reduction with the Interned<T> comparison. How do you go about finding open source projects which expose fascinating data such as this one?
  • 3abiton6 days ago
    Very fun read, Guillaume! I never find the courage to work on my "this seems like an interesting weekend idea" projects due to my fear of the time commitment, but I often find these HN write-ups somehow fulfilling.
  • xelxebar6 days ago
    Man, lately I've been amazed by how many performance tricks like these are just bog standard APL. For example, string interning looks like this:

        strs←∪input_strings    ⍝ Build
        strs⍪←⊂'new string'    ⍝ Append
        strs⍳⊂'some string'    ⍝ Search
        f←strs∘⍳               ⍝ Hash table creation
    
    One extra advantage on top of standard interning is that this also uses an incarnation of tiny pointers[0], automatically saving 2X memory generally. Better yet, standard APL data design will often obviate the need for most string searches, letting you get away with integer comparisons instead.

    I'm starting to really see how data-oriented design tends to make straightforward code both parsimonious and highly performant.

    [0]:https://news.ycombinator.com/item?id=43023634

  • mleonhard4 days ago
    Would it be simpler to use BTreeSet<Arc<Box<str>>>?

        use std::collections::BTreeSet;
        use std::sync::Arc;
        let mut set: BTreeSet<Arc<Box<str>>> = Default::default();
        let s = Arc::new("abc".to_string().into_boxed_str());
        if !set.contains(&s) {
            set.insert(s.clone());
        }
        let s = set.get(&s).unwrap();
  • Horffupolde6 days ago
    What about loading it into ClickHouse?
    • wiredfool6 days ago
      Yeah, I've got some ClickHouse tables that store similar data (GBFS), and I've got an amortized size at the 1 byte/row level.
  • kajecounterhack6 days ago
    Years ago at Google, the permissions system (OG ganpati, for the googlers) was having major performance issues related to OOM in Java. IIRC there was an epic thread where folks discussed potential resolutions including, this being Java, GC tuning. But in the end, string interning was what helped a ton to defer the scaling issues until something else could be done (mostly a rewrite, smh). I mostly remember this as being a shocking solution to me. Part of me imagined there'd be some fancy scalable solution instead of single-binary optimization, but at the end of the day sometimes you just gotta intern some strings.
  • jessekv6 days ago
    If you are curious about how Python's string interning works:

    https://github.com/python/cpython/blob/main/InternalDocs/str...

  • wdb6 days ago
    As it's already in Rust, is the next step leveraging Apache DataFusion?
  • vtuulos6 days ago
    if you want to see similar tricks applied in Python (with a JIT compiler for query-time optimization), take a look at this fun deck that I presented a long time ago: https://tuulos.github.io/sf-python-meetup-sep-2013

    we were able to handle trillion+ datapoints with relatively modest machines - definitely a useful approach if you are ready to do some bit twiddling

    • ciupicri6 days ago
      Just use `s = sys.intern(s)` for every string and be done with it. Or something like:

          _my_intern_dict = {}
          my_intern = lambda x: _my_intern_dict.setdefault(x, x)
          s = my_intern(s)
      
      Just make sure to delete _my_intern_dict when it's not needed anymore.
  • hu36 days ago
    From the looks of it, it seems that interning here means creating references to strings that repeat. Strings are large, so if you can store just a reference to each one, you save space.
  • yencabulator5 days ago
    That's a lot of manual effort for something a better format like Parquet could do automatically (dictionary encoding).
  • Zezima6 days ago
    An amazing example of Rust's abilities. More impressed by the relentless iterative approach taken by the author. Nicely done.
  • 6r176 days ago
    First time I've read about interning, and this is really cool; kudos for the article and the usage. Thanks for sharing!
  • henry7006 days ago
    Amazing to do this on a weekend and to keep enough track of it to write a great article within 3 weeks' time.
  • valand6 days ago
    Fun fact, V8 interns strings! But it doesn't seem to intern big ones.
    • nostrademons6 days ago
      Many VMs do. The JVM was one of the first to do this extensively - all strings are in a "constant pool" as part of the .class format, and then they're referenced by index in instructions. Python does it for strings that appear in source code, and you can force it for runtime data with sys.intern().
  • Aachen6 days ago
    I had never heard of interning before (besides being an intern at a job, perhaps), so I gave the post a read and... isn't this called deduplicating in plain English? I didn't understand what the jargon adds; what makes this special?
    • coder5436 days ago
      https://en.wikipedia.org/wiki/String_interning

      It's an established term in software. I would speculate that the word "intern" is short for "internalizing" here.

      • olddustytrail6 days ago
        I think it's from the Lisp "intern" function, which creates an internal symbol if it doesn't already exist and there's no external symbol of the same name.
      • vacuity5 days ago
        "Intern" (job) -> staying at a workplace for some time to be trained and do some work

        "Intern" (software) -> staying resident in memory to be referred to canonically

        I find that often words do come about with a surprising amount of sense.

  • lukeweston12346 days ago
    Great post, one of my favorite reads in the last few weeks.
  • est5 days ago
    Do column stores do this automatically?