272 points by todsacerdoti 6 days ago | 27 comments
  • c-linkage6 days ago
    Interning strings saves a ton of space. I wish more programmers would use it.

    Back in the 32-bit days I was working on a large (multi GB) distributed Oracle database system but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.

    Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

    The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.

    I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.

    Since each string was only stored once, and every field was a pointer to a string, if two database records had the same values then they must by definition point to the same strings. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other being the record from database B.

    Not only was the system incredibly fast, but it never took more than 150MB of memory to run.
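
    A rough sketch of the same idea, in Rust rather than the original C (the Interner and records_equal names here are made up for illustration): every field is interned once, a record is just an array of pointers to the canonical strings, and record comparison becomes a pointer-by-pointer check.

        use std::collections::HashSet;
        use std::sync::Arc;

        // Toy interner: stores each distinct string once and hands out
        // shared pointers to the canonical copy.
        #[derive(Default)]
        struct Interner {
            strings: HashSet<Arc<str>>,
        }

        impl Interner {
            fn intern(&mut self, s: &str) -> Arc<str> {
                if let Some(existing) = self.strings.get(s) {
                    return Arc::clone(existing);
                }
                let owned: Arc<str> = Arc::from(s);
                self.strings.insert(Arc::clone(&owned));
                owned
            }
        }

        // A database record is just its interned fields, in column order.
        type Record = Vec<Arc<str>>;

        // Equal fields point at the same allocation, so comparing two records is a
        // pointer-by-pointer comparison -- the memcmp() of the story above.
        fn records_equal(a: &Record, b: &Record) -> bool {
            a.len() == b.len() && a.iter().zip(b).all(|(x, y)| Arc::ptr_eq(x, y))
        }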

    • gopalv6 days ago
      This is usually the real reason interning gets used: to avoid long string comparisons, rather than to save memory as such.

      Interned strings tend not to have a good cleanup mechanism in systems that churn through a lot of them, so they often end up actually using more memory as data patterns evolve.

      I use the same trick when parsing JSON, where a large set of rows tends to have the same keys repeated, and the conversion to columnar is easier if the keys are interned.

      • o11c6 days ago
        If your language supports good strong/weak references and containers thereof, cleaning up dead interned strings isn't hard. I'm not aware of any language that provides this out-of-the-box, unfortunately.

        Why do so many languages make weak references such second-class citizens? Why do all containers suck so much that you have to implement your own (and that's hoping the language is actually efficient enough to let you)?

        • rscho6 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          Many lisps. Racket, for example.

        • ayuhito6 days ago
          Go recently added new weak pointers and a string interning package to its standard library, which is an interesting read.

          [0] https://go.dev/blog/unique

          • ignoramous6 days ago
            > string interning package to its standard library

            TFA literally says interning isn't there in Go yet.

              While the unique package is useful, Make is admittedly not quite like Intern for strings, since the Handle[T] is required to keep a string from being deleted from the internal map. This means you need to modify your code to retain handles as well as strings.
            
            > an interesting read

            Tailscale's attempt at implementing "unique" was quite... something: https://tailscale.com/blog/netaddr-new-ip-type-for-go

            • ayuhito5 days ago
              The Tailscale article taught me a lot! Thanks for sharing.
        • igouy5 days ago
          WeakArray? Ephemerons?

          1997 "Ephemerons: A New Finalization Mechanism"

          https://dl.acm.org/doi/pdf/10.1145/263698.263733

          "Linked weak reference arrays: A hybrid approach to efficient bulk finalization"

          https://www.sciencedirect.com/science/article/pii/S016764232...

        • wongarsu6 days ago
          > I'm not aware of any language that provides this out-of-the-box, unfortunately

          The currently most prominent example would be Rust. Rc<T> is a simple generic container that implements reference counting, and any instance of it can be downgraded to a Weak<T>. Or Arc<T> and std::sync::Weak<T> if it needs to be thread safe.
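
          A minimal single-threaded sketch of an intern table built from those pieces (WeakInterner and purge are made-up names, not a standard API): the table only holds Weak references, so an interned string can be freed once the last outside Rc is dropped.

              use std::collections::HashMap;
              use std::rc::{Rc, Weak};

              #[derive(Default)]
              struct WeakInterner {
                  map: HashMap<String, Weak<str>>,
              }

              impl WeakInterner {
                  fn intern(&mut self, s: &str) -> Rc<str> {
                      if let Some(alive) = self.map.get(s).and_then(Weak::upgrade) {
                          return alive; // still referenced somewhere: reuse it
                      }
                      let strong: Rc<str> = Rc::from(s);
                      self.map.insert(s.to_owned(), Rc::downgrade(&strong));
                      strong
                  }

                  // The map holds only weak references, so entries die when the last
                  // outside Rc is dropped; this sweeps the dead keys out of the map.
                  fn purge(&mut self) {
                      self.map.retain(|_, w| w.strong_count() > 0);
                  }
              }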

          • o11c6 days ago
            I've done it in C++, so Rust is probably capable of it if you add enough layers of Rc, RefCell, and whatever else it requires to fit into its restricted worldview.

            Does Rust actually have container implementations that do all of the following:

            * When walking the container (either iterating or looking up, even through a non-mutable reference), call a user-provided predicate (not just builtin "weak" like many languages have via weakset/weakkeymap/weakvaluemap) to detect if a node should be considered "dead", and if so transparently remove the node. [In my experience this is relatively easy to add when you're implementing the container algorithms yourself, though I've never done it for bulk algorithms yet.]

            * When looking up a key (which may have different type or identity), the lookup returns the actual key the container had. [This may be impossible for container implementations that split the key.]

            • whytevuhuni6 days ago
              Mutating a container through a shared reference means the container either has to be single-threaded (not marked as Sync/Send), or be thread-safe.

              The single-threaded ones are easy to make, but Rust will prevent you from sending them to another thread, which is probably something you want.

              For thread-safe things, look into the crossbeam crate, it has really good collections.

              One I worked with was dashmap, which has a .retain() method [1] that works over a shared map reference, but runs a closure which gets each key and mutable access to each value, and decides whether to keep the pair or not.

              Its .get() [2] uses equality (so you can use a different object), but returns a reference to the original key-value pair. The .get_mut() will return it as mutable, but inside a guard that keeps the item locked until it goes out of scope.

              [1] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...

              [2] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...
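
              A rough sketch of interning on top of dashmap along those lines (just an illustration of the API described above, not code from the article or from dashmap's docs): intern() hands back the canonical Arc<str>, and a periodic retain() sweep drops entries that nobody outside the map still holds.

                  use dashmap::DashMap;
                  use std::sync::Arc;

                  struct ConcurrentInterner {
                      map: DashMap<Arc<str>, ()>,
                  }

                  impl ConcurrentInterner {
                      fn new() -> Self {
                          Self { map: DashMap::new() }
                      }

                      fn intern(&self, s: &str) -> Arc<str> {
                          if let Some(entry) = self.map.get(s) {
                              return Arc::clone(entry.key()); // hand back the canonical copy
                          }
                          // Note: get-then-insert isn't atomic; a real version would use entry().
                          let owned: Arc<str> = Arc::from(s);
                          self.map.insert(Arc::clone(&owned), ());
                          owned
                      }

                      // Batched O(n) sweep: keep only entries that still have an owner
                      // besides the map itself.
                      fn purge(&self) {
                          self.map.retain(|k, _| Arc::strong_count(k) > 1);
                      }
                  }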

              • o11c6 days ago
                `.retain()` unfortunately isn't what I'm talking about, since it's at least O(n) per call. It might be better if you can arrange to call it just once after a batch of operations, but that isn't always the case.

                "Delete as you go" usually† adds no space/time complexity to the operations you're already doing (though it does affect the constant factor). It does mean you're giving up on predictable destructor calls, but generally the "heavy" destructor was called by whoever made it expire (e.g. the death of the last remaining shared owner if that's what expiry is based on).

                † If many expirations happen at once, this can easily drop to merely amortized time. It's also possible for a container to offer, say, O(1) iteration via dedicated pointers but have a delete operation that's more complicated.

            • duped6 days ago
              To the first question, not really, and if it did it would be pretty fragile because of mutability requirements. It's fragile in C++ too because of iterator invalidation, Rust mostly turns that into a compiler error.

              To the second question, yes, it's super common.

              • o11c6 days ago
                I didn't have any problem with iterator/pointer invalidation (easy to merge into the same thing), since holding an active iterator inhibited expiration. It's possible to imagine an expiration system that doesn't automatically guarantee this but I didn't have one; the only non-weak-based expiry I had was based on timers, and it's sanest to only count time as elapsing at the heart of the event loop.
        • jdougan6 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          Smalltalk has this. Class "Symbol".

        • crabbone5 days ago
          > not aware of any language that provides this out-of-the-box, unfortunately.

          ActionScript 3 had it :D

      • kazinator6 days ago
        In Lisp, interning is not only used for saving on string comparisons. It's the basis of the symbol abstraction. Or perhaps not the basis, but interning is the only way symbols can correspond to a printed representation. Without interning, two occurrences of A in source text would not read as the same object.

        A symbol being an object with a durable identity is important because it can have properties other than just the string which gives it a name.

    • kccqzy6 days ago
      Here's another perspective: I was once asked to improve a custom data caching system that took too much memory even though it had string interning. (At the time, eviction had to be based on factors other than memory used.) String interning certainly helped with memory use but it wasn't enough. Eventually my solution was to compress each record of the dataset in memory. I found that this saved more memory than interning individual strings within each record. At the time I picked Google's snappy compression algorithm since it was already a dependency of the project, but these days I might have picked zstd with a negative compression level or lz4.

      This just goes to show that if your primary goal is to save memory, you should consider storing data compressed and decompressing on the fly when it's used. Modern compression algorithms are good and fast on moderately sized textual data. You might be surprised to learn how fast decompression is. There are of course other benefits to interning, like using pointer equality as string equality, but these weren't a factor in my project.
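
      A minimal sketch of that compress-each-record idea, assuming the zstd crate and records that are already serialized to bytes (CompressedRecord is a made-up name; the original system used snappy):

          use std::io;

          struct CompressedRecord {
              bytes: Vec<u8>, // zstd-compressed serialized record
          }

          impl CompressedRecord {
              fn new(serialized: &[u8]) -> io::Result<Self> {
                  // Negative levels are zstd's "fast" levels: worse ratio, higher speed.
                  Ok(Self { bytes: zstd::encode_all(serialized, -1)? })
              }

              // Decompress on the fly whenever the record is actually used.
              fn get(&self) -> io::Result<Vec<u8>> {
                  zstd::decode_all(self.bytes.as_slice())
              }
          }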

      • zerd6 days ago
        We did a very similar thing in a previous project. Used protostuff to serialize each object, and snappy at the time to compress it. And a small proxy class to access it that decompressed on demand. Costs a few microseconds on read, but saved 4x the space (20GiB -> 5GiB IIRC). This was after interning etc. I'm sure you can save a lot more space if you compress batches of objects and store it column oriented, but will take a bit longer to decompress individual objects.
      • nostrademons6 days ago
        Many compression algorithms are effectively string interning generalized to arbitrary binary data: they adaptively pick the most-repeated substrings and assign them the smallest bit representations. That's why formats like XML and JSON compress so well: all those repeated string keys get stored once and then become sub-byte entries in a lookup table.
        • kccqzy6 days ago
          Good point! And since compression can use sub-byte entries in its lookup table, it's no wonder that a simplistic string interning scheme using pointer-sized entries might not reduce memory as effectively.
    • viraptor6 days ago
      So, a fun story about how interning numbers can go wrong: when compiling Apple's UI designs from the XML-based xib format to a binary nib, their compiler uses lots of interning. It's pretty cool for avoiding 20 copies of an empty string, for example. But then they messed up and did the same thing with numbers while ignoring the type... which means that if you have a value 5.0 as a size somewhere, then your sequentially assigned ids will be: 1, 2, 3, 4, 5.0, 6, ...
    • MrLeap6 days ago
      > Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

      Floating point weirdsies between systems are a commonly encountered quirk in multiplayer gamedev. It's the source of many recommendations that you do physics in fixed point.

      • nyrikki6 days ago
        In Oracle, a FLOAT by default has binary precision 126 and is stored in up to 22 bytes.

        Probably something to do with ANSI history.

        It is even more painful than IEEE 754 ambiguity.

    • hobs5 days ago
      That's funny because I just have the database hash the rows and move on with my life, but that might have been a PITA back then.
    • kazinator6 days ago
      > I wish more programmers would use it.

      For instance, people writing blogs to explain Lisp, who then present some Javascript crud in which symbols are just character strings.

    • galkk5 days ago
      String.Intern has existed since .NET 1.1. If that was the only problem, then one could've, you know, just read the docs?

      https://learn.microsoft.com/en-us/dotnet/api/system.string.i...

  • trevor-e6 days ago
    Great write-up!

    Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.

    • SnowflakeOnIce6 days ago
      "String interning" is discussed in the Java Language Spec, going back to the 1990s, and wasn't novel there. It goes back at least to Lisp implementations from the 1970s. Probably even earlier than that.
    • umanwizard6 days ago
      Interning is a very common name. "String lookup table" would be too broad, it would mean anything that associates a string with a key.
    • mannyv6 days ago
      So interning is really just normalization/dictionarying your strings.
      • cb3216 days ago
        "pointerizing" might be another name. It would be utterly unsurprising if the database/sql world had more than one other name as well. Usually, ideas that go back to the 1950s at the dawn of data representation have many names.

        Maybe worth noting, you only need the dictionary in one direction (converting an expanded string to a pointer by looking it up). In the other direction, you just need to indirect the pointer and most people don't call an address space indexed by integers a "dictionary" (though, I guess the OG Javascript did merge these two ideas).

        Also notable, how wide a pointer you need (8-bit..64-bit) and how often you do each direction just depends upon workload. This one-direction kind of matters since sometimes your data set is so static that you can just convert to pointers up-front. Then your data analysis doesn't even need the expanded->pointer dictionary. With larger pointers (like 16..64-bit offsets into a file), you just need the file (and whatever data &| files reference it via the pointers).

        Personally, I've run into many "enum-scale" scenarios where 8-bits was enough address space. E.g., there are <256 countries in the world, last I checked. You might need another layer of indirection if you cannot bound the length of per-country data, though.
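
        A minimal sketch of that enum-scale case (SmallDict is a made-up name): the dictionary is only needed in the expanded-string-to-id direction, while the other direction is just indexing a vector with the 8-bit id.

            use std::collections::HashMap;

            #[derive(Default)]
            struct SmallDict {
                by_value: HashMap<String, u8>, // expanded string -> id (the only real dictionary)
                by_id: Vec<String>,            // id -> expanded string (plain indexed lookup)
            }

            impl SmallDict {
                fn intern(&mut self, s: &str) -> u8 {
                    if let Some(&id) = self.by_value.get(s) {
                        return id;
                    }
                    let id = u8::try_from(self.by_id.len()).expect("more than 256 distinct values");
                    self.by_value.insert(s.to_owned(), id);
                    self.by_id.push(s.to_owned());
                    id
                }

                fn resolve(&self, id: u8) -> &str {
                    &self.by_id[id as usize]
                }
            }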

        As mentioned elsewhere in the thread (at least here by nostrademons: https://news.ycombinator.com/item?id=43245421 but probably elsewhere), data compression algorithms just generalize this from 8..64 bits to "'best' bits per token", but they need to measure up front what 'best' might mean. That domain of research/impl may also have its own terminology. I don't know what to do about "nobody talking to each other" either in this eentsy microcosm or more generally. Society can be tough. :-)

  • Diggsey6 days ago
    I have a Rust crate (ijson) which implements string interning and a generally much more efficient in-memory JSON representation. It probably doesn't compete with the author's specialized implementation, but if you just want a plug and play solution it's pretty good.
  • jcgrillo6 days ago
    It's really shocking to see how wasteful a big blob of JSON is. This is such a great illustration of how much room there is for improvement.
    • throitallaway6 days ago
      The thing JSON is best at is human readability. I would say human composability as well, but YAML and others are better for that. As a machine data exchange and storage format, JSON is extremely inefficient.
      • travisjungroth6 days ago
        I really dislike writing YAML. It might be a skill issue (as the kids say), but it just never landed. I'd rather write JSON, even with all its verbosity.
      • jcgrillo6 days ago
        It never ceases to amaze me how good computer people are at solving problems which are not problems (readability of serialized representation) at the expense of multiple orders of magnitude more compute and storage.
        • williamdclt5 days ago
          > problems which are not problems (readability of serialized representation)

          I don't know that I fully agree. A format that is both human- and machine-readable is very useful, for configuration and debuggability for example. It's indeed more expensive in compute and storage, but that's additional work _for machines_ rather than _for humans_ (having to compile/encode/decode data at the human/machine interface).

          Certainly doesn't mean it's always the right trade-off, but it's an appealing one, as the popularity of these formats shows, in particular JSON which isn't even that good for human-readability (eg no comments in standard JSON).

          • vacuity5 days ago
            But that's the thing: if we want to optimize both machine and human experience, we should create a 1:1 dual format, one which is machine-readable and one which is human-readable. A standard tool easily converts one to the other. This should not still be a problem. Yes, we have computers and should offload work onto them. But exactly what that work is is still important. Two approaches aren't equal because they both put the burden on the machine.
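
            As a small illustration of that 1:1 idea (a sketch only; the choice of MessagePack via the rmp-serde crate is mine, not something proposed above): the same value round-trips between a human-readable form and a machine-oriented binary form, with a trivial converter at the boundary.

                use serde_json::Value;

                fn main() -> Result<(), Box<dyn std::error::Error>> {
                    // Human-readable side: what people edit, diff and debug.
                    let human = r#"{"name": "example", "count": 3}"#;
                    let value: Value = serde_json::from_str(human)?;

                    // Machine side: compact binary for storage and exchange.
                    let binary = rmp_serde::to_vec(&value)?;
                    let back: Value = rmp_serde::from_slice(&binary)?;

                    assert_eq!(value, back); // lossless 1:1 round-trip
                    Ok(())
                }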
            • addaon5 days ago
              > But that's the thing: if we want to optimize both machine and human experience, we should create a 1:1 dual format, one which is machine-readable and one which is human-readable. A standard tool easily converts one to the other.

              So… plists and plutil?

  • Someone6 days ago
    A problem with explicit interning is that libraries, when creating objects, cannot make an informed decision about whether to give up some runtime performance in exchange for an expected decrease in memory usage.

    And it is even less than an expectation. Many interning libraries never free the objects they create, so they can keep around lots of objects that are never needed again, and thus increase memory usage.

    I think the ideal API would somehow a) be simple and b) have the program communicate to the system what it desires from it. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long the program plans to retain the data it is creating now.

    Simple examples:

    - if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.

    - in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.

    That’s probably at least a dozen PhDs' worth of research, and likely not attainable, though.
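
    Purely as a thought experiment, the kind of hint-driven API being described might look something like the following; every name here is hypothetical and nothing like it exists in a standard library:

        use std::sync::Arc;

        // Hints the program passes along; all hypothetical.
        struct InternHints {
            memory_budget_bytes: usize,           // how much memory the program is willing to use
            expected_remaining_runtime_secs: u64, // e.g. near zero while shutting down
            expected_retention_secs: u64,         // how long the caller plans to keep this value
        }

        trait HintedInterner {
            /// Interns `s` only if it looks worthwhile under the hints;
            /// otherwise just returns a plain owned copy.
            fn maybe_intern(&mut self, s: &str, hints: &InternHints) -> Arc<str>;
        }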

  • saghm6 days ago
    I did something similar to reduce the memory overhead of our Rust backend at a recent job; I noticed that when dealing with certain requests, we were using quite a lot of memory on duplicates of the same strings when processing things, so I made something I called an "alias cache" type that could be used to obtain an alias of an existing string if needed, and then introduced those to several places in the codebase where it was most helpful. This strategy also allowed some flexibility on how long strings were "cached" for, which let me tweak which layers of processing should share a cache versus creating their own (and letting the old one drop). Our use case definitely didn't have room for interning to use 2000x less memory, but I think in some of the pathological cases it reduced the max memory usage to around 50% of what it was before without significantly affecting runtime, which I was pretty pleased with!
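
    A rough self-contained sketch of what such an "alias cache" could look like (AliasCache is a made-up name, not the actual type from that codebase); because the cache is an ordinary value, each processing layer can share one or build its own and let it drop when the layer finishes:

        use std::collections::HashSet;
        use std::sync::Arc;

        // Hands out a shared alias of a string, allocating each distinct
        // string only the first time it is seen.
        #[derive(Default)]
        struct AliasCache {
            seen: HashSet<Arc<str>>,
        }

        impl AliasCache {
            fn alias(&mut self, s: &str) -> Arc<str> {
                if let Some(existing) = self.seen.get(s) {
                    return Arc::clone(existing);
                }
                let owned: Arc<str> = Arc::from(s);
                self.seen.insert(Arc::clone(&owned));
                owned
            }
        }

        // One processing layer shares a cache; dropping it at the end of the scope
        // releases the cache's own references to the strings.
        fn process_layer(rows: &[String]) -> Vec<Arc<str>> {
            let mut cache = AliasCache::default();
            rows.iter().map(|row| cache.alias(row)).collect()
        }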
  • uhgrippa6 days ago
    Very interesting, I particularly enjoyed the handling of memory reduction with the Interned<T> comparison. How do you go about finding open source projects which expose fascinating data such as this one?
  • 3abiton6 days ago
    Very fun read, Guillaume! I never find the courage to work on my "this seems like an interesting weekend idea" projects due to my fear of the time commitment, but I often find these HN write-ups somehow fulfilling.
  • xelxebar6 days ago
    Man, lately I've been amazed by how many performance tricks like these are just bog standard APL. For example, string interning looks like this:

        strs←∪input_strings    ⍝ Build
        strs⍪←⊂'new string'    ⍝ Append
        strs⍳⊂'some string'    ⍝ Search
        f←strs∘⍳               ⍝ Hash table creation
    
    One extra advantage on top of standard interning is that this also uses an incarnation of tiny pointers[0], automatically saving 2X memory generally. Better yet, standard APL data design will often obviate the need for most string searches, letting you get away with integer comparisons instead.

    I'm starting to really see how data-oriented design tends to make straightforward code both parsimonious and highly performant.

    [0]:https://news.ycombinator.com/item?id=43023634

  • mleonhard4 days ago
    Would it be simpler to use BTreeSet<Arc<Box<str>>>?

        use std::collections::BTreeSet;
        use std::sync::Arc;
        let mut set: BTreeSet<Arc<Box<str>>> = Default::default();
        let s = Arc::new("abc".to_string().into_boxed_str());
        if !set.contains(&s) {
            set.insert(s.clone());
        }
        let s = set.get(&s).unwrap();
  • Horffupolde6 days ago
    What about loading it into ClickHouse?
    • wiredfool6 days ago
      Yeah, I've got some ClickHouse tables that store similar data (GBFS), and I've got an amortized size at the 1 byte/row level.
  • kajecounterhack6 days ago
    Years ago at Google, the permissions system (OG ganpati, for the googlers) was having major performance issues related to OOM in Java. IIRC there was an epic thread where folks discussed potential resolutions including, this being Java, GC tuning. But in the end, string interning was what helped a ton to defer the scaling issues until something else could be done (mostly a rewrite, smh). I mostly remember this as being a shocking solution to me. Part of me imagined there'd be some fancy scalable solution instead of single-binary optimization, but at the end of the day sometimes you just gotta intern some strings.
  • jessekv6 days ago
    If you are curious about how Python's string interning works:

    https://github.com/python/cpython/blob/main/InternalDocs/str...

  • wdb6 days ago
    As it's already in Rust, is the next step leveraging Apache DataFusion?
  • vtuulos6 days ago
    if you want to see similar tricks applied in Python (with a JIT compiler for query-time optimization), take a look at this fun deck that I presented a long time ago: https://tuulos.github.io/sf-python-meetup-sep-2013

    we were able to handle trillion+ datapoints with relatively modest machines - definitely a useful approach if you are ready to do some bit twiddling

    • ciupicri6 days ago
      Just use `s = sys.intern(s)` for every string and be done with it. Or something like:

          _my_intern_dict = {}
          my_intern = lambda x: _my_intern_dict.setdefault(x, x)
          s = my_intern(s)
      
      Just make sure to delete _my_intern_dict when it's not needed anymore.
  • hu36 days ago
    From the looks of it, it seems that interning here means creating references to strings that repeat. Strings are large, so if you can store just a reference to each one, you save space.
  • yencabulator5 days ago
    That's a lot of manual effort for something a better format like Parquet could do automatically (dictionary encoding).
  • Zezima6 days ago
    An amazing example of Rust's abilities. More impressed by the relentless iterative approach taken by the author. Nicely done.
  • 6r176 days ago
    First time I've read about interning, and this is really cool; kudos for the article and the usage. Thanks for sharing!
  • henry7006 days ago
    Amazing to do this on a weekend and to keep enough track of it to write a great article within 3 weeks' time.
  • valand6 days ago
    Fun fact, V8 interns strings! But it doesn't seem to intern big ones.
    • nostrademons6 days ago
      Many VMs do. The JVM was one of the first to do this extensively - all strings are in a "constant pool" as part of the .class format, and then they're referenced by index in instructions. Python does it for strings that appear in source code, and you can force it for runtime data with sys.intern().
  • Aachen6 days ago
    I had never heard of interning before (besides being an intern at a job, perhaps), so I gave the post a read and... isn't this called deduplicating in plain English? I didn't understand what the jargon adds; what makes this special?
    • coder5436 days ago
      https://en.wikipedia.org/wiki/String_interning

      It's an established term in software. I would speculate that the word "intern" is short for "internalizing" here.

      • olddustytrail6 days ago
        I think it's from the Lisp "intern" function, which creates an internal symbol if it doesn't already exist and there's no external symbol of the same name.
      • vacuity5 days ago
        "Intern" (job) -> staying at a workplace for some time to be trained and do some work

        "Intern" (software) -> staying resident in memory to be referred to canonically

        I find that often words do come about with a surprising amount of sense.

  • lukeweston12346 days ago
    Great post, one of my favorite reads in the last few weeks.
  • est5 days ago
    Do column stores do this automatically?