Back in the 32-bit days I was working on a large (multi-GB) distributed Oracle database system, but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.
Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.
The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.
I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.
Since each string was only recorded once, and every field was a pointer to a string, if two database records had the same values then they must by definition point to the same strings. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other the corresponding record from database B.
Not only was the system incredibly fast, but it never took more than 150MB of memory to run.
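For anyone who wants to see the shape of it: a rough sketch of the same idea (an interner plus records as arrays of pointers to interned strings, compared by identity). The original was C with a hash table and a bump allocator; this is just a Rust approximation with made-up names:

use std::collections::HashSet;
use std::rc::Rc;

// Interner: every distinct string is stored exactly once.
#[derive(Default)]
struct Interner {
    pool: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.pool.get(s) {
            return existing.clone();
        }
        let interned: Rc<str> = Rc::from(s);
        self.pool.insert(interned.clone());
        interned
    }
}

// A record is just an array of pointers to interned strings, so two
// records are equal iff their pointers are equal, field by field.
fn records_equal(a: &[Rc<str>], b: &[Rc<str>]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| Rc::ptr_eq(x, y))
}

fn main() {
    let mut interner = Interner::default();
    let row_a: Vec<Rc<str>> = ["42", "3.14", "hello"].iter().map(|f| interner.intern(f)).collect();
    let row_b: Vec<Rc<str>> = ["42", "3.14", "hello"].iter().map(|f| interner.intern(f)).collect();
    assert!(records_equal(&row_a, &row_b)); // identity comparison, no string compares
}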
Interned strings tend not to have a good cleanup mechanism, so in a system that churns through a lot of them they can actually end up using more memory as data patterns evolve.
I use the same trick when parsing JSON, where a large set of rows tends to have the same keys repeated, and the conversion to columnar form is easier if the keys are interned.
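A minimal sketch of that JSON-keys-to-columnar case (assuming the rows are already parsed into key/value string pairs; everything here is illustrative, not any particular library):

use std::collections::HashMap;
use std::rc::Rc;

// Row-oriented input: every row repeats the same small set of keys.
fn to_columnar(rows: &[Vec<(String, String)>]) -> HashMap<Rc<str>, Vec<String>> {
    let mut interned_keys: HashMap<String, Rc<str>> = HashMap::new();
    let mut columns: HashMap<Rc<str>, Vec<String>> = HashMap::new();
    for row in rows {
        for (key, value) in row {
            // Intern the key once; every later row reuses the same Rc.
            let key = interned_keys
                .entry(key.clone())
                .or_insert_with(|| Rc::from(key.as_str()))
                .clone();
            columns.entry(key).or_default().push(value.clone());
        }
    }
    columns
}

fn main() {
    let rows = vec![
        vec![("id".to_string(), "1".to_string()), ("name".to_string(), "a".to_string())],
        vec![("id".to_string(), "2".to_string()), ("name".to_string(), "b".to_string())],
    ];
    let columns = to_columnar(&rows);
    assert_eq!(columns["id"], vec!["1".to_string(), "2".to_string()]);
}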
Why do so many languages make weak references such second-class citizens? Why do all containers suck so much that you have to implement your own (and that's hoping the language is actually efficient enough to let you)?
Many lisps. Racket, for example.
TFA literally says interning isn't there in Go yet.
While the unique package is useful, Make is admittedly not quite like Intern for strings, since the Handle[T] is required to keep a string from being deleted from the internal map. This means you need to modify your code to retain handles as well as strings.
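A hand-wavy Rust sketch of the same shape, i.e. the canonical copy lives only as long as some handle does (this is not how Go's unique package is implemented, just the idea, with made-up names):

use std::collections::HashMap;
use std::rc::{Rc, Weak};

#[derive(Default)]
struct Uniquer {
    map: HashMap<String, Weak<str>>,
}

impl Uniquer {
    // Like Make: returns a handle, and the canonical copy stays alive
    // only while at least one handle is still held somewhere.
    fn make(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.map.get(s).and_then(Weak::upgrade) {
            return existing;
        }
        let handle: Rc<str> = Rc::from(s);
        self.map.insert(s.to_string(), Rc::downgrade(&handle));
        handle
    }

    // Drop map entries whose handles have all gone away.
    fn sweep(&mut self) {
        self.map.retain(|_, weak| weak.upgrade().is_some());
    }
}

fn main() {
    let mut u = Uniquer::default();
    let a = u.make("hello");
    let b = u.make("hello");
    assert!(Rc::ptr_eq(&a, &b)); // both handles point at the one canonical copy
    drop((a, b));
    u.sweep(); // "hello" can now be evicted from the map
}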
> an interesting read

Tailscale's attempt at implementing "unique" was quite... something: https://tailscale.com/blog/netaddr-new-ip-type-for-go
1997 "Ephemerons: A New Finalization Mechanism"
https://dl.acm.org/doi/pdf/10.1145/263698.263733
"Linked weak reference arrays: A hybrid approach to efficient bulk finalization"
https://www.sciencedirect.com/science/article/pii/S016764232...
The currently most prominent example would be Rust. Rc<T> is a simple generic container that implements reference counting, and any instance of it can be downgraded to a Weak<T>. Or Arc<T> and std::sync::Weak<T> if it needs to be thread safe.
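A minimal example of that, std only:

use std::rc::{Rc, Weak};

fn main() {
    let strong: Rc<String> = Rc::new("interned".to_string());
    let weak: Weak<String> = Rc::downgrade(&strong);

    // While a strong reference exists, upgrade() hands back a new Rc.
    assert!(weak.upgrade().is_some());

    drop(strong);

    // Once the last Rc is gone, the value is freed and upgrade() returns None.
    assert!(weak.upgrade().is_none());
}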
Does Rust actually have container implementations that do all of the following:
* When walking the container (either iterating or looking up, even through a non-mutable reference), call a user-provided predicate (not just builtin "weak" like many languages have via weakset/weakkeymap/weakvaluemap) to detect if a node should be considered "dead", and if so transparently remove the node. [In my experience this is relatively easy to add when you're implementing the container algorithms yourself, though I've never done it for bulk algorithms yet.]
* When looking up a key (which may have different type or identity), the lookup returns the actual key the container had. [This may be impossible for container implementations that split the key.]
The single-threaded ones are easy to make, but Rust will prevent you from sending them to another thread, which is probably something you want.
For thread-safe things, look into the crossbeam crate, it has really good collections.
One I worked with was DashMap, which has a .retain() method [1] that works over a shared map reference: it runs a closure that sees each key, gets mutable access to each value, and decides whether to keep the pair or not.

Its .get() [2] uses equality (so you can look up with a different object), but returns a reference to the original key-value pair. The .get_mut() will return it as mutable, but inside a guard that keeps the item locked until it goes out of scope. (Rough sketch after the links.)
[1] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...
[2] https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html#m...
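The sketch mentioned above, roughly (using the dashmap crate; double-check the exact signatures against the docs linked in [1] and [2]):

use dashmap::DashMap;

fn main() {
    let cache: DashMap<String, u64> = DashMap::new();
    cache.insert("alive".to_string(), 1);
    cache.insert("dead".to_string(), 0);

    // retain() takes &self, so it works through a shared reference;
    // the closure sees each key and value and decides what survives.
    cache.retain(|_key, value| *value > 0);

    // get() looks up by equality but hands back a guard that references
    // the entry actually stored in the map; the shard stays locked until
    // the guard goes out of scope.
    if let Some(entry) = cache.get("alive") {
        println!("{} -> {}", entry.key(), entry.value());
    }
}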
"Delete as you go" usually† adds no space/time complexity to the operations you're already doing (though it does affect the constant factor). It does mean you're giving up on predictable destructor calls, but generally the "heavy" destructor was called by whoever made it expire (e.g. the death of the last remaining shared owner if that's what expiry is based on).
† If many expirations happen at once, this can easily drop to merely amortized time. It's also possible for a container to offer, say, O(1) iteration via dedicated pointers but have a delete operation that's more complicated.
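To make the "delete as you go" idea concrete: a weak-valued lookup that piggy-backs the cleanup on a read it was doing anyway (std Rust sketch, names made up):

use std::collections::HashMap;
use std::rc::{Rc, Weak};

// If the entry we were about to return has expired, remove it as part
// of the lookup we were already paying for.
fn lookup(map: &mut HashMap<String, Weak<String>>, key: &str) -> Option<Rc<String>> {
    match map.get(key).and_then(Weak::upgrade) {
        Some(value) => Some(value),
        None => {
            map.remove(key);
            None
        }
    }
}

fn main() {
    let mut map: HashMap<String, Weak<String>> = HashMap::new();
    let value = Rc::new("v".to_string());
    map.insert("k".to_string(), Rc::downgrade(&value));
    drop(value); // expire it
    assert!(lookup(&mut map, "k").is_none());
    assert!(map.is_empty()); // the dead entry was cleaned up by the lookup itself
}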
To the second question, yes, it's super common.
Smalltalk has this. Class "Symbol".
ActionScript 3 had it :D
A symbol being an object with a durable identity is important because it can have properties other than just the string which gives it a name.
This just goes to show that if your primary goal is to save memory, you should consider storing it compressed and then decompress on the fly when used. Modern compression algorithms are good and fast on moderately sized textual data. You might be surprised to learn how fast decompression is. There are of course other benefits to interning like using pointer equality as string equality, but these aren't a factor in my project.
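For anyone who wants to try that trade-off, a minimal sketch of "keep it compressed, decompress on use" (here with the flate2 crate; the crate choice and data are purely illustrative):

use flate2::read::GzDecoder;
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let text = "some moderately sized, fairly repetitive textual data ".repeat(1000);

    // Keep only the compressed bytes around instead of the raw string.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(text.as_bytes())?;
    let compressed = encoder.finish()?;

    // Decompress on the fly only when the data is actually needed.
    let mut decompressed = String::new();
    GzDecoder::new(&compressed[..]).read_to_string(&mut decompressed)?;

    assert_eq!(text, decompressed);
    println!("{} raw bytes vs {} compressed", text.len(), compressed.len());
    Ok(())
}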
Floating point weirdsies between systems is a commonly encountered quirk in multiplayer gamedev. It's the source of many recommendations that you do physics in fixed point.
Probably something to do with ANSI history.
It is even more painful than IEEE 754 ambiguity.
For instance, people writing blogs to explain Lisp, who then present some Javascript crud in which symbols are just character strings.
https://learn.microsoft.com/en-us/dotnet/api/system.string.i...
Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.
Maybe worth noting, you only need the dictionary in one direction (converting an expanded string to a pointer by looking it up). In the other direction, you just need to indirect the pointer and most people don't call an address space indexed by integers a "dictionary" (though, I guess the OG Javascript did merge these two ideas).
Also notable, how wide a pointer you need (8-bit..64-bit) and how often you do each direction just depends upon workload. This one-direction kind of matters since sometimes your data set is so static that you can just convert to pointers up-front. Then your data analysis doesn't even need the expanded->pointer dictionary. With larger pointers (like 16..64-bit offsets into a file), you just need the file (and whatever data &| files reference it via the pointers).
Personally, I've run into many "enum-scale" scenarios where 8-bits was enough address space. E.g., there are <256 countries in the world, last I checked. You might need another layer of indirection if you cannot bound the length of per-country data, though.
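A toy version of that enum-scale case (country codes and records made up):

fn main() {
    // The full table of distinct strings; at "enum scale" there are < 256 of them.
    let countries: Vec<&str> = vec!["US", "DE", "JP", "BR"]; // ...rest elided

    // Each record stores a 1-byte index instead of the string itself.
    let records: Vec<u8> = vec![0, 2, 2, 1, 3, 0];

    // Index -> string is just an array lookup, no dictionary needed...
    let first = countries[records[0] as usize];

    // ...while string -> index needs the lookup table (a linear scan is fine at this size).
    let jp = countries.iter().position(|&c| c == "JP").unwrap() as u8;

    println!("first record: {first}, JP interned as index {jp}");
}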
As mentioned elsewhere in the thread (at least by nostrademons: https://news.ycombinator.com/item?id=43245421, but probably elsewhere), data compression algorithms just generalize this from 8..64 bits to "'best' bits per token", but they need to measure up front what 'best' might mean. That domain of research/impl may also have its own terminology. I don't know what to do about "nobody talking to each other", either in this eentsy microcosm or more generally. Society can be tough. :-)
I don't know that I fully agree. A format that is both human- and machine-readable is very useful, for configuration and debuggability for example. It's indeed more expensive in compute and storage, but that's additional work _for machines_ rather than _for humans_ (having to compile/encode/decode data at the human/machine interface).
Certainly doesn't mean it's always the right trade-off, but it's an appealing one, as the popularity of these formats shows, in particular JSON which isn't even that good for human-readability (eg no comments in standard JSON).
So… plists and plutil?
And it is even less than an expectation. Many interning libraries never free the objects they create, so they can keep around lots of objects that will never be needed again, and thus increase memory usage.
I think the ideal API would somehow a) be simple and b) have the program communicate to the system what it wants. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long it plans to retain the data it is creating now.
Simple examples:
- if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.
- in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.
That’s probably at least a dozen PhDs’ worth of research, and likely not attainable, though.
strs←∪input_strings ⍝ Build
strs⍪←⊂'new string' ⍝ Append
strs⍳⊂'some string' ⍝ Search
f←strs∘⍳ ⍝ Hash table creation
One extra advantage on top of standard interning is that this also uses an incarnation of tiny pointers[0], automatically saving 2X memory generally. Better yet, standard APL data design will often obviate the need for most string searches, letting you get away with integer comparisons instead.

I'm starting to really see how data-oriented design tends to make straightforward code both parsimonious and highly performant.
use std::collections::BTreeSet;
use std::sync::Arc;

// The set holds the canonical copy of every interned string.
let mut set: BTreeSet<Arc<Box<str>>> = Default::default();
// Only insert if this string isn't already interned.
let s = Arc::new("abc".to_string().into_boxed_str());
if !set.contains(&s) {
    set.insert(s.clone());
}
// Hand out the Arc stored in the set, so equal strings share one allocation.
let s = set.get(&s).unwrap();
https://github.com/python/cpython/blob/main/InternalDocs/str...
we were able to handle trillion+ datapoints with relatively modest machines - definitely a useful approach if you are ready to do some bit twiddling
_my_intern_dict = {}
my_intern = lambda x: _my_intern_dict.setdefault(x, x)
s = my_intern(s)
Just make sure to delete _my_intern_dict when it's not needed anymore.

It's an established term in software. I would speculate that the word "intern" is short for "internalizing" here.
"Intern" (software) -> staying resident in memory to be referred to canonically
I find that often words do come about with a surprising amount of sense.