Figuring out how to develop a Unicode collator from scratch for a research group that I was working with in Berlin was one of my formative experiences as a programmer. Ever since then, I've wanted to write something to collect my thoughts on the Unicode Collation Algorithm and the process of building a conformant implementation. Last summer I had a good excuse to do so, when I decided to port my collator to Zig as a way of learning that language.
The Unicode standards, and the (relatively) low-level software libraries based on them, do a lot of things for us to make computing possible. We have the luxury of not needing to worry about most of those things most of the time. I find it humbling whenever I do peek under the hood.
I experimented with adding an LRU cache to my Rust UCA implementation, but I saw essentially no performance benefit (on the workloads I had), and I decided the feature wasn't worth the complexity and removed it.
Something I found about Unicode collation is that, once the fast paths are added, they get hit a surprisingly large percentage of the time. I'm thinking in particular of the way that performant UCA implementations build sort keys lazily, stopping once a collation decision is reached. The average "point of difference" is at the primary level and within a few characters of the start of each string. Only a small portion of the sort keys ever get built.
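To make the idea concrete, here's a minimal sketch of that lazy comparison loop. Everything here is illustrative: `primary_weights` is a hypothetical stand-in for a real collation element lookup (which would consult the DUCET/CLDR tables and handle contractions, expansions, and variable weighting), and real implementations also fall back to secondary and tertiary levels when the primaries tie. The point is only the shape of the fast path: weights are consumed in lockstep and the loop returns at the first primary difference, so full sort keys are never materialized.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for a UCA primary-weight lookup. A real collator
// would map each collation element to its primary weight via the
// DUCET/CLDR tables; here we just use the scalar value so the sketch runs.
fn primary_weights(s: &str) -> impl Iterator<Item = u32> + '_ {
    s.chars().map(|c| c as u32)
}

// Lazy primary-level comparison: walk both weight streams in lockstep and
// stop as soon as a difference is found. Characters past the point of
// difference are never examined, and no sort key is ever built.
fn compare_primary(a: &str, b: &str) -> Ordering {
    let mut wa = primary_weights(a);
    let mut wb = primary_weights(b);
    loop {
        match (wa.next(), wb.next()) {
            (Some(x), Some(y)) if x == y => continue,
            (Some(x), Some(y)) => return x.cmp(&y),
            (None, None) => return Ordering::Equal, // would fall through to lower levels
            (None, Some(_)) => return Ordering::Less,
            (Some(_), None) => return Ordering::Greater,
        }
    }
}

fn main() {
    // The decision is reached at the second character; the tails are untouched.
    assert_eq!(compare_primary("abcdef", "axcdef"), Ordering::Less);
    assert_eq!(compare_primary("same", "same"), Ordering::Equal);
    println!("ok");
}
```

In the sketch, the `(None, None)` arm is where a conformant implementation would continue on to secondary (accent) and tertiary (case) weights rather than declaring equality outright.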
I am definitely interested in finding more ways to avoid work in the collation routine. Many times, I've had what I thought was a clever idea and found that it didn't pan out in benchmarks. Thank you for your comment!