Over the years, the implementation of Java’s String class has been improved again and again, offering performance improvements and memory usage reduction. And us Java developers get these improvements with no work required other than updating the JRE we use.
All the low-hanging fruit was taken years ago, of course. These days, I’m sure most Java apps would barely get any noticeable improvement from further String improvements, such as the one in the article we’re discussing.
These kinds of niche optimizations are still significant. The OOP model allows them to be implemented with much less fanfare. This is in the context of billion-dollar platforms. With some basic performance testing and API replays, we're saving thousands of dollars a day. Nobody gets a pat on the back. Maybe some pizza on Friday.
Turns out you can write Java without the stuff. No getters and setters, no interfaces or dependency injection, no separate application server (just embed one in your jar). No inheritance. Indeed no OOP (just data classes and static methods).
Just simple, C-like code with the amazing ecosystem of libraries and the incredibly fast marvel that is the JVM and (though this is less of a deal now with LLM autocomplete) a simple built-in type system that makes the code practically write itself with autocomplete.
It's truly an awesome dev experience if you just have the power / culture to ignore the forces pressuring you to use the 'stuff'.
But that free-thinking definition of Java clashes with the mainstream beliefs in the Java ecosystem, and you'll get a lot of opposition at workplaces.
So I gave up on Java, not because of the language, but because of the people, and the forced culture around it.
I have come to see this as a mix of business people and developers not doing their jobs to protect their paycheck. Business people, if they want to succeed, need to be converging on a strategy that makes money. Developers need to have a strategy for removing technical barriers to realizing that strategy. The lack of a business strategy often makes an overly general technical platform look attractive.
> focusing on the tech as if it were the destination
So common. Complexity should be considered the enemy, not an old friend.
There is more to computer programming than the OOP clutter.
"We rewrote 200k line Java codebase in only 50k lines. Guess what language we used?"
The follow-up tweet a day later:
"It was Java."
Yup. Prefer composition over inheritance.
Such a hard sell during the heyday of OOAD. UML, Fusion, RUP, blahblahblah.
Having previous experience with LISP, it just seemed obvious, barely worth mentioning. Definitely set me apart from my colleagues.
FWIW: Our local design pattern study group tackled Arthur J. Riel's Object-Oriented Design Heuristics [1996] https://archive.org/details/objectorientedde0000riel https://www.amazon.com/Object-Oriented-Design-Heuristics-Art... Which might be the earliest published pushback against all that overwrought enterprisey ThoughtWorks-style brainrot.
> No ... dependency injection
Yes and: Mocks (mocking?) is a code stench. Just flip the ownership relationship(s). As Riel, and surely many others, have laboriously explained, to a mostly uncaring world.
I've mostly done ASTs and (scene) graphs. I prefer dumb objects, stuffing most behavior and logic in an Interpreter (design pattern) implementation. Being a very simple bear, my working memory can't keep track of all the smarts scattered all about.
My current example is processing SQL statements; think expression evaluator (subset) for a very specific use case. I tend to use so-called "External" Iterators for walking trees, to keep all the logic in one place. Versus Visitor, Listeners, or even Active Objects. Which is feasible for this use case, because it's bounded and unlikely to change (eg extension points).
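A minimal sketch of the "dumb objects plus external iterator" style described above, assuming a toy expression tree rather than real SQL. `Node`, `PreOrder`, and `sumLiterals` are hypothetical names made up for illustration; the point is only that the nodes carry data and the walking logic sits in one loop.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class TreeWalkDemo {
    // A dumb data node: a kind tag, an optional numeric value, and children. No behavior.
    record Node(String kind, int value, List<Node> children) {
        static Node literal(int v) { return new Node("lit", v, List.of()); }
        static Node op(String kind, Node... kids) { return new Node(kind, 0, List.of(kids)); }
    }

    // External (caller-driven) pre-order iterator backed by an explicit stack.
    static final class PreOrder implements Iterator<Node> {
        private final Deque<Node> stack = new ArrayDeque<>();

        PreOrder(Node root) { stack.push(root); }

        @Override public boolean hasNext() { return !stack.isEmpty(); }

        @Override public Node next() {
            if (stack.isEmpty()) throw new NoSuchElementException();
            Node n = stack.pop();
            // Push children in reverse so the leftmost child is visited first.
            for (int i = n.children().size() - 1; i >= 0; i--) {
                stack.push(n.children().get(i));
            }
            return n;
        }
    }

    // All the logic lives here, in one loop, instead of being spread across node classes.
    static int sumLiterals(Node root) {
        int sum = 0;
        for (Iterator<Node> it = new PreOrder(root); it.hasNext(); ) {
            Node n = it.next();
            if (n.kind().equals("lit")) sum += n.value();
        }
        return sum;
    }

    public static void main(String[] args) {
        Node tree = Node.op("add", Node.literal(1),
                Node.op("mul", Node.literal(2), Node.literal(3)));
        System.out.println(sumLiterals(tree));
    }
}
```

The trade-off is the one the comment names: this works well when the set of node kinds is bounded, and less well when third parties need extension points.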
YMMV of course. Now I'm just babbling, apologies, and probably went too far off topic.
Then there's C# which most anyone who's enthusiastic about software dev will find far nicer to work with, but it's probably harder for bargain basement offshore sweatshops to bang their head against.
Java has gone through this evolution, implementing generics, lambdas, etc., and I believe it strikes a very good balance in not being overly complex (just look at the spec - it's still a very small language for its age, unlike C++ or C#).
Go tried to re-invent this evolution, without having learnt Java's lessons. They will add more and more features until their "simple" will stop applying (though I personally believe that their simple was always just simplistic), simply because you need some expressivity for better libraries, which will later on actually simplify user code.
Also relevant: https://www.tedinski.com/2018/01/30/the-one-ring-problem-abs...
Also, not using as much memory in these types of GCs is a direct hit to performance. And this actually shows splendidly on GC-heavy applications/benchmarks.
This statement surprised me. I can't even remember last time I ran any opensource Java.
There are a couple of Java projects, and even one or two kind of successful ones. But Java in open source is very rare, not the boring workhorse.
If I worked on a project that used Bazel, then sure, I'd use Bazel every day.
But which is "the boring workhorse" of open source, if I gave you the option of Java, Make, Linux, gcc, llvm, .deb, would Java really be "the" one?
Sure, maybe you could exclude most of those as not being "boring", like llvm. But "make" wins by any measure. And of course, it's almost by definition hard to think about the boring workhorse, because the nature of it is that you don't think about it.
Checking now, the only reason I can find java even being installed on my dev machines is for Arduino IDE and my Android development environment. Pretty niche stuff, in the open source space.
It doesn't mean "more than zero projects are Java based". Nor does it mean "most (opensource?) Java applications are based on open source". That latter is borderline circular, only Oracle legal shenanigans make it not circular.
> and Java dominates enterprise space
I said nothing about enterprise. Clearly Java is HUGE in enterprise.
> so it is a huge open source workhorse
That sentence took a strange turn. Enterprise, and then back to open source?
> just more obscure than Linux, gcc etc.
Obscure? I'd expect Java to be about as strong a brand as Linux. Among developers in general I'd expect gcc to be orders of magnitude more obscure. There's no programmer out there who has not heard of Java, but many have never heard of gcc.
You said what it is not, but forgot to share your own definition.
>That sentence took a strange turn. Enterprise, and then back to open source?
What makes you so surprised? One does not exclude the other; enterprise users are users too. Most things in the Java world aren't client-side, so many users won't observe them directly, but open source Java technology is doing a lot of work for them, constituting a significant share of the code base.
There are worse fundamental problems in Java. For example the lack of a proper numeric tower. Or the need to rely on annotations to indicate something as basic as nullability.
That sounds like a recipe for disaster though, as it generally makes code much harder to read.
In my experience OOP is actually pretty pleasant to work with if you avoid extending classes as much as possible.
> These kinds of niche optimizations are still significant. The OOP model allows them to be implemented with much less fanfare.
If you're referring to the optimization in the article posted then I would argue an OOP model is not needed for it, just having encapsulation is enough.
I think “avoid extending classes” is there because it is as good as impossible to design classes that can be extended easily in ways you do not foresee, and if you do foresee how your classes could be extended, it often is easier for your users if you made your classes more flexible, to start with.
If you removed all the stuff related to inheritance and trying to fix the leaky abstraction that is objects, the language would be a fraction of the size (compare with Go or StandardML for how small a language without inheritance can be).
Whether or not this is an endorsement of OOP or a criticism is open to interpretation.
For a long time, Java was like, every class is a library. I do not think it's a failure of OOP, it's a failure of Java.
But I'm optimistic, I choose to see recent additions like records and pattern matching as a step in the right direction.
My thoughts exactly. Give me more classes with shallower inheritance hierarchies. Here is where I think Go's approach makes sense.
Then you can get to benefit from Java's unparalleled ecosystem of enterprise hardened libraries, monitoring etc.
Have you tried updating production usage of a JRE before??
Java 8 -> 9 is the largest source of annoyances, past that it's essentially painless.
You just change a line (the version of the JRE) and you get a faster JVM with better GC.
And with ZGC nowadays garbage collection is essentially a solved problem.
I worked on a piece of software serving almost 5 million requests per second on a single (albeit fairly large) box off a single JVM and I was still seeing GC pauses below the single millisecond (~800 usec p99 stop the world pauses) despite the very high allocation rate (~60gb/sec).
The JVM is a marvel of software engineering.
With projects like OpenRewrite [1] and good LLMs, things are a lot easier these days.
Depending on what sort of document you're looking for, you might like either the JEP: https://openjdk.org/jeps/254
or Shipilev's slides (pdf warning): https://shipilev.net/talks/jfokus-Feb2016-lord-of-the-string...
Shipilev's website (https://shipilev.net/#lord-of-the-strings), and links from the JEP above to other JEPS, are both good places to find further reading.
(I think I saw a feature article about the implementation of the string compression feature, but I'm not sure who wrote it or where it was, or if I'm thinking about something else. Actually I think it might've been https://shipilev.net/blog/2015/black-magic-method-dispatch/, despite the title.)
Shipilev's website looks like a fascinating resource. I appreciate the pointer!
I would, but unfortunately I got a NullPointerException.
I suggest you try Rust instead; its borrow checker will ensure you can't share pointers in an unsafe manner.
https://cr.openjdk.org/~pminborg/stable-values2/api/java.bas...
I don't understand the functional difference between the suggested StableValue and Records, or Value Classes.
They define a StableValue as:
> "A stable value is a holder of contents that can be set at most once."
Records were defined as: > "... classes that act as transparent carriers for immutable data. Records can be thought of as nominal tuples."
And Value Objects/Classes as: > "... value objects, class instances that have only final fields and lack object identity."
Both Records and Value Objects are immutable, and hence can only have their contents set upon creation or static initialization.
The implementation of a value object will be able to use StableValue internally for lazy computation and/or caching of derived values.
Which, to me, means, potentially, two things.
One, that the JVM can de-dup "anything", like, in theory, it can with Strings now. VOs that are equal are the same, rather than relying on object identity.
But, also, two, it can copy the contents of the VO to consolidate them into a single unit.
Typically, Java Objects and records are blobs of pointers. Each field pointing to something else.
With Value Objects that may not be the case. Instead of acting as a collection of pointers, a VO with VOs in it may more be like a C struct containing structs itself -- a single, continuous block of memory.
So, an Object is a collection of pointers. A Record is a collection of immutable pointers. A Value Object is (may be) a cohesive, contiguous block of memory to represent its contents.
Records are also immutable, but you can create them and delete them throughout your application like you would a regular class.
Yes, but remind people it's not static in the sense of being associated with the class, nor constant for compile-time purposes.
Perhaps better to say: A stable value is lazy, set on first use, resulting in pre- and post- initialization states. The data being set once means you cannot observe a data change (i.e., appears to be immutable), but you could observe reduction in resource utilization when comparing instances with pre-set or un-set values -- less memory or time or other side-effects of value initialization.
So even if data-immutable, a class with a stable value ends up with behavior combinations of two states for each stable value. Immutable records or classes without stable values have no such behavior changes.
But, writ large, we've always had this with the JVM's hotspot optimizations.
For String, it becomes significant whether hashcode is used when calculating equals (as a fast path to negative result). If not, one would have two equal instances that will behave differently (though producing the same data), at least for one hashcode-dependent operation.
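    record Rational(int num, int denom) {
        // Compact canonical constructor: reduce to lowest terms before the fields are assigned.
        Rational {
            int gcd = gcd(num, denom);
            num /= gcd;
            denom /= gcd;
        }

        // Euclid's algorithm. Not in the original snippet; added so the record compiles.
        private static int gcd(int a, int b) {
            return b == 0 ? Math.abs(a) : gcd(b, a % b);
        }
    }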
I assume it's static in the context of its containing object. So, it will be collected when its string is collected.
(which is why it will be computed over and over again if your special string happens to hash to 0)
What level are you suggesting lateinit happens at if not on the JVM?
It should really be something like
public stable logger = () -> new Logger(/* .. */).
Where the JDK hides the details of making sure the value is only created once, basically like the classholder idiom but under the hood. I'm sure there are reasons why they're not doing it that way, but ... it's definitely what the language needs to be able to do.
Incidentally, I've always appreciated for Python PEPs how they list all of the obvious complaints about an issue and explain methodically why each was determined not to work. The JEPs don't seem to reach quite the same candor.
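For readers who haven't met the class-holder idiom mentioned above, here is a minimal sketch of it in plain Java, with no preview APIs. `Logger` here is a hypothetical stand-in for any expensive object, and the `created` counter exists only to make the "created once" behavior observable.

```java
final class LoggerHolderDemo {
    // Hypothetical stand-in for an expensive-to-create object.
    static final class Logger {
        static int created = 0;          // counts constructions, for demonstration only
        Logger() { created++; }
        void info(String msg) { System.out.println("INFO " + msg); }
    }

    // The JVM initializes Holder only on first access to Holder.INSTANCE,
    // and class initialization is thread-safe, so Logger is built at most once.
    private static final class Holder {
        static final Logger INSTANCE = new Logger();
    }

    static Logger logger() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        logger().info("first use triggers initialization");
        logger().info("second use reuses the same instance");
        System.out.println("constructed " + Logger.created + " time(s)");
    }
}
```

The cost is one extra nested class per lazy value, which is the boilerplate a language-level or library-level construct would hide.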
static final Complex CONSTANT = new Complex(1, 2);
If you want a lazy initialized constant, you want a stable value:
static final StableValue<Complex> STABLE_VALUE = StableValue.of();
Complex getLazyConstant() {
    return STABLE_VALUE.orElseSet(() -> new Complex(1, 2));
}
If you want the fields of a constant to be constant too, Complex has to be declared as a record.
Additionally, having to define a record FooHolder(Foo foo) simply to hold a Foo would be a lot more cumbersome than just saying StableValue<Foo> fooHolder = StableValue.of(); There's no need for an extra type.
- Value classes have no identity, which means they can't have synchronized methods and don't have an object monitor. While it would be possible to store a value object inside a StableValue, there are plenty of use cases for an identity object inside a StableValue, such as the Logger example inside the JEP: one could easily imagine a fictional logger having a synchronized method to preserve ordering of logs.
I wouldn't say these are all entirely orthogonal concerns, but they are different concepts with different purposes.
The `@Stable` annotation (internal-only for now) and `StableValue<>` (for user code in the future) tell the JIT that the programmer guarantees (swears on his life!) that no dirty tricks are played with these values anywhere in the codebase, so the JIT can constant-fold them as soon as they are initialized.
It means that `StableValue<>` can be used in ordinary classes (where `final` fields are still not constant-folded) and, additionally, supports late initialization.
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Per Hyrum's law, there's no changing it now.
https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...
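The formula above can be evaluated with Horner's rule in wrapping 32-bit arithmetic. This is a sketch of the documented formula, not a copy of the JDK source; `PolyHash` is a made-up name, and "malloc" is just a sample input.

```java
final class PolyHash {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] with Horner's rule.
    // int overflow wraps around, which is what the documented formula relies on.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Compare the hand-rolled formula against the built-in method.
        System.out.println(hash("malloc") == "malloc".hashCode());
        System.out.println(hash("malloc"));
    }
}
```

Because the formula is part of the public Javadoc, any program may legitimately depend on these exact values, which is why it cannot be changed.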
But, it's actually desirable to have some cryptographic properties in a faster one way function for making hash tables. Read about SipHash to see why.
Because Java didn't (and as others have discussed, now can't) choose otherwise the hash table structures provided must resist sabotage via collision, which isn't necessary if your attacker can't collide the hash you use.
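To make "sabotage via collision" concrete: with the fixed 31-based formula, "Aa" and "BB" have the same hash, so any sequence built from those two blocks collides as well. A small sketch (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

final class CollisionDemo {
    // "Aa" and "BB" hash to the same value, and both have length 2, so every
    // string made of the same number of these blocks shares one hashCode().
    // n blocks give 2^n distinct colliding keys.
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        return keys;
    }

    public static void main(String[] args) {
        for (String k : collidingKeys(3)) {
            System.out.println(k + " -> " + k.hashCode());
        }
    }
}
```

An attacker who controls the keys can generate as many of these as they like offline, which is the threat a keyed hash is meant to remove.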
I assume Java has to add some other layer to avoid this, rather than using a collision resistant hash scheme.
the application creator needed to have anticipated this threat model, and they can prepare for it (for example, salt the keys).
But to put onto every user a hash that is collision resistant, but costs performance, is unjustified because not every user needs it.
User influenced/controlled input happens way more than we expect, I think the more sensible approach would be for the map to be safe by default, and to reach for a high performance external map for those times when that particular data structure is the bottleneck.
Number of headers is limited by default to a relatively small value (10k), which in the worst case of hash collisions results in relatively fast lookup time due to tree-like implementation of HashMap.
Exactly, this data structure has to provide counter-measures to cope with the fallout from this other poor choice.
The fast data structures for this work don't need a tree here, they're open addressed instead, but this would mean they can't cope if your attacker can arbitrarily collide input, so, too bad.
Suppose we have 10'000 key->value pairs to store. A Swiss Table (a popular open addressed hash table design) will need space for 16384 contiguous pairs plus metadata†. So it'll do that allocation once, but with the closed addressing and trees you're paying to grow trees during insertion, you can't know which trees will grow and which are unused.
You're correct that the Swiss Table will see collisions, but an attacker can't choose them, so they're rare. Each collision costs us a search step, but because the Java design incurs a pointer chase (to find the tree) for every lookup that's actually the same price [one memory fetch] as a single collision for the Swiss Table, yet most of the time the Swiss Table sees less than 1.0 collisions per lookup.
† It's ensuring no more than 87.5% of the storage is used and then rounding up to a power of two because that means less work for each look-up step. 11429 would be enough space, but 16384 is the next largest power of two.
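The sizing arithmetic in the footnote can be checked with a few lines. This is only the calculation as described in the comment (87.5% maximum load, then round up to a power of two), not code from any actual Swiss Table implementation; the names are made up.

```java
final class SwissTableSizing {
    // Smallest slot count that keeps the load factor at or below 7/8 (87.5%).
    // Integer form of ceil(entries / 0.875) = ceil(entries * 8 / 7).
    static int minSlots(int entries) {
        return (entries * 8 + 6) / 7;
    }

    // Round a positive n up to a power of two (n itself if it already is one).
    static int nextPowerOfTwo(int n) {
        int p = Integer.highestOneBit(n);
        return p == n ? n : p << 1;
    }

    public static void main(String[] args) {
        int needed = minSlots(10_000);
        System.out.println(needed);
        System.out.println(nextPowerOfTwo(needed));
    }
}
```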
In risk management there exist several different approaches: accept, mitigate, transfer, avoid. What makes you think that this design must avoid certain risks? The developers of standard libraries like Java usually base their decisions on industry feedback, IIRC, Oracle team did have a lot of telemetry data and they are good engineers. Do you have any insights proving that their approach is wrong?
Time's arrow
Java is a Sun (Microsystems) language, Oracle just bought the entire company much later. So one problem here is that you're imagining this like it's a decision somebody made this century but it isn't, it's a decision from the early 1990s.
Computer Science is a young discipline. In the early 1990s the Open Addressed hash tables do exist but it's very usual to write the simple chained hash table with Closed Addressing and as I understand it that's exactly what is provided in Java 1.0. This data structure is today known to be a bad idea, and I've criticised the choice in say C++ to standardize the exact same data structure (as std::unordered_map) in 2011. But that's twenty years later.
At some point (I believe this century?) Java engineers went back and improved their hash table to use trees not chains, but they were constrained by the baked in hash, so they could not have picked radically modern data structures which require a different hash strategy without significant disruption.
When I got a CS degree the Introspective sort was so new I'd have to have read bleeding edge research papers to know about it. Today it's very silly if you claim you provide a standard library sort and it's not at least as good as Introsort.
The Swiss Table is from the 2010s, it's new enough that there's a YouTube video of them introducing this data structure, not because they did that specifically on purpose but because they're giving a talk and obviously all talks are filmed and uploaded to YouTube these days. Hopscotch hashing - another popular Open Addressing strategy - is from 2008.
So, the question isn't "Is this the best idea today?" but only "Given what they knew 30+ years ago, was this reasonable back then?" and yeah, of course it was reasonable back then but that doesn't magically mean it's the best choice now.
Edited: Added paragraph about the later choice to use trees.
But I don't think Java should go break all this stuff. I'm just explaining why - in fact - we might want some cryptographic properties for hash algorithms. It can be true both that Java made reasonable tradeoffs at the time (in the 1990s) and that you shouldn't copy those choices today.
There's a long list of things Java picked that in hindsight weren't a good idea and in many cases it would have been really hard to guess that when Java was invented - however today you should not choose what Java did. Biggest example: 16-bit char. Java was invented when UCS-2 might happen, and in that light a 16-bit char makes sense. A few years later it's obvious UCS-2 is not possible and so you're choosing UTF-8 or UTF-16 and that's easy, pick UTF-8. People who'd already bought into the 16-bit char, such as Microsoft and Sun's Java team, were stuck with UTF-16 as a booby prize.
Many programming languages do not use open addressing and prefer chaining (Java, C#, Golang, std::unordered_map in C++) for simplicity and flexibility. There are certain trade offs to be considered, it is not a simple choice of absolutely the best known algorithms. I don't think Java made the wrong choice here. Notably, there was always a possibility to introduce a specific collection with some modern algorithm as an alternative to HashMap. It did not happen not least because it was not considered that important. I'm sure that in cases where open addressing offers substantial benefits, developers make conscious choice and use the appropriate data structures.
Regarding strings, since Java 9 (released a long time ago) they can be stored in byte arrays if they contain only Latin-1 characters, and the core team is open to further enhancements.
The cryptographic hash functions such as the SHA-2 family resist construction of a second pre-image, given N and H(N) there's no way for me to make M such that H(M) == H(N) but M != N that'll be faster than brute force trying all possible values for M. So far everything we have which achieves this is markedly slower than say FNV-1 which would once have been the most likely hashing algorithm for a hash table.
The one way password hashes, some of which are based on the cryptographic hashes, are slow by design. You should not use passwords, but on HN this seems like a lost cause so, assuming you insist on using passwords these algorithms are specifically what you need. Fighting about which one to use ("oh no, PBKDF2 isn't memory hard, blah blah blah") is also a bottomless hole of HN nonsense, so, fine, pick whichever of them you're certain is the only correct one.
But we do ideally want some cryptographic properties for a hash to be used in a hash table and your instinct that just using a random factor and hoping won't work was correct. You need a correctly designed function, which is what SipHash is. What you want is some keyed function with a key K such that if I know N and H(K,N) but not K, I can't guess an M such that H(K,M) = H(K,N) but M != N except by brute force.
> Many programming languages do not use open addressing and prefer chaining (Java, C#, Golang, std::unordered_map in C++) for simplicity
This was true in Golang but today the map in Go is a modified Swiss Table which of course delivers improved performance. We've seen why Java is stuck where it is. The C++ std::unordered_map is known to have poor performance, C++ programmers have many alternatives [including of course Swiss Tables], such as a really nice modern offering from Boost.
I'm actually not sure about the guts of the C# hash tables at all, it's a complicated beast full of conditionally defined elements because it's reused by CLR internals as well as provided as a type for us users. It's not a conventional design of any sort, there seems to have been (which is common for Microsoft) some degree of Not Invented Here - just make it up as we go along. It's storing pre-calculated hashes with each key->value pair, it's tracking how many collisions it has, I dunno, I wouldn't do this, don't copy it without seeing hard numbers for why this is good.
For example, Python back in 2012: https://nvd.nist.gov/vuln/detail/cve-2012-1150.
The reason why this kind of thing should be the default is because it's unreasonable to expect this level of understanding from your average coder, yet most software is written by the latter. That's why PL and framework design has been moving towards safety-by-default for quite some time now - because nothing else works, as proven by experience.
Second, this risk was reliably mitigated in Java as soon as it was discovered. Just because hash collisions may exist, it doesn’t mean they are exploitable. CVE for JDK was not fixed, because it has been taken care of elsewhere, in Tomcat etc, where meaningful validation could take place.
Context matters.
This is detailed in implementation notes comment here: https://github.com/openjdk/jdk/blob/56468c42bef8524e53a929dc...
For things that need to be secure, there are dedicated libraries, standard APIs, etc. that you probably should be using. For everything else, this is pretty much a non issue that just isn't worth ripping up this contract for. It's not much of an issue in practice and easily mitigated by just picking things that are intended for whatever it is you are trying to do.
https://docs.python.org/3/reference/datamodel.html#object.__...
It will be a very impactful work; I'm excited to see. Probably even a 1% improvement in String::hashCode will have an impact on global carbon footprint or so.
> Computing the hash code of the String “malloc” (which is always -1081483544)
Makes sense. Very cool.
> Probing the immutable Map (i.e., compute the internal array index which is always the same for the malloc hashcode)
How would this work? "Compute" seems like something that would be unaffected by the new attribute. Unless it's stably memoizing, but then I don't quite see what it would be memoizing here: it's already a hash map.
> Retrieving the associated MethodHandle (which always resides on said computed index)
Has this changed? Returning the value in a hash map once you've identified the index has always been zero overhead, no?
> Resolving the actual native call (which is always the native malloc() call)
Was this previously "lazyinit" also? If so, makes sense, though would be nice if this was explained in the article.
The index is computed from the hashcode and the size of the array. Now that the hash code can be treated as a constant, and the size of the array is already a constant, the index can be worked out at compile time. The JVM can basically inline all the methods involved in creating and probing the map, and eliminate it entirely.
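A simplified sketch of why the index becomes a constant: it is a pure function of the key's hash code and the table length. This is not the JDK's actual probing code (real maps may mix the hash bits and probe further on collisions); `slotIndex` and the table length of 16 are assumptions for illustration.

```java
final class ProbeIndexDemo {
    // Maps a key to a slot using only its hash code and the table length.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int slotIndex(Object key, int tableLength) {
        return Math.floorMod(key.hashCode(), tableLength);
    }

    public static void main(String[] args) {
        // Same key, same table length: the same slot every time.
        System.out.println(slotIndex("malloc", 16));
        System.out.println(slotIndex(new String("malloc"), 16));
    }
}
```

If both inputs are known constants at JIT time, the whole expression can be computed once, which is the step the comment is describing.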
I guess a @Stable attribute on the array underlying the map would allow for the elimination of one redirection: in a mutable map the underlying array can get resized so its pointer isn't stable. With an annotated immutable map it could be (though IDK whether that'd work with GC defrag etc). But that seems like relatively small potatoes? I don't see a way to "pretend the map isn't even there".
Even if the map is crucial for some reason, why not have the map take a simple value (like a uint64) and require the caller to convert their string into a slot before looking up the function pointer? That way the cost to exchange the string becomes obvious to the reader of the code.
I struggle to find a use case where this would optimize good code. I can think of plenty of bad code usecases, but are we really optimizing for bad code?
The most common such usage in modern web programming is storing and retrieving a map of HTTP headers, parsed query parameters, or deserialized POST bodies. Every single web app, which arguably is most apps, would take advantage of this.
I don't have the profiling data for this, so this is purely theoretical speculation. At that point you're shoving HTTP headers, which are dynamic data that has to be read at runtime, into heap-allocated data structures inside the request handling. It kinda feels like doing a little xor on your characters is a trivial computation.
I don't envision this making any meaningful difference to those HTTP handlers, because they were written without regard for performance in the first place.
Your proposed solution is to have the user manually implement a hash table, but if you have a good optimizer, users can focus on writing clear code without bugs or logic errors and let the machine turn that into efficient code.
At first I thought the article was describing something similar to Ruby’s symbols
only strings that are known at compile time could possibly be compile-time hashed?
But the article is talking about strings in a running program. The performance improvements can apply to strings that are constants, but are created at run time.
I mean the developer has to create the StableValue field, but its access is optimized away.
> There is, furthermore, mechanical sympathy between stable values and the Java runtime. Under the hood, the content of a stable value is stored in a non-final field annotated with the JDK-internal @Stable annotation. This annotation is a common feature of low-level JDK code. It asserts that, even though the field is non-final, the JVM can trust that the field’s value will not change after the field’s initial and only update. This allows the JVM to treat the content of a stable value as a constant, provided that the field which refers to the stable value is final. Thus the JVM can perform constant-folding optimizations for code that accesses immutable data through multiple levels of stable values, e.g., Application.orders().getLogger(). > Consequently, developers no longer have to choose between flexible initialization and peak performance.
This is saying that the StableValue instance will have a non-final field annotated that way, not that there is no StableValue instance allocated. Note that the user-code-level field is final, so that's not the field being referred to here. In fact, this description is what makes me think that the StableValue object might exist even after JITting.
What would be the performance improvement in average java services?
Are there specific types of applications that would benefit a lot?
Does this make string.intern() more valuable? String caches?
It would be faster but not as blindingly fast. Combined with an immutable map, what it means is that the JVM can directly replace your key with its value, like the map is not even there. Because the key's hashcode won't ever change, and the map won't ever change.
> Does this make string.intern() more valuable?
No, String.intern() does a different job: it's there to save you memory. If you know a string (e.g. an attribute name in an XML document) is used billions of times and parsed out of a stream, you can keep one copy of it instead of a billion copies. The downside, on JVMs before Java 7, was that it put the string into PermGen, which meant that if you started interning normal strings, you'd run out of memory quickly; since Java 7 the string pool lives on the main heap.
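A small sketch of what interning buys you: strings built at run time are distinct objects even when their contents match, and `intern()` maps them to one canonical instance. `canonical` is a made-up helper name; the attribute name is an arbitrary example.

```java
final class InternDemo {
    // Returns the pooled, canonical instance for the given contents.
    static String canonical(String s) {
        return s.intern();
    }

    public static void main(String[] args) {
        String literal = "attribute-name";                       // literals are interned by the JVM
        String parsed = new String("attribute-name".toCharArray()); // simulates a string parsed from a stream

        System.out.println(parsed == literal);            // false: two separate objects
        System.out.println(parsed.equals(literal));       // true: same contents
        System.out.println(canonical(parsed) == literal); // true: one shared copy
    }
}
```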
In the same way that if you wrote this C code:
const int x[] = {20, 100, 42};
int addten(int idx) { return x[idx] + 10; }
the C compiler would "just know" that anywhere you wrote x[2], it could substitute 42. Because you signalled with the "const" that these values will never change. It could even replace addten(2) with 52 and not even make the call to addten(), or do the addition.
The same goes for Java's value-based classes: https://docs.oracle.com/en/java/javase/17/docs/api/java.base...
But it's a bit more magical than C, because _some_ code runs, to initialise the value, and then once it's initialised, there can be further rounds of code compilation or optimisation, where the JVM can take advantage of knowing these objects are plain values and can participate in things like constant-folding, constant propagation, dead-code elimination, and so on. And with @Stable it knows that if a function has been called once and didn't return zero, it can memoise it.
> What if there are bucket collisions? Do immutable maps expand until there aren't any? Moreover, what if there are hash key collisions?
I don't know the details, but you can't have an immutable map until it's constructed, and if there are problems with the keys or values, it can refuse to construct one by throwing a runtime exception instead.
Immutable maps make a lot of promises -- https://docs.oracle.com/en/java/javase/17/docs/api/java.base... -- but for the most part they're normal HashMaps that are just making semantic promises. They make enough semantic promises internally to the JVM that it can constant fold them, e.g. with x = Map.of(1, "hello", 2, "world") the JVM knows enough to replace x.get(1) with "hello" and x.get(2) with "world" without needing to invoke _any_ of the map internals more than once.
What wasn't working until now was strings as keys, because the JVM didn't see the String.hash field as stable. Now it does, and it can constant fold _all_ the steps, meaning you can also have y = Map.of("hello", 1, "world", 2) and the JVM can replace y.get("hello") with 1.
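For reference, this is roughly the shape of code being talked about: a static final field holding an unmodifiable map, looked up with a constant string key. ConstantLookup and its members are names I made up. Whether the JIT actually folds the lookup depends on the JDK version and can't be observed from the program's output, so the sketch only shows the pattern.

```java
import java.util.Map;

class ConstantLookup {
    // static final field + unmodifiable map + constant key
    static final Map<String, Integer> CODES = Map.of("hello", 1, "world", 2);

    // A candidate for folding to "return 1" once the JIT trusts the key's cached hash.
    static int helloCode() {
        return CODES.get("hello");
    }

    public static void main(String[] args) {
        System.out.println(helloCode()); // 1
        try {
            CODES.put("again", 3);       // maps from Map.of reject modification
        } catch (UnsupportedOperationException e) {
            System.out.println("map is unmodifiable");
        }
    }
}
```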
But interned strings can also reuse their hashcode forever.
https://docs.oracle.com/en/java/javase/17/docs/api/java.base...
How does it tell the JVM this? It uses the internal annotation @jdk.internal.ValueBased
https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/...
Probably depends on the use case, though I'm having trouble thinking of such a use case. If you were dynamically creating a ton of different sets that had different instances of the same strings, then, maybe? But then the overhead of calling `.intern()` on all of them would presumably outweigh the overhead of calling `.hashCode()` anyway. In fact, now that `.hashCode()` is faster, that could ostensibly make `.intern()` less valuable. I guess.
In short: a general purpose String substitute in Java would be an extremely poor idea.
If you mean the headline, Strings are a universal data type across programming, so claims of improving their performance get more clicks than "this annotation that you have never heard about before makes some specific code faster", especially when it comes to getting the attention of non-Java programmers.
The Stable annotation is an optimization mechanism: a promise the developer makes to the compiler. It is on the developer to uphold.
This optimization is about avoiding even calling the method because the JVM knows that the value returned will be the same.
commit a136f37015cc2513878f75afcf8ba49fa61a88e5
Author: Kaz Kylheku <kaz@kylheku.com>
Date: Sat Oct 8 20:54:05 2022 -0700
strings: revert caching of hash value.
Research indicates that this is something useful in
languages that abuse strings for implementing symbols.
We have interned symbols.
* lib.h (struct string): Remove hash member.
* lib.c (string_own, string, string_utf8, mkustring,
string_extend, replace_str, chr_str_set): Remove
all initializations and updates of the removed
hash member.
* hash.c (equal_hash): Do not cache string hash value.
> This improvement will benefit any immutable Map<String, V> with Strings as keys and where values (of arbitrary type V) are looked up via constant Strings.
Wait, what? But, that's inherently constant foldable without reasoning about string hash codes; we don't need them at all.
We examine the expression [h "a"]: lookup the key "a" in hash table h, where h is a hash literal object, that we write as #H(() ("a" "b")). It contains the key "a", mapping it to "b":
1> (compile-toplevel '[#H(() ("a" "b")) "a"])
#<sys:vm-desc: 8eaa130>
What does the code look like?
2> (disassemble *1)
data:
0: "b"
syms:
code:
0: 10000400 end d0
instruction count:
1
#<sys:vm-desc: 8eaa130>
One instruction: just return "b" from the static data register d0. The hash table is completely gone. The keys don't even have to be strings; that's a red herring.
Their goal wasn't to improve key lookups in hash tables, that is more or less just an example. It was to improve optimization of variables with lazy initialisation overall and the hash of String uses lazy initialisation.
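To illustrate the lazy initialisation being referred to, here is a minimal sketch of a lazily cached hash where 0 doubles as "not computed yet". CachedHashText is a hypothetical class, not the real java.lang.String source; it only reuses the documented 31-based polynomial so its results match String.hashCode(). The computations counter exists purely for the demo.

```java
// Hypothetical sketch of a lazily cached hash; not the actual String implementation.
final class CachedHashText {
    private final String text;
    private int hash;      // 0 doubles as "not computed yet"
    int computations;      // demo-only counter of how often the loop was entered

    CachedHashText(String text) {
        this.text = text;
    }

    int cachedHash() {
        int h = hash;
        if (h == 0) {      // either never computed, or the real hash happens to be 0
            computations++;
            for (int i = 0; i < text.length(); i++) {
                h = 31 * h + text.charAt(i);
            }
            hash = h;      // the single transition from default to final value
        }
        return h;
    }

    public static void main(String[] args) {
        CachedHashText hello = new CachedHashText("hello");
        hello.cachedHash();
        hello.cachedHash();
        System.out.println(hello.computations); // 1: cached after the first call

        CachedHashText empty = new CachedHashText("");
        empty.cachedHash();
        empty.cachedHash();
        System.out.println(empty.computations); // 2: a hash of 0 looks "not computed"
    }
}
```

The field goes from its default to one final value and never changes again, which is the property a "trust me, this is stable" annotation is promising. The zero case is the awkward corner: a value whose real hash is 0 never looks initialised.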
Could you solve the empty string hashes to zero problem by just adding one when computing hash codes?