I've spent the last few years researching GC for my PhD and realized that the ecosystem lacked standard tools to quantify GC CPU overhead—especially with modern concurrent collectors where pause times don't tell the whole story.
To fix this blind spot, I built a new telemetry framework into OpenJDK 26. This post walks through the CPU-memory trade-off and shows how to use the new API to measure exactly what your GC is costing you.
I'll be around and am happy to answer any questions about the post or the implementation!
One thing that I still struggle with, is to see how much penalty our application threads suffer from other work, say GC. In the blog you mention that GC is not only impacting by cpu doing work like traversing and moving (old/live) objects but also the cost of thread pauses and other barriers.
How can we detect these? Is there a way we can share the data in some way like with OpenTelemetry?
Currently I do it by running a load on an application and retaining its memory resources until the point where it CPU skyrockets because of the strongly increasing GC cycles and then comparing the cpu utilisation and ratio between cpu used/work.
Edit: it would be interesting to have the GC time spent added to a span. Even though that time is shared across multiple units of work, at least you can use it as a datapoint that the work was (significantly?) delayed by the GC occurring, or waiting for the required memory to be freed.
You've hit on the exact next frontier of GC observability. The API in JDK 26 tracks the explicit GC cost (the work done by the actual GC threads). Tracking the implicit costs, like the overhead of ZGC's load barriers or G1's write barriers executing directly inside your application threads, along with the cache eviction penalties, is essentially the holy grail of GC telemetry.
I have spent a lot of time thinking about how to isolate those costs as part of my research. The challenge is that instrumenting those barrier events in a production VM without destroying application throughput (and creating observer effects) is incredibly difficult. It is absolutely an area of future research I am actively thinking about, but there isn't a silver bullet for it in standard HotSpot just yet.
Something that you could look at there are some support to analyze with regards to thread pauses is time to safepoint.
Regarding OpenTelemetry. MemoryMXBean.getTotalGcCpuTime() is exposed via the standard Java Management API, so it should be able to hook into this.
This will not help directly when developing new (versions) or GC. On the other hand, if we can have a noop GC including omitting any of the barriers etc required for GC to function we can create a baseline for apps. Provided we have enough total memory to run the benchmark in.
Edit: I guess we can then also use perf to compare cache misses between runs with different GC implementations and settings. Not sure how this works out in real life as it will be very CPU, kernel, and other loads dependent.
You could stop collecting performance counters around GC phases, but you even if you are not measuring the CPU still runs through its instructions, causing the second order effects. And as you mentioned too-short-to-measure barriers and other bookkeeping overheads (updating ref counters etc) or simply the fact that some tag bits or object slots are reserved all impact performance.
There is a good write-up of the problem and a way to estimate the cost based on different GC strategies, as you suggested, here: https://arxiv.org/abs/2112.07880
The way I found to measure a no-GC baseline is to compare them in an accurate workload performance simulator. Mark all GC and allocator related code regions and have the simulator skip all those instructions. Critically that needs to be a simulator that does not deal with the functional simulation, but gets it's instructions from a functional simulator, emulator or PIN tool that does execute everything. It's laborious, not very fast and impractical for production work. But, it's the only way I found to answer a question like "What is the absolite overhead of memory management in Python?". (Answer: lower bound walltime sits around +25% avg, heavily depending on the pyperformance benchmark)
In Figure 3 you somehow have 101% wall time. :)
Regarding the colors and thread counts in Figure 4: the key piece of context here is that the application thread (the Main thread) is completely paused during this phase. It isn't actually running anything at all. Because the application is halted, only the GC threads are doing active work. Therefore, rather than three threads running at once, we strictly have two things running concurrently. This is a helpful piece of feedback and I'll make sure to make this clearer in future writings.
Good eye on the 101% wall time. That was due to a minor bug in my plotting script that specifically affected the GC plots with no concurrent time. I have corrected this and updated the post. The fixed plot should be visible on the site in a future near you just as soon as the edge caches invalidate.
Not strictly related to this post, but I figured it'd be helpful to get an authoritative answer from you on this.
Will the new metric be exposed in JFR recordings as well?
It is not currently exposed in JFR for JDK 26, but I agree that it would be the logical next step. Now that the underlying telemetry framework (cpuTimeUsage.hpp) is in place within HotSpot, wiring it up to JFR events would be a natural extension.
https://github.com/jmxtrans/jmxtrans
Kind of amazing how people are still building telemetry into Java. Great post and great work. Keep it up.
Devs keep on trying to sneak in G1GC or ZGC because they hyper focus on pause time as being the only metric of value. Hopefully this new log:cpu will give us a better tool for doing GC time and computational costs. And for me, will make for a better way to argue that "it's ok that the parallel collector had a 10s pause in a 2 hour run".
ZGC and G1 are fantastic engineering achievements for applications that require low latency and high responsiveness. However, if you are running a pure batch data pipeline where pause times simply don't matter, Parallel GC remains an incredibly powerful tool and probably the one I would pick for that scenario. By accepting the pauses, you get the benefit of zero concurrent overhead, dedicating 100% of the CPU to your application threads while they are running.
I also find that the parallel collector is often better than G1, particularly for small heaps. With modern CPUs, parallel is really fast. Those 200ms pauses are pretty easy to achieve if you have something like a 4gb heap and 4 cores.
The other benefit of the parallel collector is the off heap memory allocation is quiet low. It was a nasty surprise to us with G1 how much off heap memory was required (with java 11, I know that's gotten a lot better).
- The number of edges in a graph tend to scale superlinearly with heap size, as the number of edges possible in a graph are quadratic wrt no of objects.
- Memory bandwidth hasn't been scaling very much during the past decade and a half, even compared to memory size. It's also not a thing people think about or even easy to display in any performance monitoring tool.
But considering if you had a machine 15 years ago with 4GB or ram that could be read at 15GB/s, and now you have one with 32GB that can be read at 60GB/s, it means that your bandwidth compared to heap size has halved. Considering the quadratic nature of references, the 'amplification factor', the number of times you have to revisit an already visited block of memory is higher as well.
This is in addition to the cache trashing issues mentioned in the post.
If you need to read the whole heap, this sets a lower bound on how much time the GC will take ~0.25s on the old machine, ~0.5s on the new one.
Suppose your GC triggers a memory bandwidth issue - how do you even profile for that? This is kind of an invisible resource that just gets used up.
It also deceived programmers into failing to manage complex lifecycles. Debugging wasted memory consumption is a huge pain.
- Purely on the JVM, you probably want ZGC (or Shenandoah) because latency is more important than throughput.
- On Erlang / the BEAM VM, each thread gets its own private heap, so GC is a per thread operation. If the request doesn't spill over the heap then GC would never need to run during a request handler and all memory could be reclaimed when the handler finishes.
- There can still be cases where a request handler allocates memory that is no solely owned by it. E.g. if it causes a new database connection to be allocated in a connection pool, that connection is not owned by the request handler and should not be deallocated when the handler finishes.
- The general idea you're getting at is often called "memory regions": you can point to a scope in the code and say "all the memory can be freed when this scope exits". In this case the scope is the request handler. It's the same idea behind arena or slab memory allocation. There are languages that can encode this, and do safe automatic memory management without GC. Rust is an obvious example, but I don't find it very ergonomic. I think the OxCaml [1] and Scala 3 [2] approaches are better.
[1]: https://oxcaml.org/documentation/stack-allocation/reference/
[2]: https://docs.scala-lang.org/scala3/reference/experimental/cc...
[1] https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ft...
The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.
I've used a sampling profiler with success to find lock contention in heavily multithreaded code, but I guess there are some details that makes it not viable for this?
In general, for a production JVM like HotSpot, the implicit cost comes largely from the barriers (instructions baked directly into the application code). So even if we disable GC cycles, those barriers are still executing.
If we were to remove barriers during execution, maintaining correctness becomes the bottleneck. We would need a way to ensure we don't mark a live (reachable) object as dead the moment we re-enable the collector.
Thank you for the well written article!
One huge issue is spatial locality. Epsilon never reclaims, whereas other GCs reclaim and reuse memory blocks. This means their L2/L3 cache hit rates will be fundamentally different.
If you compare them, the delta wouldn't just be the barrier overhead; it would be the barrier overhead mixed with completely different CPU cache behaviors, memory layout etc. The GC is a complex feedback loop, so results from Epsilon are rarely directly transferable to a "real" system.