source: I wrote a bunch of this code and I’ve tested it fairly extensively.
Using the final version, you would just make the cache refresh() function emit the clock adjustment log entry instead of actually caching anything. Then any later log entry's TSC would implicitly be relative to that clock adjustment entry when you decode the log. Worst case, you would need to persist every clock adjustment entry even when sampling, but that would still only be on the order of a few KB/s, and you could still drop clock adjustment entries that have no other entries between them.
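To make the decode side concrete, here is a rough sketch of what I mean. It is not the actual code: the entry layout and every name in it (ClockAdjustment, Event, wall_ns, ns_per_tick, decode, compact) are made up for illustration, and it assumes each clock adjustment entry carries a TSC reading, the wall-clock time at that same instant, and an estimated tick period.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple, Union


@dataclass(frozen=True)
class ClockAdjustment:
    """Hypothetical entry emitted by refresh() instead of caching."""
    tsc: int            # TSC reading taken at refresh time
    wall_ns: int        # wall-clock time (ns) at that same instant
    ns_per_tick: float  # estimated TSC period (assumed field)


@dataclass(frozen=True)
class Event:
    """Hypothetical ordinary log entry carrying only a raw TSC."""
    tsc: int
    payload: str


Entry = Union[ClockAdjustment, Event]


def decode(log: Iterable[Entry]) -> Iterator[Tuple[int, str]]:
    """Yield (wall_ns, payload), timing each event against the most
    recent clock adjustment entry that precedes it in the log."""
    base: Optional[ClockAdjustment] = None
    for entry in log:
        if isinstance(entry, ClockAdjustment):
            base = entry
        elif base is not None:
            delta_ns = round((entry.tsc - base.tsc) * base.ns_per_tick)
            yield base.wall_ns + delta_ns, entry.payload
        # Events seen before any adjustment cannot be timed; skip them.


def compact(log: Iterable[Entry]) -> List[Entry]:
    """Drop a clock adjustment when another adjustment follows it with
    no ordinary event in between; only the later one is ever used."""
    out: List[Entry] = []
    for entry in log:
        if (isinstance(entry, ClockAdjustment)
                and out and isinstance(out[-1], ClockAdjustment)):
            out[-1] = entry  # the earlier adjustment is redundant
        else:
            out.append(entry)
    return out


log = [
    ClockAdjustment(tsc=1000, wall_ns=5_000_000, ns_per_tick=0.5),
    ClockAdjustment(tsc=2000, wall_ns=5_000_500, ns_per_tick=0.5),
    Event(tsc=2100, payload="a"),
    ClockAdjustment(tsc=3000, wall_ns=5_001_000, ns_per_tick=0.5),
    Event(tsc=3004, payload="b"),
]
decoded = list(decode(log))
compacted = compact(log)
```

The point of compact() is that decoding the compacted log gives the same timestamps as decoding the full one, so sampling never costs you accuracy, only the redundant adjustment entries.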
If the thing of interest just runs on the CPU briefly, tracing is not what you want. You want a profiler that only runs when you're looking at it. Distributed tracing is for things that can go wrong and take uncertain amounts of time.
In this case distributed tracing absolutely was the right choice. These were not simple computational tasks. The components were highly stateful and interconnected both on- and cross-host. Between this and the timescale, as well as the volume of events and the dollar-value impact of each potential failure (of which there were many), we needed real-time analysis capabilities, not a profiler.