Put another way, don't give me a measure of center (mean) without a corresponding measure of spread (variance/standard deviation).
Providing a measure of center for something that is approximately normal doesn't obviate the need for a measure of spread: two distributions can have the same mean but wildly different spreads.
- Sometimes the definition is complicated. E.g., I'd accept a computer 10x slower at most tasks if a particular small subset were sped up even 2x. No symmetric mean (geometric, arithmetic, harmonic, or otherwise) will appropriately capture that.
- Nobody agrees on the definition. Even for very regular workloads, parsing protobufs is very different from old-school ML is very different from reversing the endianness of an unaligned data fragment. Venturing into even more diverse workloads muddies the waters further.
The article basically gives up and says that the geometric mean is the least bad option because it has some nice properties (ones I don't think anyone cares about for this purpose...) and because it's what everyone expects.
That latter argument is at least based in something sound, but I think the real point here is that attempting to come up with a single definitive "speed" is foolhardy in the first place. The only point is to compare two architectures, but without a particular context in mind no such linear ordering exists, and that's before we get to the difficulties in defining it appropriately even when given a context.
Of course this comes with its own can of worms, like overfitting and such, but I could imagine a benchmarking solution that gives you a more granular look at which specific tasks an architecture performs well on.
For the specific purpose the article addresses, I feel something simpler, like 10th/50th/90th percentile times, can be better. Two systems with times of 0.1/1/10 and 0.9/1/1.1 have roughly "the same average", but one might be adequate where the other is not.
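A quick sketch of that comparison, treating the two triples above as hypothetical 10th/50th/90th percentile times. The variable names are mine, and the 90th/10th ratio is just one possible way to summarize spread:

```python
import statistics

# Hypothetical 10th/50th/90th percentile times for two systems
system_a = [0.1, 1.0, 10.0]
system_b = [0.9, 1.0, 1.1]

# The usual single-number summaries barely tell the two apart
gm_a = statistics.geometric_mean(system_a)
gm_b = statistics.geometric_mean(system_b)
median_a = statistics.median(system_a)
median_b = statistics.median(system_b)

# Ratio of the 90th to the 10th percentile as a crude spread measure
spread_a = system_a[-1] / system_a[0]
spread_b = system_b[-1] / system_b[0]

print(f"A: geomean={gm_a:.3f} median={median_a} p90/p10={spread_a:.1f}")
print(f"B: geomean={gm_b:.3f} median={median_b} p90/p10={spread_b:.1f}")
```

The medians match and the geometric means nearly match, while the p90/p10 ratio differs by almost two orders of magnitude.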
In practical settings, the way to characterize performance will be clearer from context, often giving one(-ish) clear metric for each dimension you care about. For example, if an endpoint has a hard timeout of 100 ms, then it's pretty interesting to look at the percentage of real-world requests that are faster than 100 ms. If the same solution is also used in a setting where throughput is more important than latency, then an additional metric is probably needed for that use case. Multiple metrics are needed for multiple use cases to capture trade-offs being made between those use cases.
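For the timeout case, the metric could be as simple as the sketch below. The latency samples are made up and `fraction_within` is a hypothetical helper, not something from a real monitoring library:

```python
# Made-up request latencies in milliseconds
latencies_ms = [12, 48, 95, 101, 30, 250, 99, 100, 87, 64]
TIMEOUT_MS = 100  # the hard timeout from the example above

def fraction_within(latencies, limit_ms):
    """Share of requests that finish strictly faster than limit_ms."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    return sum(1 for t in latencies if t < limit_ms) / len(latencies)

within = fraction_within(latencies_ms, TIMEOUT_MS)
print(f"{within:.0%} of requests beat the {TIMEOUT_MS} ms timeout")
```

A throughput-oriented use case would need its own number alongside this one; this only answers the latency question.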
In the era of multivariate models, machine learning, and AI, some of the classic wisdom from good-old linear modeling gets overlooked.
At the very least, weights for each benchmark task are needed; complications of the cost model (beyond a binary old system/new system comparison) are also likely.
Measuring performance improvement by speedup of throughput is also often naive, since there are other dimensions (like power and latency) and complex requirements.
Say you have two benchmarks with different units, frames/second and instructions/second. You can't take the arithmetic mean of these unless you divide by some baseline first (or a conversion factor).
But the geometric mean has well-defined units of (frames * instructions)^0.5/second. And the reason you can divide by another geometric mean is that these units are always the same.
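One way to see the units point, as a rough sketch with invented numbers: rescale one benchmark (instructions/second to giga-instructions/second) and check which ratio of means survives. The `geomean` and `amean` helpers are my own, written out to keep the example self-contained:

```python
import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def amean(xs):
    return sum(xs) / len(xs)

# Invented results: [frames/second, instructions/second]
old = [30.0, 2.0e9]
new = [60.0, 2.2e9]

# The same measurements with the second benchmark in giga-instructions/second
old_g = [30.0, 2.0]
new_g = [60.0, 2.2]

geo_ratio = geomean(new) / geomean(old)
geo_ratio_g = geomean(new_g) / geomean(old_g)
arith_ratio = amean(new) / amean(old)
arith_ratio_g = amean(new_g) / amean(old_g)

print(f"geometric:  {geo_ratio:.4f} vs {geo_ratio_g:.4f}")
print(f"arithmetic: {arith_ratio:.4f} vs {arith_ratio_g:.4f}")
```

The ratio of geometric means does not change with the choice of unit, because the conversion factor cancels. The ratio of arithmetic means changes, since it is dominated by whichever benchmark happens to have the larger numbers.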
Having coherent units isn't exactly the same as "physical meaning", but it's a prerequisite at the least.
Think of the case with two values, 2.00 and 2.42. Decompose it into three speedups… 2x, 1.1x, and 1.1x. The 2x speedup happens with probability 1. Each of the 1.1x speedups happens with probability 0.5. The geomean is one 2x speedup and one 1.1x speedup, giving 2.2x.
There are many such decompositions; that one is not unique. Exercise for the reader to show which conditions give you the geometric mean and explain why that is reasonable… I’m terribly sleep-deprived at the moment and this is where I stop.
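Not a solution to the exercise, just a numeric check of the 2.00/2.42 example under one reading of it: each value is 2x times some number of 1.1x factors, and either value is equally likely. The variable names are mine:

```python
import math

values = [2.00, 2.42]
base, step = 2.0, 1.1  # the 2x and 1.1x factors from the example

# Number of 1.1x factors in each value: 2.00 = 2 * 1.1**0, 2.42 = 2 * 1.1**2
step_counts = [round(math.log(v / base) / math.log(step)) for v in values]

# Expected number of 1.1x factors when either value is equally likely
mean_steps = sum(step_counts) / len(step_counts)

from_decomposition = base * step ** mean_steps
geomean = math.sqrt(values[0] * values[1])

print(step_counts, mean_steps)
print(f"{from_decomposition:.4f} {geomean:.4f}")
```

Averaging the count of multiplicative factors is the same as averaging logarithms, which is why this lands on the geometric mean.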
There is of course no correct choice of mean here, just a bunch of different choices with different interpretations justifying them.
You’re attempting to describe a whole series of numbers with just one (or two) numbers.
Trying to come up with a good general purpose way to reduce/compress/aggregate data via a lossy algorithm is intractable.
While that all might sound obvious, it can be very hard to internalise this.
(And that’s before getting into the motivated reasoning that biased actors [aka normal people] will use to prefer one lossy algorithm over another.)
https://en.wikipedia.org/wiki/Moment_(mathematics)
The arithmetic mean is one of them, which would be an argument in favor of it.
arith-mean = E[x], the first moment of x
geo-mean = exp(E[log x]), so log geo-mean = E[log x], the first moment of log x
They are equivalent in the amount of information preserved, but the arithmetic mean preserves additive structure whereas the geometric mean preserves multiplicative structure.
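A small sketch of what "preserves additive vs. multiplicative structure" can mean in practice, using a made-up sample: shifting the data shifts the arithmetic mean by the same amount, and scaling the data scales the geometric mean by the same factor.

```python
import math

xs = [1.0, 2.0, 4.0, 8.0]  # made-up sample

def arith_mean(values):
    return sum(values) / len(values)                              # E[x]

def geo_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))  # exp(E[log x])

c, k = 3.0, 3.0

# Additive structure: mean(x + c) == mean(x) + c
shifted = arith_mean([x + c for x in xs])

# Multiplicative structure: geomean(k * x) == k * geomean(x)
scaled = geo_mean([k * x for x in xs])

print(arith_mean(xs), shifted)
print(geo_mean(xs), scaled)
```

Scaling also carries through the arithmetic mean, but shifting does not carry through the geometric mean, so the two are not interchangeable summaries.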
When in doubt, don't use the mean: prefer more robust estimates, as even with degenerate statistical distributions, there are still some "good numbers to report" like the mode or the median.
And if you don't know statistics, just use a plot!
However, when all else fails, define your own Von Neumann entropy. Figure out how often you compile GCC, run FFTs, or compress video, then compute probabilities (ratios) and multiply each by the logarithm of the speedup for that use case. Sum them up and report it as machine/architecture entropy and you'll win every argument about it.
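Taken literally, that recipe is a weighted sum of log speedups, which is the logarithm of a weighted geometric mean. A sketch with an invented workload mix (the shares and speedups below are placeholders, not measurements):

```python
import math

# Invented workload mix: name -> (share of your time, speedup on the new machine)
workloads = {
    "compile GCC": (0.5, 1.2),
    "FFT": (0.3, 2.0),
    "video compression": (0.2, 0.9),
}

total_share = sum(p for p, _ in workloads.values())
assert abs(total_share - 1.0) < 1e-9, "shares should sum to 1"

# Probability-weighted sum of log speedups
score = sum(p * math.log(s) for p, s in workloads.values())

# The same number expressed as a plain speedup factor
weighted_geomean = math.exp(score)

print(f"score={score:.4f} weighted geomean={weighted_geomean:.4f}")
```

With a different mix of shares the same machine gets a different score, so this is a per-user number rather than a property of the architecture alone.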