https://docs.kernel.org/accounting/psi.html
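For anyone who hasn't looked at PSI yet, here is a minimal sketch of reading it from Python. It assumes a Linux kernel with PSI enabled and the `/proc/pressure/cpu` file format described in the kernel docs linked above; the function names `parse_psi` and `read_cpu_pressure` are made up for illustration.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into a dict like {"some": {"avg10": 1.5, ..., "total": 123}}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is cumulative stall time in microseconds; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Read and parse CPU pressure; the default path is an assumption (PSI must be enabled)."""
    return parse_psi(Path(path).read_text())
```

The `some` line is the share of time at least one task was stalled waiting for CPU, which is much closer to "are my requests waiting?" than a utilization percentage.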
I made a Prometheus exporter for it:
With memory, of course, you should set a limit, and from experience it should be the same as your memory request.
To me the problem appears to be that they try to do some hard realtime computing with strict time guarantees, but are so far up the stack (golang library, golang scheduler, docker, kubernetes, virtualization, etc.), that they don't realize that this stack can't guarantee you realtime computing. CPU utilization is a very low-level measure and, in this stack, is only indirectly related to the observed timeouts.
I think it can be summed up as “average CPU utilization, which is the common and intuitive first check, doesn’t tell you the real story.”
I would also suggest that these are “outdated” measurements, since common CPU metrics were really designed for a moderately multi-threaded, single foreground application running on bare metal.
To your point, someone who deeply understands the stack already knows these are not the metrics to look at, but this is clearly aimed at people who have not (yet) had to dive deep to figure out a scheduling issue.
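A rough worked example of why the average misleads. This is a sketch that assumes the quota-per-period model used by Linux CFS bandwidth control (which is what container CPU limits typically map onto), that every thread is busy the whole time, and that each thread gets its own free core. The function name and the numbers are illustrative only.

```python
def throttled_ms_per_period(limit_cores, runnable_threads, period_ms=100.0):
    """Wall-clock ms per period that a fully busy task spends throttled."""
    # CPU time the task may consume in one period, summed over all cores.
    quota_ms = limit_cores * period_ms
    # With every thread running on its own core, the quota is used up this quickly.
    burn_ms = quota_ms / runnable_threads
    # Whatever is left of the period is spent throttled (never negative).
    return max(0.0, period_ms - burn_ms)


# A 2-core limit with 8 busy threads: quota is gone after 25 ms, then 75 ms stalled.
print(throttled_ms_per_period(2, 8))  # 75.0
# The same limit with only 2 busy threads never hits the quota.
print(throttled_ms_per_period(2, 2))  # 0.0
```

In the first case the task averages exactly 2 cores over the period, so a utilization graph looks fine, yet every request that arrives during the 75 ms stall waits.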
In a well-designed scheduler, a task that has been granted an allotment of at least n cores should never be throttled to less than n cores at any time. The one exception is when CPU utilization is at 100% and another task gets scheduled at the same time, since that's unavoidable when you oversubscribe the available resources.
As open source columnar databases improve, the cost will drop.