We should get rid of average CPU utilization (www.theocharis.dev)

28 points by JeremyTheo 6 hours ago | 10 comments

arianvanp5 hours ago
A more general metric that is useful to watch for is pressure stall information for CPU, IO and Memory.
https://docs.kernel.org/accounting/psi.html
I made a Prometheus exporter for it:
https://github.com/arianvp/cgroup-exporter
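For anyone who wants to look at the raw numbers first, here is a minimal sketch of parsing a PSI file such as /proc/pressure/cpu. The `parse_psi` helper and the sample values are made up for illustration; only the `some`/`full` line layout with `avg10`, `avg60`, `avg300` and `total` fields comes from the kernel docs linked above.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Hypothetical sample; on a Linux host you would read /proc/pressure/cpu instead.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"])  # share of the last 10 s in which some task was stalled
```

The `avg*` fields are percentages of wall-clock time and `total` is cumulative stall time in microseconds, so a rising `some avg10` is a hint that tasks are waiting for CPU even when utilization looks fine.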
- JeremyTheo 5 hours ago
  Yes!
JanMa 5 hours ago
I've learned the hard way that CPU resource limits in K8s are a bad idea, as can be seen in this post. Just use CPU requests without limits: the scheduler gets an estimate of your application's CPU requirements, and the application can still burst to use more CPU when it's available.
With memory, of course, you should set a limit, and from experience it should equal your memory request.
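To make the throttling effect concrete, here is a rough sketch of the arithmetic behind a CPU limit. It is a simplified model, not the real scheduler: it assumes the default 100 ms CFS period, a request that starts exactly at a period boundary with a fresh quota, and perfectly parallel work. The function name and the numbers are hypothetical.

```python
def request_latency_ms(cpu_needed_ms: float, threads: int,
                       quota_ms: float, period_ms: float = 100.0) -> float:
    """Wall-clock time to finish a request under a CFS-style CPU quota.

    Simplified model: all threads run in parallel, so CPU time is burned
    `threads` times faster than wall-clock time, and once the quota for a
    period is used up the task waits until the next period starts.
    """
    remaining = cpu_needed_ms
    elapsed = 0.0
    while True:
        burn = min(remaining, quota_ms)  # CPU time usable in this period
        remaining -= burn
        if remaining <= 0:
            return elapsed + burn / threads
        elapsed += period_ms  # throttled for the rest of the period


# A request needing 200 ms of CPU spread over 4 threads.
limited = request_latency_ms(200.0, threads=4, quota_ms=100.0)  # limit: 1 CPU
unlimited = request_latency_ms(200.0, threads=4, quota_ms=float("inf"))
print(limited, unlimited)
```

In this model the limited request takes 125 ms instead of 50 ms, yet one such request per second is only 200 ms of CPU per 1000 ms, i.e. 20% average utilization of the 1-CPU limit, which is why the average looks harmless.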
- cassianoleal 2 hours ago
  This, very much. With memory, I have seen one or two use cases where it made sense to have bigger limits than requests but it's the exception rather than the norm.
- JeremyTheo 5 hours ago
  There is also the concern that a single pod shouldn’t be able to take down an entire node, so there need to be some safeguards. But then limits cause problems of their own. I find this a really complex issue that is not widely known (only within the Kubernetes bubble).
  - ralgozino 5 hours ago
    You can use kubelet parameters to reserve node resources for system processes, so that pods can't kill the node: https://kubernetes.io/docs/tasks/administer-cluster/reserve-...
nairboon 5 hours ago
No, not at all. Why get rid of a low-level statistical measure? It's not even quite clear what the article argues against. htop doesn't even show you "average CPU utilization", it provides a sample of the current CPU utilization.
To me the problem appears to be that they are trying to do hard real-time computing with strict time guarantees, but are so far up the stack (golang library, golang scheduler, docker, kubernetes, virtualization, etc.) that they don't realize this stack can't guarantee real-time behavior. CPU utilization is a very low-level measure and, in this stack, is only indirectly related to the observed timeouts.
- joshspankit 2 hours ago
  > It's not even quite clear what the article argues against.
  I think it can be summed up as “average CPU utilization, which is the common and intuitive first check, doesn’t tell you the real story”.
  I would also suggest that these are “outdated” measurements, as common CPU metrics were really designed for moderately multi-threaded, single-foreground-application workloads on bare metal.
  To your point, someone who deeply understands the stack already knows these are not the metrics to look at, but this is clearly aimed at people who have not (yet) had to dive deep to figure out a scheduling issue
CodesInChaos 5 hours ago
It's well known that many throttling implementations are broken, usually by design. You shouldn't blame the CPU utilization metric for that footgun.
In a well-designed scheduler, a task that has been granted an allotment of at least n cores should never get throttled to less than n cores at any time. It can be limited to less than n cores if CPU utilization is at 100% and another task gets scheduled at the same time, since that's unavoidable when you oversubscribe the available resources.
zeafoamrun 6 hours ago
Same thing when it comes to memory. The rabbit hole goes on forever, and metrics lie to you if you don't know how to interpret them properly.
ahartmetz 6 hours ago
No, we shouldn't. We should measure latency if we care about latency.
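A small sketch of what that looks like in practice, assuming you already collect per-request latencies. The sample data is invented (98 fast requests and 2 slow ones), and `percentile` is a hypothetical nearest-rank helper, not a library function.

```python
import math
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: most requests are fast, a few hit a stall.
latencies_ms = [10.0] * 98 + [125.0] * 2

mean_ms = statistics.fmean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p50_ms, p99_ms)
```

The mean and median stay close to 10 ms here while p99 sits at 125 ms, so the tail is where an occasional stall becomes visible.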
- jiggawatts 6 hours ago
  I’ve come to realise that “wide logs” like OpenTelemetry traces are the only way to go, despite the expense of collecting and storing them with current technology.
  As open source columnar databases improve, the cost will drop.
VimEscapeArtist 5 hours ago
Let’s measure temperature :)
techpression 6 hours ago
Lovely read. If you’ve ever had even remotely similar issues (you think you’re looking in the right places but you’re not), it reads like a detective novel.
rimworld 6 hours ago
Great article, thanks.
ksk23 6 hours ago
TLDR; if app slow, give more resources
- luipugs 5 hours ago
  Or just don't set CPU limits: https://home.robusta.dev/blog/stop-using-cpu-limits
  - JeremyTheo 5 hours ago
    Yeah, that is mainly the point there. But it's difficult if internal company policies require limits (for security, etc.).
- andrepd 5 hours ago
  Writing better code is of course out of the question.
  - dgellow 5 hours ago
    What do you mean, I always append “make it excellent” to all my prompts!
  - inglor_cz 5 hours ago
    Shockingly many developers have never profiled any code in their life.