I think AI vibe codes that because it’s probably seen that default so much.
If you're a company and you have several GPU machines in a cluster, then this is kinda useless b/c you'd have to go on each container or node to view the dashboard.
Sure, there's a cost to using opentelemetry + whatever storage+viz backend, but once it's set up you can actually do alerting, historical views, analysis, etc. easily.
I suppose it’s trivial to proxy an HTTP port over SSH though, so that would seem like a good solution.
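A rough sketch of that SSH approach, assuming a standard OpenSSH client on your machine. The host name `gpu-node-01` and port `8080` are made-up placeholders (not gpu-hot's documented defaults), and `tunnel_command` / `open_tunnel` are hypothetical helpers, so adjust to whatever your dashboard actually listens on:

```python
import subprocess


def tunnel_command(host, remote_port, local_port=None):
    """Build an `ssh` local port-forward command for one node.

    -N: do not run a remote command, just forward.
    -L: forward local_port on this machine to remote_port on the node.
    """
    if local_port is None:
        local_port = remote_port
    return ["ssh", "-N", "-L", f"{local_port}:localhost:{remote_port}", host]


def open_tunnel(host, remote_port, local_port=None):
    """Start the tunnel in the background; call .terminate() on the result to close it."""
    return subprocess.Popen(tunnel_command(host, remote_port, local_port))


if __name__ == "__main__":
    # Placeholder host and port, for illustration only.
    print(" ".join(tunnel_command("gpu-node-01", 8080, local_port=9001)))
```

With several nodes you would give each one a different local port and open one browser tab per tunnel, which is still manual compared to a central metrics backend.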
I think it’s super cool. Clean design. Great for your local self-hosted system or one of your local company systems in the office.
If you have a fleet of GPUs then maybe use your SSH CLI. This is fun and cool looking though.
I did notice that nvidia-smi shows the process name as plex-transcoding but gpu-hot is showing [Not Found]. Not sure if that is where the process name is supposed to go.
Does not change the usefulness of this dashboard, just wanted to point it out.
But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model and based on that figure out how well your model is maxing out the GPU capabilities (MFU/HFU).
Here is a more in-depth example on how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
For more information: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multi...
It does not at all take into account how much that thread is actually using the core to its capacity.
So if e.g. your thread is blocked waiting on some data from another GPU (NCCL) and actually doing nothing, it will still show 100% utilisation. A good way to see this is when an NCCL call times out after 30 minutes for some reason, and you can see all your GPUs (except the one that caused the failure) were at 100% util, even though they clearly did nothing but wait.
Another example is operations with low compute intensity. Say you want to add 1 to every element in a very large tensor: you effectively have to transfer every element (let's say FP8, so 1 byte) from HBM to L2, which is a very slow operation, to then simply do an add, which is extremely fast. It takes about 1000x more time to move that byte to L2 than it takes to actually do the add, so in effect your "true" utilization is ~0.2%, but nvidia-smi (and this tool) will show 100% for the entire duration of that add.
Sadly there isn't a great general way to monitor "true" utilization during training. Generally you have to come up with an estimate of how many FLOPs your model requires per pass, look at the time it takes to do said pass, and compare the FLOPs/sec you get to Nvidia's spec sheet. If you get around 60% of theoretical FLOPs for a typical transformer LLM training run, you are basically at max utilization.
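A back-of-the-envelope version of that estimate, as a sketch only. It assumes the common "6 FLOPs per parameter per token" approximation for a dense transformer's forward plus backward pass (which ignores attention FLOPs and any recomputation), and the peak figure in the example is the A100 dense BF16 spec-sheet number; `estimate_mfu` and all the example values are illustrative, not measurements:

```python
def estimate_mfu(n_params, tokens_per_step, step_time_s, n_gpus, peak_flops_per_gpu):
    """Rough model FLOPs utilization (MFU) for dense transformer training.

    Assumption: one training step costs about 6 * n_params FLOPs per token
    (forward + backward), ignoring attention and activation recomputation.
    """
    model_flops_per_step = 6 * n_params * tokens_per_step
    achieved_flops_per_s = model_flops_per_step / step_time_s
    peak_flops_per_s = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


if __name__ == "__main__":
    # Illustrative numbers only: a 1B-parameter model, 1M tokens per step,
    # 1 second per step, on 64 GPUs with 312 TFLOP/s peak each.
    mfu = estimate_mfu(
        n_params=1e9,
        tokens_per_step=1e6,
        step_time_s=1.0,
        n_gpus=64,
        peak_flops_per_gpu=312e12,
    )
    print(f"estimated MFU: {mfu:.1%}")
```

Plug in your own measured step time and the spec-sheet peak for the precision you actually train in; the result is only as good as the FLOP estimate for your architecture.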
But when you get to the point where you care about a few percentage points of utilisation, it's just not reliable enough, as many things can impact energy consumption both ways. E.g. we had a case where the GPU cluster we were using wasn't being cooled well enough, so you would gradually see power draw getting lower and lower as the GPUs were throttling themselves to not overheat.
You can also find cases where energy consumption is high but MFU/HFU isn't, like memory-intensive workloads.
It's misleading on CPUs as well, just to a much, much lesser extent, to the point where the number is actually useful there.
Basically, the OS sees the CPU as being composed of multiple cores; that's the level of abstraction. Thus, the OS calculates "portion of the last second where at least one instruction was sent to this core" for each core and then reports it. The single-number version is an average of each core's value.
On the other hand, the OS cannot calculate stuff inside each core, since the CPU hides that as part of its abstraction. That is, you cannot know "I$ utilisation", "FPU utilisation", etc.
In the GPU, the OS doesn't even see each SM (streaming multiprocessor, loosely analogous to a CPU core). It just sees the whole GPU as one black-box abstraction. Thus, it calculates utilisation as "portion of the last second where at least one kernel was executing on the whole GPU". It cannot calculate intra-GPU util at all. So one kernel executing on one SM looks the same to the OS as that kernel executing on tens of SMs!
This is the crux of the issue.
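To make that concrete, here is a toy model of the two metrics. This is not how the driver actually computes anything; `KernelRun`, the 108-SM count, and the timings are invented for illustration. The coarse number only asks "was any kernel running?", while the per-SM number weights each kernel by how many SMs it occupied:

```python
from dataclasses import dataclass


@dataclass
class KernelRun:
    start: float   # seconds since the start of the sampling window
    end: float
    sms_used: int  # how many SMs the kernel occupied (toy assumption)


def coarse_util(runs, window):
    """Fraction of the window during which at least one kernel was running."""
    busy = 0.0
    cur_start = cur_end = None
    for r in sorted(runs, key=lambda r: r.start):
        s, e = max(r.start, 0.0), min(r.end, window)
        if e <= s:
            continue  # kernel lies outside the window
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start  # close the previous merged interval
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)  # overlapping kernels: extend the interval
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / window


def per_sm_util(runs, window, total_sms):
    """Average busy fraction across SMs, weighting each kernel by SMs used."""
    sm_seconds = 0.0
    for r in runs:
        s, e = max(r.start, 0.0), min(r.end, window)
        if e > s:
            sm_seconds += (e - s) * r.sms_used
    return sm_seconds / (window * total_sms)


if __name__ == "__main__":
    one_sm = [KernelRun(0.0, 1.0, sms_used=1)]
    all_sms = [KernelRun(0.0, 1.0, sms_used=108)]
    for name, runs in [("1 SM", one_sm), ("108 SMs", all_sms)]:
        print(name, coarse_util(runs, 1.0), per_sm_util(runs, 1.0, 108))
```

Both workloads report 100% under the coarse definition, while the per-SM average separates them, which is the distinction the OS-level number cannot make.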
With performance counters (perf for CPU, or Nsight Compute for GPU), lots of stuff visible only inside the hardware abstraction can be calculated (SM util, warp occupancy, tensor util, etc.).
The question then is: why isn't work scheduled onto each SM by the OS/driver, instead of by a microcontroller in the hardware itself, on the other side of the interface?
Well, I think it's for efficiency reasons, and also so Nvidia has more freedom to change it without compat issues from being tied to the OS, and similar reasons. If scheduling did happen in the OS, however, then the OS could calculate util for each SM and average it, giving you more accurate values: the case with the kernel running on 1 SM would report a smaller util than the case with the kernel executing on 15 SMs.
IME, measuring with Nsight Compute causes anywhere from a 5% to 30% performance overhead, so if that's OK for you, you can enable it and get more useful measurements.
// solves everything that the above container claims to do lol
Also check out netdata, amazing project.
python-socketio==5.8.0: 1 CVE (CVE-2025-61765); Remote Code Execution via malicious pickle deserialization in multi-server setups.
eventlet==0.33.3: 1 CVE (CVE-2025-58068); HTTP request smuggling from improper trailer handling.
And then economists wonder why none of these people are getting jobs... Eventlet 0.33 is ancient, no idea why they would use that.
With this said, most people should have some kind of SCA to ensure they're not using ancient packages. Conversely, picking up a package the day it's released has bitten a lot of people when the repository in question gets pwned.