3 points by gabdiax 7 hours ago | 2 comments
  • zippyman55 7 hours ago
    My team was responsible for the system administration of a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts, and reported the results as mean-time-to-failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs. What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else's bad code. It was easy to show that a job's failure rate was increasing and link it to a recent upgrade or change. But I felt I was often just shining the flashlight at an issue and trying to encourage a responsible party to take ownership.
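    (The control-chart idea above might be sketched roughly as follows; this is a minimal p-chart over abort rates, not the commenter's actual tooling, and the program counts are invented for illustration.)

```python
# Hypothetical sketch of statistical process control (a p-chart) over
# per-program job abort rates, in the spirit of the comment above.
import math

def p_chart_limits(failures, jobs):
    """Center line and 3-sigma control limits for a failure proportion."""
    p_bar = sum(failures) / sum(jobs)  # overall (pooled) failure rate
    limits = []
    for n in jobs:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        limits.append((max(0.0, p_bar - 3 * sigma),
                       min(1.0, p_bar + 3 * sigma)))
    return p_bar, limits

def out_of_control(failures, jobs):
    """Indices of periods whose failure rate falls outside the limits."""
    _, limits = p_chart_limits(failures, jobs)
    flagged = []
    for i, (f, n) in enumerate(zip(failures, jobs)):
        rate = f / n
        lo, hi = limits[i]
        if rate < lo or rate > hi:
            flagged.append(i)
    return flagged

# Weekly aborts vs. jobs submitted for one program; the last week
# follows a (hypothetical) system upgrade.
failures = [2, 3, 2, 4, 20]
jobs     = [200, 210, 190, 205, 200]
print(out_of_control(failures, jobs))  # -> [4]: the post-upgrade spike
```

    A rate that jumps outside the control limits right after an upgrade is exactly the kind of evidence described above for linking a regression to a specific change.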
    • gabdiax 7 hours ago
      That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

      In your experience, were there usually early signals in metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies.

      I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.

  • gabdiax 7 hours ago
    One thing I'm particularly curious about is whether teams see early signals in metrics or logs before incidents actually happen.

    For example:
    - unusual latency patterns
    - slow resource saturation
    - network anomalies

    Do people actively monitor these patterns or mostly rely on threshold alerts?
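    (One way to frame the threshold-vs-pattern distinction is to compare a static threshold alert with a drift detector on the same series. The sketch below uses an EWMA of mean and variance and flags points more than 3 sigma from the running mean; the latency numbers are invented for illustration.)

```python
# Hypothetical comparison: static threshold alert vs. EWMA drift
# detection on a latency series (ms). Made-up data for illustration.
def ewma_alerts(series, alpha=0.3, k=3.0, threshold=100.0):
    """Return (static, drift): indices where each method first fires."""
    static, drift = [], []
    mean = series[0]   # EWMA estimate of the mean
    var = 0.0          # EWMA estimate of the variance
    for i, x in enumerate(series[1:], start=1):
        sigma = var ** 0.5
        if x > threshold:                       # classic threshold alert
            static.append(i)
        if sigma > 0 and abs(x - mean) > k * sigma:  # drift detector
            drift.append(i)
        # update running estimates (West's EWMA mean/variance recursion)
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return static, drift

# Latency creeps up well before it finally breaches the 100 ms threshold.
latency = [20, 21, 20, 21, 20, 21, 40, 45, 60, 75, 90, 110]
static, drift = ewma_alerts(latency)
print(static, drift)  # drift fires at index 6, the threshold only at 11
```

    The point of the sketch: a pattern-based detector can fire several samples before the static threshold does, which is the kind of lead time the question above is asking about.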