In your experience, were there usually early signals in the metrics before job failures started to increase?
I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.
For example:
- unusual latency patterns
- slow resource saturation
- network anomalies
Do people actively monitor these patterns, or do they mostly rely on threshold alerts?