If there are new bugs every day that need fixing, is the AI really good enough to figure out the fix from just an error message?
Apps written in an exception-based language (Java, JavaScript, PHP, etc.) are really annoying to monitor because everything that isn't the happy path triggers an 'error'/'fatal' log/metric.
Yes, you can technically work around it with (near) Go-level error verbosity (try/catches everywhere on every call), but I've never seen a team actually do that.
Modern languages that don't throw exceptions for every error, like Rust, Go, and Zig, produce much saner telemetry in my experience.
On this note, a login failure is not an error; it's a warning, because there is no action to take. It's an expected outcome. Errors should be actionable. WARN should be for things that in aggregate (like login failures) point to an issue.
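Roughly what I mean, as a minimal Go sketch (checkPassword and ErrBadPassword are made-up names, not a real library): with error values the call site picks the severity, whereas an uncaught exception usually lands in some generic handler that logs ERROR no matter what.

```go
package main

import (
	"errors"
	"log/slog"
)

// Hypothetical stand-ins for whatever your auth layer actually looks like.
var ErrBadPassword = errors.New("bad password")

func checkPassword(user, password string) error {
	if password != "hunter2" {
		return ErrBadPassword // expected outcome; the caller decides the severity
	}
	return nil
}

func main() {
	if err := checkPassword("alice", "wrong"); err != nil {
		if errors.Is(err, ErrBadPassword) {
			// Expected failure: WARN, so the error dashboard stays readable.
			slog.Warn("login failed", "user", "alice", "err", err)
		} else {
			// Unexpected failure: this is the actionable ERROR.
			slog.Error("auth backend failure", "user", "alice", "err", err)
		}
	}
}
```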
Login failure is like the most important error you'll track. A login failure isn't necessarily actionable but a spike of thousands of them for sure is. No single system has been more responsible for causing outages in my career than auth. And I get that it's annoying when they appear in your Rollbar but sometimes Login Failed is the only signal you get that something is wrong.
Some 3rd party IdP saying "nope" can be innocuous when it's a few people but a huge problem when it's because they let their cert/application token expire.
And I can already hear the "it should be a metric with an alert" and you're absolutely right. Except that it requires that devs take the positive action of updating the metric on login failures vs doing nothing and letting the exception propagate up. And you just said login failures aren't errors and "bad password" obviously isn't an error so no need to update the metric on that and cause chatty alerts. Except of course that one time a dev accidentally changed the hashing algorithm. Everyone was really bad at typing their password that day for some reason.
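To be clear about what that "positive action" looks like, a rough sketch with the Prometheus Go client (metric name and handler are made up, the real auth check is elided): someone has to remember to write the Inc() call, whereas the exception propagates whether anyone thinks about it or not.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric: a dev has to remember to add and update this.
var loginFailures = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "login_failures_total",
	Help: "Failed login attempts, labelled by reason.",
}, []string{"reason"})

func handleLogin(w http.ResponseWriter, r *http.Request) {
	authOK := false // stand-in for the real auth check
	if !authOK {
		loginFailures.WithLabelValues("bad_password").Inc()
		http.Error(w, "login failed", http.StatusUnauthorized)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/login", handleLogin)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```

And even then you only catch the hashing-algorithm day if someone also writes an alert on something like rate(login_failures_total[5m]) jumping well above its baseline.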
And for many alerts you need to look at the other events around them to properly classify and (partially) solve them. Because of that you need to give the AI more than just the alerts.
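For illustration, a sketch of what "more than just the alert" could mean (Alert, LogEvent, and queryLogs are all made-up stand-ins for whatever your stack provides): pull the log events from a window around the alert and hand the whole bundle over, instead of the single alert line.

```go
package main

import (
	"fmt"
	"time"
)

// Everything here is illustrative; none of these types exist in a real library.
type Alert struct {
	Name string
	At   time.Time
}

type LogEvent struct {
	At   time.Time
	Line string
}

// queryLogs would hit your log store (Loki, Elasticsearch, ...) in real life.
func queryLogs(from, to time.Time) []LogEvent {
	return nil // placeholder
}

// buildContext gathers the events around the alert so it can be classified
// in context rather than in isolation.
func buildContext(a Alert) string {
	window := 5 * time.Minute
	events := queryLogs(a.At.Add(-window), a.At.Add(window))
	out := fmt.Sprintf("alert: %s at %s\n", a.Name, a.At.Format(time.RFC3339))
	for _, e := range events {
		out += fmt.Sprintf("%s %s\n", e.At.Format(time.RFC3339), e.Line)
	}
	return out
}

func main() {
	fmt.Println(buildContext(Alert{Name: "login_failure_spike", At: time.Now()}))
}
```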
Though I do see a risk similar to wrongly tuned alerts:
Not everything that resolves by itself and can be ignored _in this moment_ is a non-issue. It's pretty common, for example, that a system with some rare, ignorable warns/errs falls completely flat when onboarding a lot of users, introducing a new high-load feature, etc., due to exactly the things you could fully ignore beforehand.
Understanding what your code looks like in production gives you a much better sense of how to update it, and how to fix it when it does inevitably break. I think having AI do that checking for you will make building that sense basically impossible, and that probably makes it a pretty bad idea.
I'm not sure I'd do this once a day. I tend to take note of things to build that intuition when I have other reasons to go and look at dashboards, and we have a weekly SLO review as a team, but perhaps there's a place for this in some way.