I've found it works better when the AI is just explaining results that come from deterministic metrics rather than inventing the analysis itself.
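Concretely, the pattern looks something like this (a rough sketch in Python; the client call follows the OpenAI Python SDK, but the metric names and findings schema are made up for illustration):

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

def summarize_scan(findings: list[dict]) -> str:
    # 1. Deterministic part: the numbers come from code, not the model.
    metrics = {
        "total_findings": len(findings),
        "critical": sum(1 for f in findings if f["severity"] == "critical"),
        "high": sum(1 for f in findings if f["severity"] == "high"),
    }

    # 2. The model only verbalizes the precomputed metrics; the prompt
    #    forbids it from introducing numbers or conclusions of its own.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": (
                "Explain the following metrics in two sentences for an "
                "executive audience. Do not add numbers or findings that "
                "are not present in the input."
            )},
            {"role": "user", "content": json.dumps(metrics)},
        ],
    )
    return response.choices[0].message.content
```

If the model hallucinates, the damage is limited to phrasing, because every figure in the report was computed upstream.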
Curious how other teams are dealing with that.
I spent months trying to get AI to generate the executive narrative, but eventually moved away from that approach. The results were often inconsistent or overly generic, which made it hard to rely on the output for serious reporting.
In the end I shifted to a fully deterministic approach where the narrative is assembled directly from structured signals and scoring logic. That made the reports far more accurate and evidence-based, and it keeps the output consistent from scan to scan.
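Under the hood it's roughly this (a simplified sketch; the signal names, thresholds, and templates are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    value: float
    threshold: float

def score(signal: Signal) -> str:
    # Deterministic scoring: the same inputs always produce the same label.
    if signal.value >= signal.threshold * 1.5:
        return "critical"
    if signal.value >= signal.threshold:
        return "elevated"
    return "nominal"

TEMPLATES = {
    "critical": "{name} is at {value:.1f}, well above the {threshold:.1f} threshold and needs immediate attention.",
    "elevated": "{name} is at {value:.1f}, above the {threshold:.1f} threshold.",
    "nominal": "{name} is within normal bounds.",
}

def narrative(signals: list[Signal]) -> str:
    # Every sentence is traceable to a concrete signal and a scoring rule,
    # so identical scans always yield identical wording.
    return " ".join(
        TEMPLATES[score(s)].format(name=s.name, value=s.value, threshold=s.threshold)
        for s in signals
    )

print(narrative([Signal("Open critical CVEs", 12, 5)]))
```

The trade-off is less fluent prose, but for reporting I'll take traceable over eloquent every time.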
Another countermeasure I use is to simply lock the code before testing. Then look over the test files and make sure they're not just following the happy path.
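What I look for in practice is tests that actually exercise the failure modes, something like this (pytest; `parse_report` and the `report` module are hypothetical stand-ins for the code under test):

```python
import pytest

from report import parse_report  # hypothetical module under test

def test_rejects_empty_input():
    # A happy-path-only suite would never feed the parser bad data.
    with pytest.raises(ValueError):
        parse_report("")

def test_rejects_unknown_severity():
    # Malformed-but-plausible input should fail loudly, not silently pass.
    with pytest.raises(ValueError):
        parse_report('{"severity": "banana"}')
```

If a suite has nothing like these, that's usually a sign it was written to pass rather than to probe.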