Like, an eval will tell you the model gave a bad answer. It won't tell you that your agent passes that answer straight into a shell command, or that a loop has no exit condition and burns through your API budget overnight.
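To make that concrete, here's a toy sketch of both patterns in one agent loop (made-up names, not from any real repo we scanned):

```python
import subprocess

class StubModel:
    """Stand-in for an LLM client; a real agent would call an API here."""
    def __init__(self, replies):
        self._replies = iter(replies)

    def complete(self, prompt):
        return next(self._replies)

def run_agent(model, task):
    history = [task]
    # Problem 1: no iteration cap or timeout. If the model never says
    # "DONE", this loop calls the API (and bills you) forever.
    while True:
        answer = model.complete("\n".join(history))
        if answer.strip() == "DONE":
            return history
        # Problem 2: model output goes straight into a shell.
        # A response like "ls; rm -rf ~" would execute verbatim.
        result = subprocess.run(answer, shell=True,
                                capture_output=True, text=True)
        history.append(result.stdout)

# With a cooperative stub it terminates fine, which is exactly why
# an output-only eval won't flag it:
out = run_agent(StubModel(["echo hello", "DONE"]), "list the files")
```

Every eval run can pass while both of these sit in the code path.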
We've been working on this: static analysis that reads agent code and maps out what can go wrong before you deploy. We found issues in ~80% of the repos we scanned.
Would be great to get your feedback: https://github.com/inkog-io/inkog