When swapping models or tweaking prompts, subtle regressions can slip in:
- cost spikes
- format drift
- PII leakage
Traditional CI assumes deterministic output, which LLMs don't give you.
We built a small local-first CLI that compares baseline vs candidate outputs and returns ALLOW / WARN / BLOCK based on cost, drift, and PII.
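To give a feel for the idea (this is not the tool's actual code, just a minimal Python sketch of the gating shape, where the thresholds, regexes, and function names are placeholder assumptions):

```python
import re
from dataclasses import dataclass

# Placeholder PII patterns; a real gate would use a proper detector.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-like numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
]

@dataclass
class Verdict:
    decision: str   # "ALLOW" | "WARN" | "BLOCK"
    reasons: list

def gate(baseline_text: str, candidate_text: str,
         baseline_cost: float, candidate_cost: float,
         warn_cost_ratio: float = 1.25, block_cost_ratio: float = 2.0) -> Verdict:
    reasons = []
    decision = "ALLOW"

    # Cost spike: compare candidate spend to the baseline run.
    ratio = candidate_cost / max(baseline_cost, 1e-9)
    if ratio >= block_cost_ratio:
        return Verdict("BLOCK", [f"cost {ratio:.2f}x baseline"])
    if ratio >= warn_cost_ratio:
        decision = "WARN"
        reasons.append(f"cost {ratio:.2f}x baseline")

    # Format drift: crude length-delta check standing in for a real
    # structural diff (JSON schema, key order, markdown shape, ...).
    drift = abs(len(candidate_text) - len(baseline_text)) / max(len(baseline_text), 1)
    if drift > 0.5:
        decision = "WARN"
        reasons.append(f"output length drifted {drift:.0%}")

    # PII leakage: block if the candidate leaks a pattern the baseline didn't.
    for pattern in PII_PATTERNS:
        if pattern.search(candidate_text) and not pattern.search(baseline_text):
            return Verdict("BLOCK", reasons + [f"new PII match: {pattern.pattern}"])

    return Verdict(decision, reasons)
```

In CI you'd run this per prompt/test case and fail the job on any BLOCK, surface WARNs in the PR, and let ALLOW pass silently.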
Curious how others are handling this problem:
Are you snapshot testing?
Using SaaS evaluation tools?
Relying on manual review?
Not gating at all?
Would love to understand real workflows.