1 point by cholmess2 14 hours ago | 1 comment
  • cholmess2 14 hours ago
    We’ve been experimenting with adding deterministic guardrails to LLM changes before merge.

    When swapping models or tweaking prompts, subtle regressions can slip in:

    - cost spikes
    - format drift
    - PII leakage

    Traditional CI assumes deterministic output, which LLMs don't produce.

    We built a small local-first CLI that compares baseline vs candidate outputs and returns ALLOW / WARN / BLOCK based on cost, drift, and PII.
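    To make that concrete, here's a rough Python sketch of the kind of check involved. This is not the actual implementation; the field names, regexes, and thresholds are illustrative placeholders:

        import json
        import re
        import sys

        # Toy PII patterns; a real gate would use a proper detector.
        PII_PATTERNS = [
            re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US-SSN-shaped numbers
            re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
        ]

        def verdict(baseline: dict, candidate: dict,
                    cost_warn=1.2, cost_block=1.5, drift_warn=0.3) -> str:
            """Return ALLOW / WARN / BLOCK for one baseline/candidate output pair."""
            # Cost: per-request cost of the candidate as a ratio over the baseline.
            cost_ratio = candidate["cost_usd"] / max(baseline["cost_usd"], 1e-9)

            # Format drift: crude token-set Jaccard distance between the two outputs.
            b, c = set(baseline["output"].split()), set(candidate["output"].split())
            drift = 1 - len(b & c) / max(len(b | c), 1)

            # PII: block if the candidate leaks a pattern the baseline didn't.
            leaked = any(p.search(candidate["output"]) and not p.search(baseline["output"])
                         for p in PII_PATTERNS)

            if leaked or cost_ratio > cost_block:
                return "BLOCK"
            if cost_ratio > cost_warn or drift > drift_warn:
                return "WARN"
            return "ALLOW"

        if __name__ == "__main__":
            base, cand = (json.load(open(p)) for p in sys.argv[1:3])
            print(verdict(base, cand))

    That's roughly the shape of the decision; the thresholds above are made up, and in practice this runs over many prompt/output pairs rather than a single one.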

    Curious how others are handling this problem:

    Are you snapshot testing?

    Using SaaS evaluation tools?

    Relying on manual review?

    Not gating at all?

    Would love to understand real workflows.