In my setup, agents propose actions and write structured reports. A deterministic quality advisory then runs — no LLM involved — producing a verdict (approve, hold, redispatch) based on pre-registered rules and open items. The agent can hallucinate all it wants inside its context window, but the only way its work reaches production is through a receipt that links output to a specific git commit, with a quality gate in between.
For anything with real consequences (database writes, API calls, refunds), the pattern is: LLM proposes → deterministic validator checks → human approves. The LLM never has direct write access to anything that matters.
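A minimal sketch of that propose → validate → approve flow, under stated assumptions: the `Proposal` fields, the allowlist, and the refund limit are hypothetical, and the human approval and the actual write are passed in as callbacks rather than taken from any real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """What the LLM is allowed to emit: a description of an action, not the action itself."""
    action: str
    amount_cents: int
    target: str


# Hypothetical limits, chosen only for illustration.
ALLOWED_ACTIONS = {"refund", "update_record"}
MAX_REFUND_CENTS = 5_000


def validate(p: Proposal) -> list[str]:
    """Deterministic checks; returns a list of problems (empty means the proposal passes)."""
    errors = []
    if p.action not in ALLOWED_ACTIONS:
        errors.append(f"action {p.action!r} is not allowed")
    if p.action == "refund" and not (0 < p.amount_cents <= MAX_REFUND_CENTS):
        errors.append("refund amount outside permitted range")
    return errors


def run(
    p: Proposal,
    human_approves: Callable[[Proposal], bool],
    execute: Callable[[Proposal], None],
) -> str:
    """Only this function ever calls execute(); the LLM never holds that handle."""
    errors = validate(p)
    if errors:
        return "rejected: " + "; ".join(errors)
    if not human_approves(p):
        return "declined by reviewer"
    execute(p)
    return "executed"
```

The point of the shape is that the model's output is plain data. Nothing happens unless the validator passes and a person says yes.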
"Just hoping for the best" works until it doesn't. We tracked every agent decision in an append-only ledger — after a few hundred entries, you start seeing exactly where and how agents fail. That pattern data is more useful than any prompt guard.
The append-only ledger point is underrated too — pattern data from real failures is worth more than any upfront rule design.
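For anyone wondering what an append-only ledger can look like in practice, here is one possible sketch using a JSON Lines file. The file layout and field names are assumptions, not the setup described above; the only property that matters is that entries are appended and never rewritten.

```python
import json
import time
from pathlib import Path


def append_entry(ledger: Path, entry: dict) -> None:
    """Append one decision record as a single JSON line; existing lines are never touched."""
    record = {"ts": time.time(), **entry}
    with ledger.open("a", encoding="utf-8") as f:  # "a" mode only ever writes at the end
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_entries(ledger: Path) -> list[dict]:
    """Load every record in the order it was written."""
    if not ledger.exists():
        return []
    with ledger.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Opening in append mode does not stop someone from editing the file by hand, so this is append-only by convention. Stronger guarantees would need something outside this sketch.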
How long did it take to build and maintain that governance layer? And as your agent evolves, do the rules keep up or is that becoming its own maintenance burden?
The maintenance question is the right one. The rules themselves are low-maintenance because they're deliberately simple and deterministic — file size limits, test coverage thresholds, blocker counts. They don't need updating when the model changes because they don't depend on LLM behavior.
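To make "deliberately simple and deterministic" concrete, here is a rough sketch of a gate over the three kinds of rules mentioned. The thresholds, the `Report` fields, and the mapping from each rule to a verdict are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Structured facts pulled from an agent's report; field names are hypothetical."""
    max_file_lines: int    # size of the largest file touched
    test_coverage: float   # fraction between 0.0 and 1.0
    open_blockers: int     # unresolved open items marked as blockers


# Hypothetical thresholds, registered before any agent runs.
MAX_FILE_LINES = 500
MIN_COVERAGE = 0.80


def verdict(r: Report) -> str:
    """Pure function of the report: same input, same verdict, no model involved."""
    if r.open_blockers > 0:
        return "hold"
    if r.max_file_lines > MAX_FILE_LINES or r.test_coverage < MIN_COVERAGE:
        return "redispatch"
    return "approve"
```

Nothing here looks at how the output was produced, which is why swapping the model would not require touching the rules.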
What does evolve is the dispatch templates — how I scope tasks and what context I give agents upfront. That's where the ledger pays for itself. After 1100+ receipts, I can see patterns like "tasks scoped above 300 lines fail 3x more often" or "planning gates without explicit deliverables always need redispatch." Those patterns feed back into how I write dispatches, not into the rules themselves.
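A pattern like the 300-line one can fall out of a very small query over the receipts. This sketch assumes each receipt is a dict with hypothetical `scoped_lines` and `verdict` keys; the real receipt schema is not described here.

```python
from collections import defaultdict


def redispatch_rate_by_size(receipts, threshold=300):
    """Return the fraction of redispatched tasks for scopes above and at-or-below threshold."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [redispatched, total]
    for r in receipts:
        bucket = "large" if r["scoped_lines"] > threshold else "small"
        counts[bucket][1] += 1
        if r["verdict"] == "redispatch":
            counts[bucket][0] += 1
    return {bucket: bad / total for bucket, (bad, total) in counts.items()}
```

Comparing the two rates is what turns "large tasks feel riskier" into a number that can change how the next dispatch is scoped.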
So the rules stay stable, but the way I use the system keeps improving. The governance layer is the boring part — the interesting part is the feedback loop from receipts to dispatch quality.
LLMs ignore instructions. They do not have judgement, just the ability to predict the most likely next token (with some chance of selecting one other than the absolutely most likely). There’s no way around that. If you need actual judgement calls, you need actual humans.
We landed on the same pattern: LLM handles the understanding, hard rules handle the permission. The tricky part is maintaining those rules as the agent evolves. How are you managing rule updates: a code change every time, or something more dynamic?
Serious question. Assuming you knew this, why did you choose to use LLMs for this job?
Yet they don't understand the intent of "Never do X"?
Worth looking at islo.dev if you want the sandboxing piece without building it yourself.