Ask HN: How do you shut down misbehaving AI in production?

2 pointsby nordic_lion8 hours ago2 comments

jamiemallers7 hours ago
We run AI agents in production and our kill chain looks roughly like this:
1. Circuit breakers per-agent with token/cost ceilings. If an agent burns through more than X tokens in Y seconds, it gets hard-stopped at the proxy layer before the request even hits the model provider. This catches runaway loops fast.
2. Tool-level allowlists with runtime revocation. Each agent has an explicit list of tools/APIs it can call. We can revoke individual tool access without killing the whole agent — useful when you discover it's hammering one specific external service.
3. Graceful degradation before kill. For non-critical paths, we drop to a cached/static fallback rather than killing outright. Full kill is reserved for safety-critical cases (data leakage risk, unauthorized external calls).
4. The actual kill mechanism is boring on purpose: a feature flag that gates the agent entrypoint, backed by a fast-propagating config system (sub-second). Kubernetes pod kills are too slow when you need to stop something mid-execution.
The thing we learned the hard way: observability without automated circuit breakers is just watching a fire. Our first incident was a prompt loop that we could see clearly in traces but took 8 minutes to manually kill because the on-call had to figure out which deployment to roll back. Now the circuit breaker fires automatically and pages the human to decide whether to re-enable.
Biggest gap I still see: there's no good standard for "agent-level observability" the way we have for microservices. Traces help but they don't capture the semantic intent of what an agent was trying to do when it went off the rails.
- nordic_lion6 hours ago
  Solid breakdown. Yeah, the agent-level observability gap you call out is real.
  One direction been considering is to treat intent and plan state as first-class runtime signals (for example, intent spans and plan-step annotations alongside traces), so when agent goes offrail can at least see what it believed it was doing, not just which calls it made.
zachdotai8 hours ago
I found it more helpful to try and "steer" the LLM into self-correcting its action if I detect misalignment. This generally improved our task success completion rates by 20%.
- nordic_lion6 hours ago
  Where/how do you define the policy boundary line that triggers course correction?
  - zachdotai5 hours ago
    Basically through two layers. Hard rules (token limits, tool allowlists, banned actions) trigger an immediate block - no steering, just stop. Soft rules use a lightweight evaluator model that scores each step against the original task intent. If it detects semantic drift over two consecutive steps, we inject a corrective prompt scoped to that specific workflow.
    The key insight for us was that most failures weren't safety-critical, they were the agent losing context mid-task. A targeted nudge recovers those. Generic "stay on track" prompts don't work; the correction needs to reference the original goal and what specifically drifted.
    Steer vs. kill comes down to reversibility. If no side effects have occurred yet, steer. If the agent already made an irreversible call or wrote bad data, kill.
    nordic_lion2 hours ago
    One thing I’m still unclear on: what runtime signal is the soft-rule evaluator actually binding to when it decides “semantic drift”?
    In other words, what is the enforcement unit the policy is attached to in practice... a step, a plan node, a tool invocation, or the agent instance as a whole?