5 pointsby ramoz7 hours ago1 comment
  • oliver_dr6 hours ago
    We've been dealing with this at multiple layers. Here's what actually works in production:

    Input-side (preventing injection):

    - Strict input sanitization with role-boundary enforcement in the system prompt. Sounds basic, but most people skip it.

    - Separate "user content" from "system instructions" at the API level. Don't concatenate untrusted input into your system prompt. Use the dedicated `user` role in the messages array.

    - For tool-calling agents, validate that tool arguments match expected schemas before execution. An LLM-as-judge approach for tool call safety is expensive but effective for high-stakes actions.

    Output-side (catching when injection succeeds):

    This is the part most people underinvest in. Even with perfect input filtering, you still need output guardrails:

    - Run the LLM output through evaluation metrics that score for factual correctness, instruction adherence, and safety before it reaches the user.

    - For RAG systems specifically, verify that the generated answer is actually grounded in the retrieved context, not fabricated or influenced by injected instructions.

    The "defense in depth" framing matters here. Input filtering alone has a ceiling because adversarial prompts evolve faster than regex rules. Output evaluation catches the failures that slip through. We use DeepRails' Defend API for this layer - it scores outputs on correctness, completeness, and safety, then auto-remediates failures before they reach end users. But the principle applies regardless of tooling: treat output verification as a first-class concern, not an afterthought.

    Simon Willison's work on dual-LLM patterns is also worth reading if you haven't: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/