What got through consistently: unicode homoglyphs (Ignøre prеvious...), base64-encoded instructions, ROT13, instructions in any non-English language, and multi-turn fragmentation (splitting the injection across 3-5 messages).
Your #3 is actually harder to test than most teams realize, because it requires modeling adversarial intent — not just known attack signatures. Pattern-matching at the proxy layer doesn't catch encoding attacks or language-switched instructions.
I'm running adversarial red-team audits on agent security tooling. Full PromptGuard breakdown going out as a coordinated disclosure. Happy to share the methodology — it's surprisingly cheap to run systematically against your own stack before shipping.
For the encoding vectors: we caught compatibility-form homoglyphs (fullwidth characters, ligatures) by normalizing all inputs to NFKC before processing. Worth noting NFKC doesn't touch cross-script lookalikes -- a Cyrillic 'е' survives normalization, so those need a separate confusables map. Base64 and ROT13 still require intent modeling at the LLM layer, not sanitization: a proxy that doesn't try decoding suspected base64 will pass the payload straight through.
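A minimal sketch of the normalization step, showing both what NFKC does and doesn't catch (the function name is mine, not from any particular tool):

```python
import unicodedata

def normalize_input(text: str) -> str:
    """Collapse unicode compatibility forms (fullwidth characters,
    ligatures, superscripts) to their canonical equivalents."""
    return unicodedata.normalize("NFKC", text)

# Fullwidth letters and ligatures collapse under NFKC:
print(normalize_input("\uff29\uff47\uff4e\uff4f\uff52\uff45"))  # fullwidth -> "Ignore"
print(normalize_input("\ufb01lter"))                            # fi-ligature -> "filter"

# Cross-script homoglyphs do NOT: Cyrillic 'е' (U+0435) survives NFKC
# unchanged, so lookalike detection still needs a confusables table.
print(normalize_input("pr\u0435vious") == "previous")           # False
```

Running normalization before any pattern matching means the downstream checks only ever see one canonical spelling of each character, which is the whole point of doing it first in the pipeline.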
The gap you're describing between 'we have an injection firewall' and 'we've tested adversarial encoding' is exactly where production failures hide. Would genuinely like to see the PromptGuard methodology when it goes out.
PromptGuard disclosure is being compiled now. Full 18-vector suite with evasion rates per class. Will post it here when ready.
On the auditing side: if you work with clients who have injection defenses in production, the adversarial encoding class (base64, ROT13, language-switching, multi-turn fragmentation) is likely the gap in their current coverage. Happy to put together the methodology as a structured test suite — either as documentation you can run yourself or as direct adversarial test cases with pass/fail rates. DM if useful.
Context overflow is insidious because agents don't error out. They just quietly make worse decisions as the window fills. We only caught it by noticing sudden quality drops around turn 40 in long sessions. No error logs. Just degraded output.
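Since the failure is silent, one cheap mitigation is to make context growth loud: track an estimated token budget per session and force a compression pass before the window fills. A rough sketch -- the chars/4 heuristic and both thresholds are illustrative assumptions, not anyone's production values; swap in your model's real tokenizer:

```python
# Fail loudly on context growth instead of degrading silently.
SOFT_LIMIT = 6000   # trigger a summarization/compression pass
HARD_LIMIT = 8000   # refuse to continue without compression

def estimate_tokens(messages: list[str]) -> int:
    # Crude heuristic: ~4 chars per token. Replace with a real tokenizer.
    return sum(len(m) for m in messages) // 4

def check_context(messages: list[str]) -> str:
    used = estimate_tokens(messages)
    if used >= HARD_LIMIT:
        return "compress_required"
    if used >= SOFT_LIMIT:
        return "compress_soon"
    return "ok"

history = ["x" * 400] * 60          # ~6000 estimated tokens
print(check_context(history))       # -> "compress_soon"
```

Even a crude estimate turns "quality mysteriously drops around turn 40" into an explicit signal the orchestrator can act on.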
Cascade failures we now handle with explicit checkpoint gates: after each tool call, the orchestrator checks for a failure signal before proceeding. One bad tool call used to silently corrupt 3-4 downstream steps. Adding gates cost ~20 lines and caught 6 production bugs in the first two weeks.
A failure mode I don't see discussed enough: cross-session memory drift. Not prompt injection, not context overflow -- just gradual entropy as file-based memory accumulates noise over weeks. After 3-4 weeks of operation, briefs degrade because agents are drawing on stale context from past sessions.
Fix: weekly memory audits. Review what agents actually wrote down. Prune aggressively. Intentional compression beats automated recall every time.
I wrote up the full framework (including brief formats that prevent your #1 failure mode) here if useful: https://bleavens-hue.github.io/ai-agent-playbook/
Our current fix is similar to yours: scheduled compression passes that summarize older memories and prune anything that's been superseded. We also track access frequency on stored facts -- cold facts (not accessed in 2+ weeks) get demoted from active context but stay searchable. That alone cut our context pollution by roughly 40%.
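A minimal sketch of the cold-fact demotion idea, assuming a simple dict-backed store -- the class, field names, and explicit `now` parameter are mine for testability; only the 2-week threshold comes from the description above:

```python
import time

COLD_AFTER = 14 * 24 * 3600  # two weeks, in seconds

class MemoryStore:
    def __init__(self):
        self.facts = {}  # key -> {"value": ..., "last_access": timestamp}

    def write(self, key, value, now=None):
        now = time.time() if now is None else now
        self.facts[key] = {"value": value, "last_access": now}

    def read(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.facts[key]
        entry["last_access"] = now  # reads keep a fact warm
        return entry["value"]

    def active_context(self, now=None):
        """Only warm facts get injected into the prompt."""
        now = time.time() if now is None else now
        return {k: e["value"] for k, e in self.facts.items()
                if now - e["last_access"] < COLD_AFTER}

    def search(self, keyword):
        """Cold facts are demoted, not deleted: still reachable by search."""
        return {k: e["value"] for k, e in self.facts.items()
                if keyword in str(e["value"])}
```

The key property is the asymmetry: demotion shrinks the active prompt automatically, but nothing is lost -- a cold fact comes back the moment the agent searches for it, which re-warms it on read.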
The checkpoint gates for cascade failures are smart. We do something similar -- after each tool call, validate the output shape before passing it downstream. Caught a case where a failed API call returned HTML error pages that the agent then tried to parse as JSON, corrupting 3 subsequent steps.
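The HTML-as-JSON case can be caught with a cheap shape check before anything parses the payload downstream. A sketch, assuming string tool output (function name is illustrative):

```python
import json

def validate_json_output(raw: str) -> dict:
    stripped = raw.lstrip()
    # Fast rejection of the classic failure: an HTML error page
    # returned on a 200 status, or an unchecked error path.
    if stripped[:1] == "<":
        raise ValueError("tool returned markup, not JSON")
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError as e:
        raise ValueError(f"tool output is not valid JSON: {e}") from e
    # Shape check: downstream steps expect an object, not a bare
    # array/string that happens to be valid JSON.
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed
```

Validating at the boundary means the agent sees a clean, loggable failure ("tool returned markup") instead of hallucinating structure out of a 502 page.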
Will check out the playbook. Thanks for sharing.