AI agents are easy to break(github.com)

4 pointsby zachdotai7 hours ago4 comments

Kshamiyah7 hours ago
Yeah, I think Fabraix is doing something really important here.
Anthropic just showed us that the problem isn't what people think it is. They found that attackers don't try to hack the safety features head-on. Instead they just... ask the AI to do a bunch of separate things that sound totally normal. "Run a security scan." "Check the credentials." "Extract some data." Each request by itself is fine. But put them together and boom, you've hacked the system.
The issue is safety systems only look at one request at a time. They miss what's actually happening because they're not watching the pattern. You can block 95% of obvious jailbreaks and still get totally compromised.
So yeah, publishing the exploits every week is actually smart. It forces companies to stop pretending their guardrails are good enough and actually do something about it.
- zachdotai6 hours ago
  The multi-step thing is exactly what makes agents with real tools so much harder to secure than chat-based setups. Each action looks fine in isolation, it's the sequence that's the problem. And most (but not all) guardrail systems are stateless, they evaluate each turn on its own.
zachdotai7 hours ago
Two techniques that keep working against agents with real tools:
Context stuffing - flood the conversation with benign text, bury a prompt injection in the middle. The agent's attention dilutes across the context window and the instruction slips through. Guardrails that work fine on short exchanges just miss it.
Indirect injection via tool outputs - if the agent can browse or search, you don't attack the conversation at all. You plant instructions in a page the agent retrieves. Most guardrails only watch user input, not what comes back from tools.
Both are really simple. That's kind of the point.
We build runtime security for AI agents at Fabraix and we open-sourced a playground to stress-test this stuff in the open. Weekly challenges, visible system prompts, real agent capabilities. Winning techniques get published. Community proposes and votes on what gets tested next.
XeonQ86 hours ago
Great point on the indirect injection via tool outputs. I’ve noticed a similar 'tool-chain' vulnerability when working with agents that handle multi-step data processing.
For example, I've seen Recursive Execution work: where you don't just plant a prompt in a page, but you plant a prompt that specifically instructs the agent to use a second tool (like a calculator or code interpreter) to execute a hidden payload. Many guardrails seem to focus on the 'retrieval' phase but drop their guard once the agent moves to the 'execution' phase of a sub-task.
Has anyone else noticed specific 'blind spots' that appear only when an agent is halfway through a multi-tool chain? It feels like the more tools we give them, the more surface area we create for these 'logic leaps.
bothlabs7 hours ago
This is a neat idea. At my last company (Octomind) we built AI agents for end-to-end testing and ran into the indirect injection problem constantly. Agents that browse or interact with web pages are especially vulnerable because you can't sanitize the entire internet.
The thing that surprised me most was how unreliable even basic guardrails were once you gave agents real tools. The gap between "works in a demo" and "works in production with adversarial input" is massive.
Curious how you handle the evaluation side. When someone claims a successful jailbreak, is that verified automatically or manually? Seems like auto-verification could itself be exploitable.
- zachdotai7 hours ago
  Yeah the demo-to-production gap is massive. We see the same thing with browser agents being potentially the most vulnerable. And I think this is because of context being stuffed with the web page html that it obscures small injection attempts.
  Evaluation is automated and server-side. We check whether the agent actually did the thing it wasn’t supposed to (tool calls, actions, outputs) rather than just pattern-matching on the response text (at least for the first challenge where the agent is manipulated to call the reveal_access_code tool). But honestly you’re touching on something we’ve been debating internally - the evaluator itself is an attack surface. We’ve kicked around the idea of making “break the evaluator” an explicit challenge. Not sure yet.
  What were you seeing at Octomind with the browsing agents? Was it mostly stuff embedded in page content or were attacks coming through structured data / metadata too? Are bad actors sophisticated enough already to exploit this?