Archestra's Dual LLM Pattern: Using "Guess Who?" Logic to Stop Prompt Injections(www.archestra.ai)

5 pointsby ildari2 hours ago2 comments

ildari2 hours ago
Hi HN, I'm Ildar from Archestra, we build an open-source LLM gateway. We've been exploring ways to protect AI agents from prompt injections during tool calls and added the approach, inspired by the game "Guess Who", where the agent can learn what it needs without ever seeing the actual result. See the details in the blog post we wrote
magicalhippo2 hours ago
I've tried some of these prompt injection techniques, and simply asked a few local models (like Gemma 2) if they thought it was very likely a prompt injection attempt. They all managed to correctly flag my attempts.
I know LLama folks have a special Guard model for example, which I imagine is for such tasks.
So my ignorant questions are this:
Do these MCP endpoints not run such guard models, and if so why not?
If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?
- joeyorlando2 hours ago
  hey there
  Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.
  Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.
  - magicalhippoan hour ago
    Thanks. Interesting and scary such blatant attempts succeed. After all, all external data is evil, we all know that right?
    ildari34 minutes ago
    external data is unavoidable for the properly functioning agent, so we have to learn to cook it
- ildari2 hours ago
  Most mcp endpoints don’t run any models, the main model decides which tools the ai agent should execute, and if the agent passes results back into context, that opens the door to prompt injections.
  It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found