I know LLama folks have a special Guard model for example, which I imagine is for such tasks.
So my ignorant questions are this:
Do these MCP endpoints not run such guard models, and if so why not?
If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?
Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.
Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.
It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found