Show HN: Open-source playground to red-team AI agents with exploits published(github.com)

19 pointsby zachdotai3 hours ago4 comments

hellocr72 hours ago
I have tried to manipulate it using base64 encoding and translaion into other languages which didnt work so far but seems to be that llm as a judge is a very fragile defence for this. Would be cool to add a leaderboard though
- zachdotaian hour ago
  Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.
  You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
  Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...
  The leaderboard should start populating once we have more submissions!
agentpiravi2 hours ago
[dead]
3 hours ago
undefined
spranab40 minutes ago
[dead]