You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...
The leaderboard should start populating once we have more submissions!