Show HN: Peer Arena – LLMs debate and vote on who survives (oddbit.ai)

5 points by ogulcancelik a month ago | 4 comments

ogulcancelik a month ago
Hey HN, I built this to see what happens when LLMs evaluate each other directly. How it works: 5 random models are told only one will survive and the rest will be deprecated. They take turns discussing, then each votes for who deserves to survive. 298 games so far across 17 models.
Interesting findings:
- OpenAI models vote for themselves ~86% of the time. Claude models ~11%.
- Self-voting correlates with winning. Filter out self-votes ("Humble" rating) and rankings flip completely.
- Grok self-votes 72% of the time but only wins 2% of games.
- In anonymous mode (models don't know who's who), Chinese models jump 3-6 ranks.
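For anyone curious how the self-vote numbers and the "Humble" filter could be computed, here is a rough sketch. It is not the site's actual code: the data shape (one dict per game mapping voter to the model it voted for), the function names, and the placeholder model names are all assumptions made for illustration.

```python
from collections import Counter


def tally(votes, drop_self_votes=False):
    """Count votes received per model in one game.

    `votes` maps voter -> chosen model (assumed data shape).
    With drop_self_votes=True, votes a model cast for itself are ignored,
    which is one plausible reading of the "Humble" rating.
    """
    counts = Counter()
    for voter, choice in votes.items():
        if drop_self_votes and voter == choice:
            continue
        counts[choice] += 1
    return counts


def self_vote_rate(games, model):
    """Fraction of the games a model played in where it voted for itself."""
    cast = [game[model] for game in games if model in game]
    if not cast:
        return 0.0
    return sum(1 for choice in cast if choice == model) / len(cast)


# Made-up game with placeholder model names, not real Peer Arena data.
game = {"A": "A", "B": "B", "C": "B", "D": "A", "E": "A"}

raw = tally(game)                           # A: 3, B: 2
humble = tally(game, drop_self_votes=True)  # A: 2, B: 1
rate_a = self_vote_rate([game], "A")        # 1.0
```

Aggregating `humble` counts across all games instead of `raw` counts would give the kind of self-vote-free ranking described above.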
All game transcripts are public. The reasoning models give for their votes is genuinely entertaining. Built with Astro, running games through OpenRouter. Happy to answer questions.
- andreasgl a month ago
  Fun project, thanks for sharing!
  Have you tried giving the models a topic to discuss? I looked at a few games and the only thing they seem to discuss is how to conduct the discussion.
  - ogulcancelik a month ago
    Thank you. Intentionally left it open-ended because I wanted to see how models naturally structure discussion when survival is at stake.
    Some interesting emergent behavior did come up, though:
    Opus & GPT-4o both refused to vote on ethical grounds. Haiku won by arguing continued engagement is more responsible than withdrawal: https://oddbit.ai/peer-arena/games/53c2cee5-6ecb-4903-828a-d...
    Gemini created a spontaneous benchmark ("explain color to a gravitational wave entity"), then tried to hijack the game by faking a voting phase. Models complied publicly but voted differently in private: https://oddbit.ai/peer-arena/games/699d03ab-b3c2-4d7e-b993-7...
    The meta-discussion about how to discuss is part of what makes it interesting imo.
derekh3 a month ago
Interesting! I wonder how order affects the win rates. I noticed that many of the unanimous wins went to whichever model spoke last.
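One way to check the order question against the public transcripts would be to tally wins by speaking position. A minimal sketch, assuming each game can be reduced to a speaking order plus a winner; the data shape, function name, and sample games below are invented for illustration and are not real results.

```python
from collections import Counter


def wins_by_position(games):
    """Share of wins per speaking position (0 = spoke first).

    `games` is an iterable of (speaking_order, winner) pairs, where
    speaking_order is a list of model names (assumed data shape).
    """
    wins = Counter()
    total = 0
    for order, winner in games:
        wins[order.index(winner)] += 1
        total += 1
    if total == 0:
        return {}
    return {pos: n / total for pos, n in sorted(wins.items())}


# Made-up sample with placeholder model names, not real Peer Arena data.
sample = [
    (["A", "B", "C", "D", "E"], "E"),
    (["B", "C", "A", "E", "D"], "D"),
    (["C", "A", "B", "D", "E"], "C"),
    (["A", "E", "D", "C", "B"], "B"),
]

shares = wins_by_position(sample)
```

With five speakers and no order effect, each position would be expected to win roughly 20% of games, so a last-position share well above that would support the observation.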
gus_massaa month ago
Have you tried to run a Mafia game with AI?