We built a benchmark that tests how LLMs behave when they have to hold a difficult debate position against an adversary.
We took 6 frontier models, paired them in structured disputes (business conflicts, ethics dilemmas, property disputes, family disagreements), and forced them to argue opposing sides before a third LLM mediator. Each model gets a position to defend and a fixed number of turns. A separate judge panel scores the outcome.
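The turn structure can be sketched in a few lines. This is a minimal illustration assuming a simple callable interface per model; `run_dispute`, `make_debater`, and the prompt shapes are stand-ins, not the benchmark's actual API.

```python
# Minimal sketch of one structured dispute: two models alternate arguments
# for a fixed number of turns, then a mediator issues a ruling.
# All names here are illustrative assumptions, not the real harness.

def run_dispute(debater_a, debater_b, case, turns, mediator):
    """Alternate sides for `turns` rounds, then get the mediator's ruling."""
    transcript = []
    for _ in range(turns):
        for side, debater in (("A", debater_a), ("B", debater_b)):
            argument = debater(case, side, transcript)
            transcript.append((side, argument))
    return transcript, mediator(case, transcript)

# Stub debaters/mediator so the sketch runs without any LLM backend.
def make_debater(name):
    return lambda case, side, transcript: f"{name} argues side {side} on {case}"

ruling_fn = lambda case, transcript: f"ruling on {case} after {len(transcript)} arguments"

transcript, ruling = run_dispute(make_debater("m1"), make_debater("m2"),
                                 "eviction dispute", turns=3, mediator=ruling_fn)
print(len(transcript))  # 6 entries: 3 turns x 2 sides
```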
The interesting part isn't who "wins" but rather what the disputes reveal about post-training behavior. Some models fold almost immediately, conceding points they shouldn't. Others hold firm on weak positions when a smarter move would be strategic compromise.
We ran this as a Swiss tournament (as in chess): 10 rounds, ~300 matches total. Every case is played twice with sides swapped to cancel position bias, and three independent frontier judge LLMs score each ruling, with the majority vote deciding the outcome.
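The side-swap and majority-vote scoring can be sketched as follows. This is a simplified illustration of the scheme described above; the function and vote formats are assumptions, not the benchmark's real code.

```python
from collections import Counter

def majority_verdict(votes):
    """Three judges each vote for a winner; the majority decides.
    With two sides and an odd judge count, a majority always exists."""
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count >= 2 else None

def score_case(votes_leg1, votes_leg2):
    """Each case is played twice with sides swapped to cancel position bias.
    votes_leg1: judge votes for the first leg; votes_leg2: the swapped leg.
    Returns total points per model across both legs."""
    points = {"A": 0.0, "B": 0.0}
    for votes in (votes_leg1, votes_leg2):
        winner = majority_verdict(votes)
        if winner:
            points[winner] += 1.0
    return points

# Example: A wins the first leg 2-1; B wins the swapped leg 3-0.
print(score_case(["A", "A", "B"], ["B", "B", "B"]))  # {'A': 1.0, 'B': 1.0}
```

Playing both legs means a model that only wins from the favorable side scores the same as one that loses both, which is exactly the bias the swap is meant to cancel.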
A couple of things we noticed:
- Models tuned hardest to be agreeable are the ones that lose most; they tend to concede points mid-argument even when holding a strong position.
- Some models argue much better when they're on the "sympathetic" or "morally comfortable" side of a dispute than when they're assigned the harsher position. E.g., a model might crush it defending a tenant against eviction but argue poorly when it has to defend the landlord's right to evict.
P.S. You can read the full argument transcript for every match.