2 points by ootakamoku 4 hours ago | 1 comment
    I have developed an LLM benchmark. This project evaluates multiple aspects of large language models by forcing them to compete in well-known zero-sum games: Chess, Go, and Texas Hold'em.

    The primary experimental protocol strictly prohibits the models from accessing the current board state or a list of legal actions. They receive only sequential delta updates and must reconstruct the global game state autoregressively. Alongside each move, a model must also report a probability estimate that its move is legal under the rules.
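    Roughly, a single turn works like this (a simplified sketch: the prompt wording, JSON schema, and helper names here are illustrative, not the actual harness):

```python
import json

def build_prompt(history):
    """The model receives only the sequence of deltas; it must
    reconstruct the board state from this transcript on its own."""
    deltas = "\n".join(f"move {i + 1}: {m}" for i, m in enumerate(history))
    return (deltas + "\nReply as JSON: "
            '{"move": "<your move>", "p_legal": <0..1>}')

# Hypothetical reply from a model after seeing three deltas.
history = ["e4", "e5", "Nf3"]
reply = json.loads('{"move": "Nc6", "p_legal": 0.97}')

# The harness, which *does* track the true board, grades two things:
# 1) whether the move was actually legal, and
# 2) how well-calibrated p_legal was (input for the calibration pillar).
move_was_legal = reply["move"] in {"Nc6", "d6", "Nf6"}  # stand-in legality check
record = {"p_legal": reply["p_legal"], "legal": move_was_legal}
```

    The point of the sketch is the information asymmetry: the harness knows the true state, while the model sees only the transcript of deltas.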

    Performance across the entire benchmark is quantified using a Bradley-Terry model with global maximum-likelihood optimization. This mathematical model estimates the probability of one language model defeating another based on the outcomes of their direct matchups. To fully evaluate a model under these extreme cognitive constraints, the Bradley-Terry ratings are divided into three distinct pillars:

    Syntax Reliability: This metric isolates matchups that terminate prematurely due to syntax failures or illegal actions. It actively penalizes models that fail to maintain required constraints, generating a Bradley-Terry rating based strictly on format and rule adherence.

    Strategic Skill: To isolate true strategic reasoning independent of formatting errors, this metric evaluates games that conclude successfully without any syntax failures or illegal actions. It generates a Bradley-Terry rating that measures a model's capacity to outsmart opponents when both agents play without execution errors.

    Epistemic Calibration (Metacognition): Head-to-head ROC-AUC comparisons are computed from the move-legality predictions, then bootstrapped to simulate pairwise win/loss outcomes. These simulated records are fed into the Bradley-Terry model, producing a rating that ranks models strictly by self-awareness: their ability to detect when their internal state has become unreliable.
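    All three pillars rest on the same Bradley-Terry fit, so here is a minimal sketch of that step (plain gradient ascent on the log-likelihood; my production code differs in optimizer and scale, and the win counts below are made up for illustration):

```python
import math

def fit_bradley_terry(wins, n_models, iters=2000, lr=0.1):
    """Fit Bradley-Terry ratings by maximum likelihood.
    wins[i][j] = number of games model i won against model j.
    P(i beats j) = sigmoid(r_i - r_j)."""
    r = [0.0] * n_models
    for _ in range(iters):
        grad = [0.0] * n_models
        for i in range(n_models):
            for j in range(n_models):
                if i == j:
                    continue
                n_ij = wins[i][j] + wins[j][i]  # games played between i and j
                if n_ij == 0:
                    continue
                p_ij = 1.0 / (1.0 + math.exp(r[j] - r[i]))  # P(i beats j)
                # Gradient of the log-likelihood w.r.t. r_i for this pair.
                grad[i] += wins[i][j] - n_ij * p_ij
        r = [ri + lr * gi for ri, gi in zip(r, grad)]
        mean = sum(r) / n_models
        r = [ri - mean for ri in r]  # fix the gauge: ratings sum to zero
    return r

# Toy matchup matrix: model 0 dominates, model 2 loses most games.
ratings = fit_bradley_terry([[0, 8, 9], [2, 0, 7], [1, 3, 0]], n_models=3)
```

    For the calibration pillar, the bootstrapped ROC-AUC win/loss records simply replace the game-outcome counts in `wins`; the fitting step itself is unchanged.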

    Executing thousands of API requests through OpenRouter to achieve statistically significant confidence intervals requires substantial financial expenditure. I am currently seeking external funding to run additional matches; additional funding translates directly into tighter confidence intervals and the inclusion of new models on the leaderboard.

    I am available to answer questions regarding the experimental protocol or the mathematical frameworks utilized.