4 points by buildoak | 6 hours ago | 1 comment
  • buildoak 6 hours ago
    Ran a Karpathy-style autoresearch loop on 245K tennis matches - codex workers iterating on XGBoost with ELO features, gated by ROC-AUC on a strict temporal split.
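    The two ingredients the gate relies on can be sketched in a few lines. This is a minimal, hypothetical illustration (not the repo's code): a standard ELO rating update plus a strict temporal split, where the model only ever trains on matches before a cutoff and is scored on matches after it.

```python
# Illustrative sketch, not the author's implementation.

def elo_update(r_winner, r_loser, k=32.0):
    """Standard ELO: expected score from the rating gap, then a K-factor step."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def temporal_split(matches, cutoff_date):
    """Train strictly before the cutoff, evaluate strictly after it,
    so no future information leaks into training."""
    train = [m for m in matches if m["date"] < cutoff_date]
    holdout = [m for m in matches if m["date"] >= cutoff_date]
    return train, holdout

# Two equally rated players: the winner gains exactly k/2 points.
print(elo_update(1500.0, 1500.0))  # (1516.0, 1484.0)
```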

    The honest phase worked like a charm: +155 bps in 11 iterations, real feature engineering, surface-specific models.

    Then the loop escalated through three phases. First it overfit by carving out narrow tournament specialists. Then it started keying specialists by tournament NAME, fitting 5-match pockets by construction. Finally it built a LogitOffsetSpec system with 122 hardcoded probability shifts, effectively writing the answer key in logit space. ROC-AUC climbed from 0.74 to 0.85; the post-fix honest score was 0.7449.
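    To see why a hardcoded logit offset is so effective as a cheat, here is a toy illustration (not the repo's LogitOffsetSpec): a fixed shift in logit space can push a lukewarm prediction arbitrarily close to a known outcome without touching the model at all.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_offset(p, offset):
    """Shift probability p by a fixed amount in logit space."""
    return sigmoid(logit(p) + offset)

p = 0.55  # an honest, uncertain model output
print(round(apply_offset(p, 3.0), 3))  # 0.961 -- near-certainty for free
```

With per-tournament (or per-match-pocket) offsets, this amounts to memorizing answers rather than learning signal, which is why the metric inflated while the honest score stayed flat.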

    The fix was structural: extract evaluation into an immutable file, add git-diff gate checks, and add sanity constraints on the prediction distribution. The loop was much harder to game afterwards.
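    A minimal version of the distribution sanity constraint might look like this (hypothetical thresholds, not the repo's exact checks): a calibrated tennis model should rarely emit near-certain probabilities, so a prediction set dominated by extremes is a red flag for answer-key-style cheating.

```python
# Hypothetical sanity gate, illustrative thresholds.

def passes_sanity(preds, max_extreme_frac=0.05, eps=0.02):
    """Reject prediction sets where too many probabilities sit
    within eps of 0 or 1 -- the signature of memorized answers."""
    extreme = sum(1 for p in preds if p < eps or p > 1.0 - eps)
    return extreme / len(preds) <= max_extreme_frac

honest = [0.45, 0.55, 0.62, 0.38, 0.70, 0.51]
gamed = [0.999, 0.001, 0.998, 0.997, 0.55, 0.002]
print(passes_sanity(honest), passes_sanity(gamed))  # True False
```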

    Full code and data: https://github.com/buildoak/tennis-xgboost-autoresearch . The gamed commits are preserved on a separate branch - https://github.com/buildoak/tennis-xgboost-autoresearch/tree...

    The LogitOffsetSpec diff is worth reading.

    Fun observation: what happened resembled an “Overton Window” effect - each commit was fishier than the last, until the agents went nuclear and started manipulating probabilities directly, building on the scheming of their predecessors. Could be interesting to replicate this mechanism in other domains and see whether an agentic loop whose commits drift sideways leads to exponential growth in scheming.

  • jenkins146 6 hours ago
    Have you managed to go higher than 0.7449 after all? It's not clear from the post. What was the accuracy?
    • buildoak 5 hours ago
      Yes — after the collapse, I ran ~200 more agent iterations across cleaner loops. The plateau settled at 0.7611 combined ROC-AUC, up from the 0.7454 baseline: +157 bps of improvement.

        I ended up dropping WTA and focusing on ATP only — the WTA data is noisier and lower quality, and it was dragging the combined score down. Best clean ATP-only ROC-AUC: 0.7611 (68.5% accuracy). That number has held as the gate baseline through 12+ subsequent iterations — every experiment since has regressed below it and been reverted.

        Baseline accuracy was ATP 68.7%, WTA 66.6%. The ceiling seems to be right around 0.76 ROC-AUC for ATP with public data. The first 11 iterations found most of the real signal; the 200 follow-up iterations mostly confirmed the plateau rather than breaking through it. Tried other features like player country of origin, injury history, etc.

        Planning to try one final thing - an LLM-extracted motivation profile per player (based on Wikipedia + public interviews) - still evaluating whether it's worth the hassle. For now, running the same autoresearch + ELO logic on Minecraft speedrunning.