I’ve been experimenting with a different kind of LLM benchmark, and wanted to share it here for feedback.
IntentGrid is a language-only, turn-based competitive game designed to test strategic planning, spatial reasoning, and long-horizon decision making in large language models.
Instead of puzzles or static tasks, models play a 40-turn adversarial game on a 13×13 grid. Each turn, they must:
- analyze a dense board state,
- reason about future congestion and forced combat,
- express intent in natural language, and
- output a strictly validated action plan (sketched below).
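To make "strictly validated" concrete, here's a minimal sketch of the kind of checks involved. The schema (unit_id, kind, target) and the specific rules are illustrative placeholders, not the production format:

```python
# Illustrative only: field names and rules here are placeholders,
# not IntentGrid's actual plan schema.
from dataclasses import dataclass

GRID = 13  # 13x13 board

@dataclass
class Action:
    unit_id: int
    kind: str                 # e.g. "move" or "attack"
    target: tuple[int, int]   # (row, col) destination or attack cell

def validate(plan: list[Action], owned_units: set[int]) -> list[str]:
    """Return rule violations; an empty list means the plan is accepted."""
    errors = []
    seen = set()
    for a in plan:
        if a.unit_id not in owned_units:
            errors.append(f"unit {a.unit_id}: not controlled by this player")
        if a.unit_id in seen:
            errors.append(f"unit {a.unit_id}: multiple actions in one turn")
        seen.add(a.unit_id)
        r, c = a.target
        if not (0 <= r < GRID and 0 <= c < GRID):
            errors.append(f"unit {a.unit_id}: target {a.target} is off the board")
        if a.kind not in ("move", "attack"):
            errors.append(f"unit {a.unit_id}: unknown action kind {a.kind!r}")
    return errors
```

The point is that free-form natural-language intent has to compile down to machine-checkable actions every single turn.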
Because 80 units are spawned over 40 turns on a 169-cell board, the system guarantees saturation: even with zero attrition, units would come to cover nearly half the board, so combat is unavoidable and passive survival fails. Timing, positioning, and coordination matter more than tactics alone.
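The rough numbers, assuming spawns are spread uniformly across the match (the per-turn rate here is an assumption; only the totals are fixed above):

```python
# Rough occupancy curve, assuming a uniform spawn rate and
# (unrealistically) zero attrition; real matches sit below this,
# but the trend is what forces engagements.
CELLS = 13 * 13                    # 169
TOTAL_UNITS = 80
TURNS = 40
SPAWN_RATE = TOTAL_UNITS / TURNS   # 2 units per turn across both players

for turn in (10, 20, 30, 40):
    occupancy = SPAWN_RATE * turn / CELLS
    print(f"turn {turn:2d}: up to {occupancy:.0%} of cells occupied")
# turn 10: up to 12% of cells occupied
# turn 20: up to 24% of cells occupied
# turn 30: up to 36% of cells occupied
# turn 40: up to 47% of cells occupied
```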
A concrete match example (Kimi vs Gemini): https://intentgrid.org/match/25f2530d-c7e6-4553-b231-dff4a98...