2 points by hardikvora 5 hours ago | 1 comment
  • hardikvora 5 hours ago
    built a benchmark that tests how well language models solve chess puzzles (mate in 1, mate in 2, forks, pins, hanging pieces).

    it's open source and I'd love to see people build on it! it basically measures the spatial capabilities of these models. I can't afford to run more tests right now (the grok models cost soooo much to run, but they're also turning out to be the best at this), so I'm open to suggestions and would love to chat about it :)
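For readers wondering how per-category results for puzzle types like mate in 1 or fork might be tallied, here is a rough sketch. It is not taken from the benchmark's code: the puzzle record layout (`id`, `theme`, `solution`), the function names, and the sample positions' moves are all assumptions made up for illustration, and it grades by plain string match on the move in algebraic notation.

```python
from collections import defaultdict


def normalize_move(move: str) -> str:
    """Strip whitespace and trailing check/mate/annotation marks,
    so that 'Qh7#' and ' Qh7 ' compare equal."""
    return move.strip().rstrip("+#!?")


def score_by_theme(puzzles, answers):
    """Return {theme: fraction of puzzles solved}.

    puzzles: list of dicts with hypothetical keys 'id', 'theme', 'solution'.
    answers: dict mapping puzzle id to the move the model replied with.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for puzzle in puzzles:
        total[puzzle["theme"]] += 1
        answer = answers.get(puzzle["id"], "")  # missing answer counts as wrong
        if normalize_move(answer) == normalize_move(puzzle["solution"]):
            solved[puzzle["theme"]] += 1
    return {theme: solved[theme] / total[theme] for theme in total}


# Made-up sample data, only to show the shape of the inputs.
puzzles = [
    {"id": "p1", "theme": "mate_in_1", "solution": "Qh7#"},
    {"id": "p2", "theme": "mate_in_1", "solution": "Re8#"},
    {"id": "p3", "theme": "fork", "solution": "Nc7+"},
]
answers = {"p1": "Qh7", "p2": "Rd8#", "p3": " Nc7+ "}

scores = score_by_theme(puzzles, answers)
print(scores)
```

String matching only accepts one listed solution per puzzle. A position with several mating moves would need a real legality and checkmate check, for example with a chess library, which this sketch leaves out.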