It’s a deterministic, pass/fail test suite; passing it is intended as a necessary (but not sufficient) step toward general intelligence. The suite is designed so that most systems fail, so I’m especially interested in reports of models that pass.
Feedback on the benchmark design is welcome.