The approach they're taking here, working backwards from an open-source repo pull request and reverse-engineering a question from it, is unusually well thought out for a benchmark.
I haven't dug into many of the dataset questions yet, but the example they give in the blog post of a question generated from the Hugging Face Transformers repo gives me hope that this could actually be a solid benchmark:
> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
It would also be nice if the article clearly stated which model settings were used for Claude Code and Codex. Both of those allow changing the reasoning level, so if the benchmark was run with the default settings it seems a little unfair, given that they list a result for their own agent at high reasoning as a separate entry.