The trick is making it usable within context. What started out as a simple evals concept quickly turned into a lot of debate over how to properly present worktrees in an IDE. Hope to hear your feedback.
I work at Windsurf and would love to discuss, in a product-agnostic way, any ideas or thoughts people have on how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are all saturated and gamed or trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode def isn't the final form factor either!