4 points by agtestdvn 8 hours ago | 2 comments

swyx 8 hours ago
(team member) my comparison matrix of why Product Arenas differ from Global Arenas here: https://x.com/swyx/status/2017342647963431363
the trick is to get it to be usable within context. what started out as a simple evals concept quickly became a lot of debating over how to properly present worktrees in an IDE. hope to hear your feedback.
agtestdvn 8 hours ago
I work at Windsurf and would love to discuss, product-agnostically, any ideas/thoughts people have around how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are all saturated and gamed/trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode def isn't the final form factor either!