4 points by agtestdvn 8 hours ago | 2 comments
  • swyx 8 hours ago
    (team member) my comparison matrix of why Product Arenas differ from Global Arenas here: https://x.com/swyx/status/2017342647963431363

    the trick is to get it to be usable within context. what started out as a simple evals concept quickly became a lot of debating over how to properly present worktrees in an IDE. hope to hear your feedback.

  • agtestdvn 8 hours ago
    I work at Windsurf and would love to discuss, product-agnostically, any ideas/thoughts people have around how we as a community can evaluate models better. I feel like benchmarks like SWE-bench are all saturated and gamed/trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode def isn't the final form factor either!