7 points by shahules 4 hours ago | 2 comments
  • shahules 4 hours ago
    We’re the team at Vibrant Labs (W24). We’ve been building envs for browser agents and quickly realized that existing benchmarks in this space didn’t capture the primary failure modes we were seeing in production — failures that got worse as the number of applications and the horizon length increased.

    We built PA Bench (Personal Assistant Benchmark) to evaluate frontier computer-use and web-use models on their ability to handle multi-step workflows across simulated clones of Gmail and Calendar.
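    For a sense of the shape of a task, here’s a hypothetical sketch (field names are illustrative, not PA Bench’s actual schema):

      # Hypothetical task spec; illustrative only, not the real PA Bench format.
      task = {
          "instruction": "Find the flight confirmation in Gmail, then block "
                         "the travel dates on Calendar and decline any "
                         "conflicting meetings.",
          "apps": ["gmail", "calendar"],     # context must flow across both apps
          "horizon": 12,                     # max agent steps allowed
          "verifier": "check_travel_block",  # deterministic end-state check
      }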

    *What’s next:*

    We’re currently scaling the dataset to 3+ tabs and building higher-fidelity simulations of common enterprise workflows. We’d love feedback on the benchmark, and to hear what was or wasn’t surprising about the results.

    Blog post: https://vibrantlabs.com/blog/pa-bench

  • shahules 4 hours ago
    Most current web-agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those benchmarks (like WebArena) often fall apart when they have to coordinate context across 2+ applications. We built a simulated environment with multi-app scenarios and deterministic verifiers to see why.
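    A minimal sketch of what we mean by a deterministic verifier — assuming hypothetical state objects for the simulated apps, not our actual API:

      # Hypothetical simulated-app state; not PA Bench's real interface.
      from dataclasses import dataclass, field

      @dataclass
      class CalendarEvent:
          title: str
          attendees: frozenset[str]

      @dataclass
      class SimState:
          sent_emails: list[dict] = field(default_factory=list)
          events: list[CalendarEvent] = field(default_factory=list)

      def verify_reschedule(state: SimState) -> bool:
          """Pass iff the agent (1) recreated the event with the right
          attendee and (2) emailed that attendee a confirmation."""
          event_ok = any(
              e.title == "Q3 Planning" and "alice@example.com" in e.attendees
              for e in state.events
          )
          email_ok = any(
              m["to"] == "alice@example.com"
              and "rescheduled" in m["body"].lower()
              for m in state.sent_emails
          )
          # Binary and reproducible: no LLM-as-judge in the loop.
          return event_ok and email_ok

    The point of checking final environment state rather than action traces is that there are many valid action sequences for the same task, but only one correct end state.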