6 points by huss97 2 hours ago | 1 comment
  • nkko 2 hours ago
    FWIW I work at Steel (not the OP). While we’ve been iterating on the “right shape” for agent tooling, I’ve been building a benchmark harness to measure how different surfaces affect real web task completion: raw API context, CLI-only, opinionated “skills” (structured outputs + artifact capture), and combinations.

    If you’ve run agents on the open web, I’d love suggestions for nasty-but-representative workflows to include in the benchmark.