Seeing the hard-task numbers here makes that make a lot more sense.
Honestly the more interesting thing to me is the benchmark critique. WebVoyager being the default eval while only agreeing with humans 62% of the time is kind of damning for the whole space. Has anyone else tried running their agent against Online-Mind2Web?
That suggests the architecture handles state accumulation across steps without compounding errors — which is the thing that kills most agent pipelines. Every other agent here shows exponential degradation as task length increases, which is what you'd expect from a naive screenshot-action loop with no error recovery.
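To make the compounding point concrete, here's a back-of-the-envelope calculation (the per-step success rates and step counts are made up for illustration, not taken from the benchmark): if every action has to land and there's no recovery, end-to-end success decays geometrically with task length.

```python
# Toy numbers: if each browser action succeeds independently with
# probability p and a task needs n actions, end-to-end success is p**n.
for p in (0.99, 0.95, 0.90):
    for n in (5, 15, 30):  # rough stand-ins for short vs. long tasks
        print(f"per-step {p:.2f}, {n:2d} steps -> task success {p**n:.2%}")
```

Recovery matters because it breaks that assumption: a missed click becomes a retry instead of a failed task.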
Looking at the cookbook repo — are you doing any kind of structured DOM extraction before passing to the model, or is this pure vision? Curious whether the hard-task performance comes from better perception, better planning, or better recovery when an action doesn't produce the expected state change.
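For context on what I mean by structured extraction, here's a rough Playwright sketch that flattens the page's interactive elements into a compact list before anything goes to the model. The selectors and fields are my own illustration, not anything from the cookbook repo:

```python
# Hypothetical sketch of "structured DOM extraction": collect the elements
# an agent can plausibly act on, with enough context (tag, text, attributes)
# for the model to ground its actions, instead of relying on screenshots alone.
from playwright.sync_api import sync_playwright

def extract_interactive_elements(page):
    return page.eval_on_selector_all(
        "a, button, input, select, textarea, [role='button']",
        """els => els.map((el, i) => ({
            index: i,
            tag: el.tagName.toLowerCase(),
            text: (el.innerText || el.value || '').trim().slice(0, 80),
            ariaLabel: el.getAttribute('aria-label'),
            href: el.getAttribute('href'),
        }))""",
    )

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    for el in extract_interactive_elements(page):
        print(el)
    browser.close()
```

Purely a sketch of the idea; I'm mostly curious whether something in this vein is part of the pipeline or whether it's screenshots all the way down.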
Operator goes from 83% easy → 43% hard. That's a 40-point cliff.
Claude Computer Use: 90% easy → 32% hard. 58-point drop.
Browser Use: 55% easy → 8% hard. A 47-point drop; it just falls off a cliff entirely.
TinyFish: 97.5% easy → 81.9% hard. 15.6-point drop.
The gap between easy and hard is where you see if a system actually works or if it's just good at simple tasks. Every other agent loses roughly half of its easy-task score or more when tasks get complex. We lose 15.6 points.
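To put "half or more" in concrete terms, here's the hard score as a fraction of the easy score, computed straight from the numbers above:

```python
# Hard-task score as a fraction of easy-task score, from the figures above.
scores = {
    "Operator": (83.0, 43.0),
    "Claude Computer Use": (90.0, 32.0),
    "Browser Use": (55.0, 8.0),
    "TinyFish": (97.5, 81.9),
}
for name, (easy, hard) in scores.items():
    print(f"{name:20s} retains {hard / easy:.0%} of its easy-task performance")
```

TinyFish retains roughly 84% of its easy-task performance; the others retain between 15% and 52%.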
That's the difference between "cool demo" and "I can actually ship this."
The failure traces being public is a nice touch. Looked through a few and they're actual failures, not cherry-picked easy ones. Most companies in this space wouldn't do that.
Curious about latency, though: what does a typical hard-task execution look like in terms of wall-clock time?
A lot of these sites serve different layouts, A/B tests, cookie consent modals, etc. across sessions. Did you control for that across agents, or is each agent hitting the live site independently at different times?
If so, some of the variance between agents could just be "Operator happened to get the GDPR popup and didn't know how to dismiss it." It would be useful to know whether all agents were evaluated against the same snapshots or within the same time window.
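For what it's worth, one way to get "same snapshot" comparisons is to record each task's traffic once and replay it for every agent. A rough Playwright sketch, purely illustrative of the idea (the paths and URL are placeholders, and it only works for agents you can point at a controlled browser context):

```python
# Record a site's traffic once, then replay the same responses for every
# agent run, so variance from A/B tests and consent modals is removed.
from playwright.sync_api import sync_playwright

HAR_PATH = "example_task.har"        # placeholder path
START_URL = "https://example.com"    # placeholder URL

def record_snapshot():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(record_har_path=HAR_PATH)
        page = context.new_page()
        page.goto(START_URL)
        # ... drive the task once manually or with a reference script ...
        context.close()  # the HAR file is written when the context closes
        browser.close()

def replay_snapshot():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        # Serve the recorded responses instead of hitting the live site.
        context.route_from_har(HAR_PATH, not_found="abort")
        page = context.new_page()
        page.goto(START_URL)
        # ... hand this page/context to the agent under test ...
        browser.close()
```

It's not perfect (anything time- or session-dependent can still drift), but it at least takes the GDPR-popup lottery out of the comparison.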