In practice it performed worse than expected. Once you overlay dense bounding boxes and numeric IDs, the model has to solve a brittle symbol-grounding problem (“which number corresponds to intent?”). On real pages (Amazon, Stripe docs, etc.) this led to more retries and mis-clicks, not fewer.
What worked better for me was moving that grounding step out of the model entirely: extract a bounded set of executable actions up front (filtered by role, visibility, and geometry), then let the LLM choose which action to take, not where to click.
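Concretely, something like the sketch below. This is rough pseudocode of the pattern, not my actual implementation; the names (`Action`, `candidate_actions`, `pick_action`, `call_llm`) are made up, and in practice the element list would come from a DOM or accessibility-tree snapshot via whatever browser driver you use.

```python
from dataclasses import dataclass

@dataclass
class Action:
    role: str          # e.g. "button", "link", "textbox"
    name: str          # accessible name / visible label
    selector: str      # stable locator used by the executor, never shown to the model
    visible: bool
    width: float
    height: float

def candidate_actions(elements: list[Action]) -> list[Action]:
    """Grounding happens here, deterministically: keep only elements that are
    plausibly interactable (visible, non-zero size, known role)."""
    ok_roles = {"button", "link", "textbox", "checkbox", "combobox"}
    return [e for e in elements
            if e.visible and e.role in ok_roles and e.width > 1 and e.height > 1]

def build_prompt(goal: str, actions: list[Action]) -> str:
    """The model only ever sees a short numbered menu of labeled actions,
    not a screenshot overlaid with boxes and IDs."""
    menu = "\n".join(f"{i}. {a.role}: {a.name!r}" for i, a in enumerate(actions))
    return (f"Goal: {goal}\n"
            f"Choose exactly one action by number:\n{menu}\n"
            f"Answer with the number only.")

def pick_action(goal: str, actions: list[Action], call_llm) -> Action:
    """call_llm is a stand-in for any model client: prompt string in, text reply out."""
    reply = call_llm(build_prompt(goal, actions)).strip()
    idx = int(reply)           # in practice: validate and retry on bad output
    return actions[idx]        # the executor then acts via .selector, not coordinates
```

The point is that the model never emits coordinates or grounds numeric overlays; if it picks a bad action you get a clean, retryable error instead of a mis-click on the wrong pixel.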
Curious if others have seen similar behavior with vision-based agents, especially beyond toy demos.