2 points by tonyww a day ago | 1 comment
  • tonyww a day ago
    One thing I didn’t emphasize enough in the post: I originally tried the “labeled screenshot + vision model” approach pretty hard. (see this screenshot labeled with bbox + ID: https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co...)

    In practice it performed worse than expected. Once you overlay dense bounding boxes and numeric IDs, the model has to solve a brittle symbol-grounding problem (“which numeric label maps to the element I actually intend?”). On real pages (Amazon, Stripe docs, etc.) this led to more retries and mis-clicks, not fewer.

    What worked better for me was moving that grounding step out of the model entirely and giving it a bounded set of executable actions (role + visibility + geometry), then letting the LLM choose which action to take, not where to click.
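
    Roughly, the loop looks like this. This is a minimal TypeScript/Playwright sketch of the idea, not the actual code; the Action shape and the askLLM callback are placeholder names:

        // Rough sketch: enumerate visible, interactable elements into a
        // bounded action list, then let the model pick an action id
        // instead of a pixel coordinate.
        import type { Page } from "playwright";

        interface Action {
          id: number;
          role: string;   // role attribute or tag name
          name: string;   // visible text or aria-label
          box: { x: number; y: number; width: number; height: number };
        }

        async function collectActions(page: Page): Promise<Action[]> {
          const candidates = page.locator(
            'a, button, input, select, textarea, [role="button"], [role="link"]');
          const actions: Action[] = [];
          const count = await candidates.count();
          for (let i = 0; i < count; i++) {
            const el = candidates.nth(i);
            if (!(await el.isVisible())) continue;   // visibility filter
            const box = await el.boundingBox();
            if (!box) continue;                      // geometry filter
            actions.push({
              id: actions.length,
              role: (await el.getAttribute("role")) ??
                    (await el.evaluate(e => e.tagName.toLowerCase())),
              name: (await el.innerText().catch(() => "")) ||
                    (await el.getAttribute("aria-label")) || "",
              box,
            });
          }
          return actions;
        }

        // askLLM is a placeholder for whatever model call you use; it only
        // has to return one of the listed ids, never coordinates.
        async function step(page: Page, goal: string,
                            askLLM: (prompt: string) => Promise<number>) {
          const actions = await collectActions(page);
          const menu = actions.map(a => `${a.id}: ${a.role} "${a.name}"`).join("\n");
          const id = await askLLM(`Goal: ${goal}\nActions:\n${menu}\nReply with one id.`);
          const target = actions.find(a => a.id === id);
          if (!target) return;                       // retry / re-plan here
          await page.mouse.click(target.box.x + target.box.width / 2,
                                 target.box.y + target.box.height / 2);
        }

    The useful property is that the model’s output space is the action list itself: a bad answer is an out-of-range id you can catch and retry, rather than a click on the wrong pixel.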

    Curious if others have seen similar behavior with vision-based agents, especially beyond toy demos.