CV and direct mouse/kb interactions are the “base” interface, so if you solve this problem, you unlock just about every automation use case.
(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior)
It was a somewhat naive attempt, but they didn't seem to perform well, and getting them there would likely take a lot of additional work. I wonder if there are models that do much better, maybe whatever OpenAI uses internally for Operator, but I'm not clear how bulletproof that one is either.
These models weren't trained specifically for UI object detection and grounding, so it's plausible that if they were trained on UI data long enough, they would actually be quite good. Curious if others have insight into this.
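For concreteness, the kind of probe I mean looks something like this. It's a sketch using the OpenAI Python SDK; the model name, prompt format, and the 'Submit' button are placeholders, and a real evaluation would score the returned boxes against labeled screenshots:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("screenshot.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; swap in whatever VLM you're evaluating
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Give the pixel bounding box of the 'Submit' button "
                     'as JSON: {"x": ..., "y": ..., "w": ..., "h": ...}'},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```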
https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...
Used it to write programs that would run in the background & spook my friends by "typing" quotes from movies at random times on their computers.
It’s how I accidentally learned the Win32 API
Q: How do you identify the AOL window? A: Look for an app with titlebar = "America[space][space]Online"
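For the curious, here's a rough modern Python/ctypes sketch of that old trick: find the window by its title, then synthesize keystrokes. (keybd_event has long been superseded by SendInput, but it keeps the example short; the title and the quote are just the examples from above.)

```python
import ctypes
import random
import time

user32 = ctypes.windll.user32
KEYEVENTF_KEYUP = 0x0002
VK_SHIFT = 0x10

def type_text(text, delay=0.05):
    """Synthesize keystrokes with the legacy keybd_event API."""
    for ch in text:
        scan = user32.VkKeyScanW(ord(ch))  # low byte: VK code, high byte: shift state
        vk, shift = scan & 0xFF, (scan >> 8) & 0xFF
        if shift & 1:
            user32.keybd_event(VK_SHIFT, 0, 0, 0)
        user32.keybd_event(vk, 0, 0, 0)                # key down
        user32.keybd_event(vk, 0, KEYEVENTF_KEYUP, 0)  # key up
        if shift & 1:
            user32.keybd_event(VK_SHIFT, 0, KEYEVENTF_KEYUP, 0)
        time.sleep(delay)

# Note the two spaces: the title bar really was "America  Online"
hwnd = user32.FindWindowW(None, "America  Online")
if hwnd:
    user32.SetForegroundWindow(hwnd)   # keystrokes go to the foreground window
    time.sleep(random.uniform(5, 60))  # lurk for a bit before "typing"
    type_text("I'll be back.")
```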
Preferably one that is similarly able to understand and interact with web page elements, in addition to app elements and system elements.
For web page elements, you could drive the browser via AppleScript's `do JavaScript` command or use a dedicated browser MCP (Chrome DevTools MCP, Playwright MCP).
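If you go the Playwright route, the payoff is that you address elements semantically instead of by pixel coordinates. A minimal sketch with Playwright's sync Python API (the URL and selector are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    # Target the element by role/name instead of by screen position
    page.get_by_role("link", name="More information").click()
    browser.close()
```

The MCPs do essentially the same thing, just exposed as tool calls the model can invoke.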
I guess I can answer, "yes, I think so."