We originally built this out of pure frustration. While working on our own product (Emitta), we realized that having an LLM 'look' at a screen via vision and guess where to click was ridiculously slow, unreliable, and expensive.
We looked at MCP, but it's strictly for data and tools. We looked at AG-UI and A2UI, but they require building net-new components. We just wanted the agent to operate the clunky, existing UI we already had. So we wrote a protocol that gives the agent a structured 'map' of the live DOM and lets it send back native execution commands (like set_field and click).
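To make the idea concrete, here's a rough sketch of what the exchange could look like. This is not the actual ACP spec, just a hypothetical shape I'm using for illustration: the page exposes a pruned map of interactive nodes with stable ids, and the agent replies with commands that reference those ids instead of pixel coordinates.

```typescript
// Hypothetical message shapes, NOT the actual protocol spec.

type UINode = {
  id: string;      // stable reference the agent uses in commands
  role: string;    // e.g. "textbox", "button"
  label?: string;  // accessible name, if any
  value?: string;  // current value for inputs
};

type Command =
  | { action: "set_field"; target: string; value: string }
  | { action: "click"; target: string };

// What the agent might be shown for a simple login form
const uiMap: UINode[] = [
  { id: "n1", role: "textbox", label: "Email" },
  { id: "n2", role: "button", label: "Sign in" },
];

// The agent targets nodes by id rather than guessing coordinates
const commands: Command[] = [
  { action: "set_field", target: "n1", value: "dev@example.com" },
  { action: "click", target: "n2" },
];

console.log(JSON.stringify(commands));
```

The point of the id-based targeting is that execution is deterministic: no vision model, no screenshot round-trips, and the same command works even if the layout shifts.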
The reference server is up on npm (@acprotocol/server). I'm around all day and would love to hear your thoughts on the architecture, whether the action set (8 actions) makes sense to you, and whether you think the native UI-control approach is the right path forward.