I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.
Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.
It makes sense for this model in particular to be trained on synthetic data, though. It's explicitly trained to control a computer, and I doubt there's a large enough amount of public training data for that use case.
I suspect that Chinese models are largely forced to be open source as a trust-building step, because of general China-phobia in the West. There are tons of stellar LLMs available from major US companies if you're just using an API. It's also a convenient marketing and differentiation opportunity. Some of the companies behind the bigger "agentic" models have started to offer a cheap subscription alternative to the US companies' offerings. If they build up a big enough business, I wouldn't be surprised if they stop open sourcing.
Why not? That's the way to go. In some domains the only way to go.
| Task Segment | Tasks | SoM GPT-4o-0513 | SoM o3-mini | SoM GPT-4o | GLM-4.1V-9B | OAI Comp-Use | UI-TARS-1.5 | Fara-7B |
|---|---|---|---|---|---|---|---|---|
| **Single-Site Tasks** | | | | | | | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| **Multi-Step Tasks** | | | | | | | | |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| Overall | | | | | | | | |
I've been playing with the Qwen3-VL-30B model using Playwright to automate some common things I do in browsers, and the LLM does "reasonably well", in that it accelerates finding the right ways to wrangle a page with Playwright, but then you want to capture that in code anyway for repeated use.
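That "capture it in code for repeated use" step can be sketched roughly like this: wrap the page object in a recorder so that once an LLM-guided run succeeds, the sequence of actions can be emitted as a deterministic replay script. This is a hypothetical sketch, not Playwright's API itself; `RecordingPage` and `to_script` are made-up names, and the wrapped object is assumed to expose Playwright-style `click`/`fill` methods.

```python
class RecordingPage:
    """Hypothetical wrapper that records Playwright-style actions
    (click/fill) so a successful LLM-guided run can be replayed
    later without the LLM in the loop."""

    def __init__(self, page):
        self._page = page      # the real (or dummy) page object
        self.trace = []        # list of (method, args) tuples

    def click(self, selector):
        self.trace.append(("click", (selector,)))
        self._page.click(selector)

    def fill(self, selector, value):
        self.trace.append(("fill", (selector, value)))
        self._page.fill(selector, value)

    def to_script(self):
        # Emit a standalone replay function from the recorded trace.
        lines = ["def replay(page):"]
        for method, args in self.trace:
            arg_str = ", ".join(repr(a) for a in args)
            lines.append(f"    page.{method}({arg_str})")
        return "\n".join(lines)
```

The generated `replay(page)` function then runs against a plain Playwright page with no model calls at all, which is essentially what caching layers like Stagehand's do more robustly.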
I wonder how this compares -- supposedly purpose made for the task, but also significantly smaller.
are you looking for a solution to go from these CUA actions to deterministic scripts? check out https://docs.stagehand.dev/v3/best-practices/caching
people have been experimenting with this since early Opus days.
Check out kRPC. Get it running (or make your agent get it running) and it's trivial for any of the decent models to interface with it
When I tried it with Opus 3, I got a lot of really funny urgent messages during failures, like "There has been an emergency, initiating near-real-time procedures for crew evacuation..." and then it would just decouple every stage and ram into the ground.
Makes for a fun ant-farm to watch though.
I like how they classify the sub-problems of their work: environment / self-questioning -> task / self-questioning -> trajectory / self-evaluation. OODA-esque.
https://arxiv.org/abs/2511.10395 https://github.com/modelscope/AgentEvolver with thanks to Sung Kim who has been a great feed https://bsky.app/profile/sungkim.bsky.social/post/3m5xkgttk3...
I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual required hardware specs for any self-hosted AI model. Any tips/suggestions/recommended resources for that?
You'll also need to load inputs (images in this case) onto the GPU memory, and that depends on the image resolution and batch size.
You're not finding hardware specs because there are a lot of variables at play - the degree to which the weights are quantized, how much space you want to set aside for the KV cache, extra memory needed for multimodal features, etc.
My rule of thumb is 1 byte per parameter to be comfortable (running a quantization with somewhere between 4.5 and 6 bits per parameter and leaving some room for the cache and extras), so 7 GB for 7 billion parameters. If you need a really large context you'll need more; if you want to push it you can get away with a little less.
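That rule of thumb is easy to turn into a quick back-of-the-envelope calculation. This is a minimal sketch, assuming weights quantized to some number of bits per parameter plus a flat allowance for KV cache and multimodal extras; `vram_estimate_gb` and the default values are my own, not from any model card.

```python
def vram_estimate_gb(params_billions, bits_per_weight=8.0, overhead_gb=1.5):
    """Rough VRAM estimate: quantized weights plus a flat allowance
    for KV cache and extras. bits_per_weight=8 matches the
    '1 byte per parameter' rule of thumb; real quantizations
    typically land between ~4.5 and ~6 bits per weight."""
    weight_gb = params_billions * bits_per_weight / 8  # GB for weights
    return weight_gb + overhead_gb

# A 7B model at 8 bits/weight: ~7 GB of weights plus overhead,
# comfortably inside a 12 GB card. At ~5 bits/weight it's smaller still.
print(vram_estimate_gb(7))                      # ~8.5 GB total
print(vram_estimate_gb(7, bits_per_weight=5))   # ~5.9 GB total
```

Large contexts blow past the flat overhead term quickly, so treat this as a floor rather than a guarantee.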
The Q4_K_S quantized version of Microsoft Fara 7B is a 5.8GB download. I'm pretty sure it would work on a 12GB Nvidia card. Even the Q8 one (9.5GB) could work.
An agentic LLM is simply one that is especially good at deciding what should be piped as input to other tools, and at making sense of those tools' outputs. Its training regimen usually incorporates more of this kind of tool-use data.
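The loop around such a model can be sketched in a few lines. Everything here is hypothetical: `pick_tool` is a stand-in for the LLM (a real agentic model would emit a structured tool call, e.g. a tool name plus JSON arguments), and the `TOOLS` table is a toy registry.

```python
# Toy tool registry; real agents would wire these to search APIs,
# browsers, shells, etc. Names and behavior here are made up.
TOOLS = {
    "search": lambda query: f"3 results for {query!r}",
    "fetch":  lambda url: f"<html>contents of {url}</html>",
}

def run_agent(pick_tool, task, max_steps=5):
    """Minimal agent loop: the model (pick_tool) chooses the next
    tool call, the harness executes it and feeds the result back,
    until the model emits a 'finish' action or steps run out."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = pick_tool(history)            # model decides next action
        if step["tool"] == "finish":
            return step["args"]["answer"]
        result = TOOLS[step["tool"]](**step["args"])
        history.append({"role": "tool", "content": result})
    return None  # gave up without finishing
```

The "agentic" part of training is exactly about making the model good at the `pick_tool` role: choosing sensible actions and interpreting the tool results appended to `history`.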