187 points by palashawas 4 hours ago | 35 comments
  • angry_octet14 minutes ago
    Great guidance hidden in here for making it expensive for agents to navigate your website. Move elements on screen as the mouse moves, force natural mouse movement to make the UI work, change the button labels in the JS to be randomly named every visit, force scrolling to the bottom of the screen to check for hidden extra tasks...

    Hang on, that sounds like common corporate SaaS apps.

  • merlindru3 hours ago
    I'm building something that fixes this exact problem[1].

    The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.

    The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`

    Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.

    [1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet

    • ctoth2 hours ago
      If agents are what it finally takes to get good a11y I'll take it. I'll bitch about it, but I'll take it.
      • pjc503 minutes ago
        Very real risk of this going in reverse: people building inaccessible websites to prevent AI use.
      • tomjakubowski37 minutes ago
        Playwright, the end-to-end testing framework for the web, provides a strong incentive to give sites good a11y: Playwright tests are an absolute delight to read, write and maintain on properly accessible sites, when using the accessibility locators. Somewhat less so when using a soup of CSS selectors and getByText()-style locators.

        One thing I am curious about is a hybrid approach where LLMs work in conjunction with vision models (and probes which can query/manipulate the DOM) to generate Playwright code which wraps browser access to the site in a local, programmable API. Then you'd have agents use that API to access the site rather than going through the vision agents for everything.

      • merlindru2 hours ago
        i think this goes both ways too :) agents have been a boon for everyone with disabilities, carpal tunnel, RSI, ADHD, anything

        and now the fact that interfaces need to be accessible to agents, not just humans, ironically increases it for humans in return

        • lopis22 minutes ago
          And let's not forget that not all disabilities are chronic. Many disabilities are situational or temporary. AI is a great assist for a hangover day for example...
    • gbriel3 hours ago
      This is a good solution, instead of everyone blowing tokens on repeating the same computer use task, come up with a way to share the workflows. I think you'd need to make sure there aren't workflows shared that extract user information (passwords).
      • merlindru3 hours ago
        this is protected against at the OS level, provided the applications declare the input correctly as a SecureTextField.

        i so far haven't found any application that doesn't.

        all you're able to get out, as far as i can tell, is the length of the entered password.

    • hellojimboan hour ago
      Isn't that basically what Browserbase does? I've found the hardest part of browser use to be stealth first, then client change management, then browser comprehension (which gets better with every new model).
      • merlindru43 minutes ago
        i'm not too familiar with browserbase, but invoke works with any macOS app (or at least the accessible ones), i think browserbase is only for browser usage.

        in the context of this blog post, the conclusion looks similar though!

        "use the whole web like it's an API"

        works much better than

        "figure out similar or identical tasks from a clean slate every single time you do them"

    • teej3 hours ago
      You should call it Braille
      • merlindru3 hours ago
        shit, why didn't i think of that

        i tend to think of invoke as "an API over macOS apps" tho...

        doesn't `invoke finder shareAndCopyLink` read very nicely? :P

  • Worfan hour ago
    Is it possible to ask the vision agent to "map" the UI and expose it to another agent as a set of interfaces that resemble an API better? From what I understand the vision agent now should both know that "next page" shows more results and that they need to get more results in the first place.

    If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent is then given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?

    With an example UI I made up, the description (API-like interface definition) could be something like:

      Get all reviews:
    
      To get all the reviews you need to go to each page and click "show full review" for every review summary in that page.
    
      Go to each page:
    
      Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
    
    So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment.
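
The made-up "Get all reviews" description above could also be handed to the second agent as code rather than prose. A minimal sketch, assuming a hypothetical `ReviewsUI` driver (none of these names come from the article or any real tool) that the exploring agent would generate once:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class ReviewsUI(Protocol):
    """Hypothetical driver interface that the exploring agent would produce."""

    def summaries_on_page(self) -> list[str]: ...
    def show_full_review(self, summary: str) -> str: ...
    def has_next(self) -> bool: ...
    def click_next(self) -> None: ...


def get_all_reviews(ui: ReviewsUI) -> list[str]:
    """Start at page 1, expand every summary, and click "next" until it disappears."""
    reviews: list[str] = []
    while True:
        for summary in ui.summaries_on_page():
            reviews.append(ui.show_full_review(summary))
        if not ui.has_next():
            return reviews
        ui.click_next()


@dataclass
class FakeReviewsUI:
    """In-memory stand-in for the made-up Reviews tab, only here to exercise the loop."""

    pages: list[list[str]]
    index: int = 0

    def summaries_on_page(self) -> list[str]:
        return self.pages[self.index]

    def show_full_review(self, summary: str) -> str:
        return f"full:{summary}"

    def has_next(self) -> bool:
        return self.index < len(self.pages) - 1

    def click_next(self) -> None:
        self.index += 1
```

The task agent then only calls `get_all_reviews(ui)`; the pagination knowledge lives in one place and does not have to be rediscovered on every run.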

    Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.

    • angry_octet25 minutes ago
      I think you're right, you can get agents to do what we do -- learn how a website works. Then expose that model as a simple API. There will still be some vision tasks for navigation but they will be just vision tasks, no thinking required.
  • rahulycan hour ago
    All the websites currently blocking Claude Code or other AI agents are fighting a losing battle. Computer-use is in the early stages, and the thing preventing mass-adoption seems to be the number of tokens it takes. Agents can fumble around trying 10 CLI commands that don't work before finding the right one and we barely notice. But other visual agents (browser use / computer use etc) end up eventually fumbling on to the right thing, but we don't have the patience to wait 20 mins. to click a button. As tokens get cheaper + faster, we probably get the models that can use a UI interface just as natively as a CLI.
    • boringg35 minutes ago
      Tokens cheaper? I don't think that seems to be the case ... VC funded tokens were there to build user base and token price will go up as they eventually switch from growth to profitability.
      • Aurornis28 minutes ago
        I wish I could place a lot of money on the opposite side of this bet.

        I don't think many realize how good the cheap, alternative models are becoming. I prefer SOTA models for key work, but I can also spend 10X as many tokens on an open model hosted by a non-VC subsidized provider (who is selling at a profit) for tasks that can tolerate slightly less quality.

        The situation is only getting better as models improve and data centers get built out.

        • boringg14 minutes ago
          Fair - there are bets both ways though I wouldn't consider it to be a certainty. That revenue drive on this AI build out is going to be real and multifold.
      • bheadmaster32 minutes ago
        It will take a few years until scheduled data center construction finishes, and together with software optimizations that may come up in the meantime, it may cause a significant decrease in token price.
    • johnsmith184030 minutes ago
      And the lethal trifecta but I suppose that's all agents as of now anyhow. Every AI provider has major warnings about letting AI have access to PII in the browser.
  • jacktu3 hours ago
    Totally agree. I’ve been building an AI visual tool recently and experimented with both approaches. The latency and cost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper, but more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.
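
For readers wondering what "strict JSON schemas" buys you in practice, here is a minimal sketch using only the standard library. The `CART_ITEM_SCHEMA` fields and the `parse_strict` helper are assumptions made up for illustration, not anything from the commenter's product; a real project would more likely use a schema library.

```python
import json

# Hypothetical schema: field name -> required Python type.
CART_ITEM_SCHEMA = {"name": str, "quantity": int}


def parse_strict(raw: str, schema: dict) -> dict:
    """Parse model output and reject anything that is not exactly the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(schema):
        # Symmetric difference lists both missing and unexpected keys.
        raise ValueError(f"key mismatch: {sorted(set(data) ^ set(schema))}")
    for key, expected in schema.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise ValueError(f"{key!r} must be {expected.__name__}")
    return data
```

Because a malformed response fails loudly at this boundary, each step in a chain of LLM calls either yields a known shape or can be retried, which is where the determinism comes from.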
    • ai_fry_ur_brain2 hours ago
      "Agentic engineering" were always just FADs to bring in more revenue for token providers.

      If I think an LLM is good for something I create well defined, very deterministic "middleware" for that purpose on top of Openrouter.

      • k__2 hours ago
        Agentic engineers can build well defined, very deterministic middleware on top of OpenRouter.

        Anthropic even says that an agent-based solution should only be your last resort and that most problems are well served with a one-shot.

        https://www.anthropic.com/engineering/building-effective-age...

        • ai_fry_ur_brainan hour ago
          Written 1.5 years ago. Anthropic would not advertise this stance today.

          I'm much more agreeable with that type of LLM workflow. Running "agents" with monolithic "harness" for long time horizon tasks seems wasteful, unnecessary but probably super appealing to lazy people.

      • wahnfrieden2 hours ago
        It’s not a fad or without value.
        • ai_fry_ur_brainan hour ago
          It's very much valuable to lazy people who don't care about quality or doing hard things. I totally see the appeal for those people.
          • wahnfriedenan hour ago
            Sounds like you are more interested in performativity aesthetics of production
  • johnsmith184044 minutes ago
    Text based web browsing? Would love the comparison there. Tons of systems have a DOM translation layer. I'm building around this with the concept of turning a webpage into text for an agent to use directly. I actually had to move away from Haiku not because of accuracy problems but because it operated the browser too fast for a human to follow what it was doing. The real loss here is bespoke webapps like Figma or Google Docs, where it's nearly impossible to see what they are doing via the DOM.

    To me the browser is a translation layer. Working on the browser directly while hard enables big advantages on compatibility. The only thing I miss as of now which is on the todo is ocr of the images in the browser into text out. But an api would need to do that anyways to work.

    The main loss in my view of pure API based is, where do you get the data? We won't replicate human work without seeing that done. Humans work in the UI that's it. Computer use to me is the promise of being able to replicate end to end actions a human does. API can do that in theory but the data to do that is also near impossible to collect properly.

  • orliesaurusan hour ago
    Computer Use? Or Browser Use? IMHO big diff

    The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.

    In the new world, we have access to OpenAPI.json and whatnot, but in the world where things were built in the days pre-OpenAPI and pre-specs and best practices...I am not so sure! (and a lot of the world still lives there)

    Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.

    [1] https://stoplight.io/open-source/prism
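
The "capture the API calls, then replay them in succession" idea above can be sketched in a few lines. This is only an illustration: `RecordedCall`, `dump_calls`, `load_calls`, and the `send` transport are hypothetical names, and nothing here reflects Prism's actual file format or API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RecordedCall:
    """One captured request (hypothetical shape, not any tool's real format)."""

    method: str
    path: str
    body: Optional[dict] = None


def dump_calls(calls: List[RecordedCall]) -> str:
    """Serialize a captured session so it can be stored and shared."""
    return json.dumps([asdict(call) for call in calls])


def load_calls(raw: str) -> List[RecordedCall]:
    """Rebuild the captured session from its serialized form."""
    return [RecordedCall(**item) for item in json.loads(raw)]


def replay(calls: List[RecordedCall], send: Callable[[RecordedCall], object]) -> list:
    """Re-issue the calls in their original order through a caller-supplied transport."""
    return [send(call) for call in calls]
```

Keeping the transport as a parameter means the same recording can be replayed against a real HTTP client or a stub, depending on what the caller passes in.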

  • etothet42 minutes ago
    Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughed at how slow it was. And a few months later it hadn't really seemed to improve that much.

    Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.

    Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.

    A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up in adding a lot of artificial waits for certain elements to exist on the page.
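
A common alternative to fixed sleeps is to poll for the condition with a deadline. A minimal sketch, assuming a caller-supplied `predicate` (for example "is this element present yet?"); `wait_until` is a made-up helper, not a function from any browser automation library:

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

This returns as soon as the element shows up instead of always paying the worst-case wait, though it still costs wall-clock time whenever the page is genuinely slow.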

  • rgilliottean hour ago
    Many people are working on that :-)

    Apps written now will have mcp servers / AI compatibility when relevant

    The issue that still needs solving is how to make llms interact with everything we already have and use (efficiently, not with screenshot, read, screenshot, ...)

    Most of the time that means reverse engineering, either the app itself or the APIs it uses

    From github (not my projects):

    https://github.com/SimoneAvogadro/android-reverse-engineerin... => reverse engineer android app APIs from APKs

    https://github.com/HKUDS/CLI-Anything => convert open-source GUI apps to CLIs

    https://github.com/kalil0321/reverse-api-engineer => API reverse engineering from traffic (claude skills)

    My take at the same issue (very young project):

    Also api reverse engineering from traffic captures, with a focus on mobile app, safety & community mcp generation

    https://getspectral.sh

    https://github.com/spectral-mcp/spectral

  • antves3 hours ago
    I think one main point is that not all "computer use" is the same, the harness and agentic experience matters a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience

    In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)

    At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficiently as possible, and the UX is cabled up to abstract the APIs in well-understood interface patterns e.g. dropdowns or autocompletes. This makes navigation easier and that's why small models can do it, which is another dimension that must be considered

    We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be

  • sheepscreekan hour ago
    This tracks - has been my experience exactly. Not to mention there isn’t a particularly significant lift in accuracy or speed. As things stand, to me it is the worst of both worlds. Expensive and inaccurate.
  • aurareturn4 hours ago
    In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.

    I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

    • planb3 hours ago
      This will not happen. None of the existing apps people use daily on their phones have any incentive to support this. Social media wants the people to doomscroll, shopping apps and booking sites want to use their own dark patterns to make people believe they get a special discount if they buy _now_ and everything else just wants users to see the ads. Why on earth would they offer convenient hooks for AI chatbots?
      • input_sh3 hours ago
        It's even more fascinatingly dumb to have this discussion like 2 or so years after every major platform decided to kill any notion of 3rd party clients they used to support.

        Yes, in an ideal world, that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden, everyone will just decide to make their bread and butter that is data easier to explore via an LLM.

      • aurareturn2 hours ago

          Why on earth would they offer convenient hooks for AI chatbots?
        
        Competition. If I ask my OS-level AI assistant to find a social media reel about an elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more.

        Watch how fast Meta adds this if a new hot shot social media app succeeds by designing for AI agents controlled by users.

        • JambalayaJimbo21 minutes ago
          >Competition. If I ask my OS-level AI assistant to find a social media reel about a elephant dancing, the social media app that exposes a set of APIs for an AI agent might get used more.

          This is the exact opposite of what will happen (and in fact what has happened). Reddit is suing Perplexity right now for scraping.

          Meta will not serve content to some other app for free - for what benefit? They will not see advertising data.

        • swiftcoder2 hours ago
          Having used a chatbot to find a reel Meta was censoring from search in the past... I'm not sure how well the incentives align
      • ai_fry_ur_brain2 hours ago
        These people are delusional and want to build a world that's convenient for them to accomplish things lazily with LLMs.

        There are no shortcuts in life and it's just expensive text autocomplete.

        "Lets spin up $750k in GPUs full throttle to scrape a web page with my $200.00 CC subscription."

        Everyone is delusional.

      • jackphilson3 hours ago
        because the social media sites that do will outcompete once people get personal AI coaches that tell them to use technology that is better for them.
        • donaldjbiden3 hours ago
          How is an AI posting on your social media better for you?
          • kaashif3 hours ago
            It's not, but token peddlers will say it is. It's good to interact with everything through buying tokens.
            • charcircuit2 hours ago
              And how will a token peddler's social media company survive after the hype runs out?
    • tikhonj3 hours ago
      Everything exposed programmatically would have been great even without agents—the NixOSes and Emacses of the world show just how amazing a fully flexible and programmable world would be—but I'm glad that the advent of AI is getting people invested in this vision :P
    • pmontra3 hours ago
      I still have to understand what my AI agents could do that I don't want to do myself. Buy stuff? No thanks, I want to see what I buy. I think that they are 99% a solution in search of a problem.
      • sbrother3 hours ago
        Same. Well the biggest thing I don't want to do that they could help with is work. But in the cases where it can do that for me, there's no world where that benefit goes to me rather than my employer.
        • pmontraan hour ago
          Well, that's the very nature of the employer / employee relationship. In my case I write software for my customers and I trade time for money. If I use an AI to write code two times faster my daily rate doesn't double. However I can keep my customers.

          That's only another step in the path I experienced since the 80s, when I had to type every single character because there was no auto complete, no command line history, very few libraries. I was very good at writing trees, hash tables, linked lists and so was everybody else. Nobody would hire me if I were that slow at writing code today.

    • joshstrange3 hours ago
      > I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

      This is not going to happen, or if it does it will just be Android (like Samsung reskins/modifies it) and it will certainly use Google Play Services.

    • zozbot234an hour ago
      > In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.

      So, like a Unix system?

    • mtoner234 hours ago
      Openai should not design a phone... They should try making money first
      • sophacles3 hours ago
        Nonsense. Don't you know how bubbles work? Everyone does massive rushes for all the low hanging and medium hanging fruit. Then the bubble pops and the randomized carnage of companies big and small being destroyed is sifted through by the next wave of companies actually intended to make money.

        The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random and any notion of "this was a good business model and that was bad" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.

    • awongh3 hours ago
      At the beginning of the internet we were promised the free flow of digital information between computers, peer-to-peer. What we got was silos of content each fighting each other to make sure that the silos stay intact with DRM.

      I could imagine an AI future where agentic shopping companies who promise me the best deal are pitted against Walmart and Amazon, trying to algorithmically squeeze me for $2 more- just two bots playing a cat and mouse game to save me a few bucks.

      For some reason a lot of tech ends up in these antagonistic monopolies: Apple wants to sell privacy-aware devices as a product feature, Google wants to give you mail and maps, but sell your data. Despite any appearances neither gives a shit about you, even if you benefit from the dynamic.

    • switchbak3 hours ago
      "In an agentic world, the OS needs to be completely rethought" - if AI is progressing as fast as we think it is, I don't think we'll be interested in waiting for the world to rebuild all the legacy tooling from the OS up. For new stuff, that'd be great.

      I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.

    • ssl-3an hour ago
      We'll just close the loop with a systemd MCP, set the shell to /usr/bin/codex, and find some other way to pay the bills.

      Perfect.

    • airstrike2 hours ago
      It doesn't need to be mobile. The AI-first OS will be headless, undoubtedly.

      Humans would be the second-class users of said OS, which can generate UIs on demand as needed.

      I've thought about this quite a bit. Started implementing as a side project, but I have too many side projects at the moment...

    • lazide3 hours ago
      This is like insisting - after the problem turns out to be harder than thought - that the world's roads need to be completely redone to make them self driving friendly, so self driving can work.

      Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?

    • jnwatson2 hours ago
      Android is working on it. See AppFunctions.

      https://developer.android.com/ai/appfunctions

    • throwaway274483 hours ago
      We have a much better chance of an ai-addressable Harmony OS version than of OpenAI making a serious competitor.
    • FirestarAlpha3 hours ago
      That’s actually what the Reflex plugin behind the APIs in the benchmark does. It creates APIs from your app’s event handlers, thereby providing a stateful way for agents to navigate apps.

      It’s why we did this benchmark :) - reflex team member

    • CodingJeebus3 hours ago
      One of the most seductive (and destructive) forces in software is the desire to rewrite from scratch because rewrites never, ever, ever go as planned. With AI, we're now thinking it's a good idea to rewrite the entire platform from the ground-up. Wild.
      • convolvatron3 hours ago
        except every single piece of progress that we have is the result of trying to do things a different way. so unless you really think we've reached the pinnacle of operating system design, there has to be some room for this?
        • CodingJeebus2 hours ago
          There's a very big difference between building onto an existing system and rewriting from the ground up. I'm not opposed to making progress and trying things differently, but saying things like "we need to completely rethink the operating system" is like saying "we need to completely redesign New York City". The most effective progress is incremental, not throwing the old system away wholesale.

          The modern javascript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch and it's a nightmare.

    • donaldjbiden3 hours ago
      We used to have this. It was called OLE Automation.
    • pier253 hours ago
      And when the agent fucks up badly (as we've seen over and over again) who will be held accountable? The user?
    • dummydummy12343 hours ago
      Why not use the same acc disability features?
    • reorder96953 hours ago
      Presumably on Linux at least apps could just expose a DBus API? The machinery for this is already in place as far as I can tell.
    • shiandow3 hours ago
      Ah yes. The trains everywhere approach to self driving cars.
    • dist-epoch3 hours ago
      The future is "dark OSs" - OSes with no human users.
      • wartywhoa232 hours ago
        Launched to nuclear fanfare on August 29th.
    • QuercusMax3 hours ago
      Lots of apps actually do have all their functionality exposable via an API - but it's an internal API that's hidden from the user.
  • janalsncm3 hours ago
    Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.

    The only reason you wouldn’t choose an API is if it wasn’t viable.

  • _boffin_3 hours ago
    What i don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked after the first few iterations of using a specific application. if a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out the handles that are there and store them after.

    idk.. not really thought out too much, but has to be better
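
The caching idea above (remember the handle, fall back to screenshots only on a miss) might look roughly like this. It is a sketch under assumptions: `HandleCache`, `locate_by_vision`, and `is_valid` are hypothetical names, handles are plain strings here, and no real OS accessibility or windowing API is being called.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (application name, UI element label)


class HandleCache:
    """Remember element handles; use the slow vision path only when needed."""

    def __init__(
        self,
        locate_by_vision: Callable[[str, str], str],
        is_valid: Callable[[str], bool],
    ) -> None:
        self._locate_by_vision = locate_by_vision  # slow path: screenshot + bounding boxes
        self._is_valid = is_valid                  # cheap check that a handle still exists
        self._handles: Dict[Key, str] = {}

    def resolve(self, app: str, label: str) -> str:
        key = (app, label)
        handle = self._handles.get(key)
        if handle is not None and self._is_valid(handle):
            return handle  # fast path: no model call, no screenshot
        # New case or stale handle: drop back to vision, then store the result.
        handle = self._locate_by_vision(app, label)
        self._handles[key] = handle
        return handle
```

The validity check matters because handles can go stale when an app redraws or updates, so the cache has to be able to fall back rather than click on something that no longer exists.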

  • arjunchintan hour ago
    The hard part about the web is that APIs aren't just available even if the website owner wants them exposed (big if).

    I embedded a Google Calendar widget on my Book a demo page, I don't know the API and Google doesn't expose/maintain one either.

    What we are doing at Retriever AI is to instead reverse engineer the website APIs on the fly and call them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...

  • ai_fry_ur_brain2 hours ago
    It's funny watching the slow mean reversion back to more deterministic tooling.
  • Havoc3 hours ago
    Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

    I can see the appeal in pixel route given universality but wow that seems ugly on efficiency

    • lelanthranan hour ago
      > Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

      Not possible on wayland, maybe on X11 protocol?

    • donaldjbiden3 hours ago
      Wayland only has pixels. It was designed to get rid of all the X11 cruft.
    • QuercusMax3 hours ago
      imagine, if you will, that we had a windowing system that's built on Postscript... lots of folks thought it was a super awesome idea, and built NeXTSTEP around it. https://en.wikipedia.org/wiki/Display_PostScript

      or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D

  • svnt4 hours ago
    > This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.

    > To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

    This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?

    Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

    • palashawas3 hours ago
      This is a fair point.

      The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares

      Pretty hard to guess what Anthropic trained sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now

  • 2001zhaozhao2 hours ago
    I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind.

    I don't think any new app should ever be specifically designed for AI to interact with them through computer use

  • rootcage3 hours ago
    The best use cases I've seen for computer/browser use are for legacy SaaS/Software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use it and pay for it. These companies can barely keep the product alive, they definitely aren't incentivized to maintain an API. In such a case browser use agent seems to be the best (only) way.
    • noprocrasted3 hours ago
      Wouldn't using a coding agent to build a screenscraper be better?
  • sudb4 hours ago
    I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.)

    There are use cases where the vision agent is the more obvious, or only choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.

    1. https://github.com/SawyerHood/dev-browser

    • palashawas4 hours ago
      Interesting! I'll play around with agent-browser and update this article if anything comes up
  • cjbarber4 hours ago
    I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.
  • overgard3 hours ago
    I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)

    The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:

    - Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.

    - Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.

    - Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.

    Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.

    • peyton3 hours ago
      It’s great at

      1. things you wouldn’t otherwise bother doing

      2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail

      “Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.

      I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.

      I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.

      • overgard2 hours ago
        I guess the problem I see here is that if the use case is "things I otherwise wouldn't bother doing", that's fine, but it's pretty niche. I dunno, if you're talking about a human "Agent" (like say in sports or entertainment), they'd be a trusted person to handle business matters outside of your competency (contract negotiations, etc.). I don't see AI "agents" being at all like that, they're more like an intern you need to supervise constantly.
  • gowld3 hours ago
    Confusing title? "Computer Use" is actually "Browser vision"?
  • dist-epoch3 hours ago
    It doesn't matter.

    Electron uses 10x more RAM than regular apps. But it's so convenient.

    Python is 100x slower than C. It's in the top 3 of languages now.

    Worse but more convenient always wins.

  • RobRiveraan hour ago
    UX feedback

    Me: hmm, this title confuses and infuriates Rob.

    [Clicks link]

    Me: Sees same title, repeat feelings of confusion and infuriation

    [Scrolls article down on my smartphone]

    Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.

    [Closes tab]

    [Continues living rest of my life]

    I hope this feedback is well received and understood.

  • moralestapia4 hours ago
    This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.
    • palashawas3 hours ago
      Right - we did this benchmark because we launched a plugin that makes APIs programmatically from an app's human-oriented UI (from the event handlers, to be specific). So any app that has a human-oriented UI now has an API.

      The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.
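
      To make the "API from event handlers" idea concrete, here is a minimal sketch. It is not the plugin's actual implementation: `HandlerRegistry`, the event names, and the toy cart are all hypothetical, and the only assumption is that UI handlers can be registered under a name and then invoked directly instead of being triggered by a click.

```python
from typing import Any, Callable, Dict, List


class HandlerRegistry:
    """Hypothetical registry: collects UI event handlers so they can be called by name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def on(self, event: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        """Decorator that registers a handler under an event name."""
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._handlers[event] = fn
            return fn
        return decorator

    def list_endpoints(self) -> List[str]:
        """Names an agent could discover instead of looking at pixels."""
        return sorted(self._handlers)

    def call(self, event: str, **kwargs: Any) -> Any:
        """Invoke the handler directly, skipping the screenshot-and-click loop."""
        if event not in self._handlers:
            raise KeyError(f"no handler registered for {event!r}")
        return self._handlers[event](**kwargs)


# Toy "app": two buttons whose handlers mutate a cart (illustrative only).
ui = HandlerRegistry()
cart: List[str] = []


@ui.on("add_to_cart.click")
def add_to_cart(item: str) -> int:
    cart.append(item)
    return len(cart)


@ui.on("clear_cart.click")
def clear_cart() -> int:
    cart.clear()
    return 0
```

      With something like this, an agent would call `ui.call("add_to_cart.click", item="book")` rather than locating and clicking the button on screen.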

      • moralestapia3 hours ago
        That is actually great, I'll definitely check it out. Thanks!
  • theabhinavdas9 minutes ago
    For now.
  • zephen2 hours ago
    I find this extremely surprising.

    When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.

  • sanderjd2 hours ago
    Only 45x?
  • taormina4 hours ago
    The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.
    • palashawas4 hours ago
      Yep, everyone knows computer use is more expensive. This is about quantifying the gap
  • volume_tech3 hours ago
    [flagged]
  • faangguyindia3 hours ago
    I saw Codex was screenshotting, then clicking around. I just stopped it and never used that again.

    Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.

    I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...

    I developed all this on $20 codex sub.

    • embedding-shape3 hours ago
      I think it's the third or fourth time I've seen you bragging on HN about how many apps you're able to develop with AI now. Care to link any of them, especially where we can see the actual code that you've produced here? Without being able to see actual results, I'm not sure what you want people to take away from your repeated comments.
      • faangguyindia3 hours ago
        I only write here because people are spreading doomerism about AI here and I am excited about the future.

        Well, I am competing with GeoIP providers like MaxMind.

        I developed a custom traceroute and ping service to geolocate IPs with very high accuracy, beating products like Digital Element, MaxMind, and IPinfo.

        These companies have huge teams. But my 3-person company already beat them.

        Code doesn't matter much, it's not an opensource project.

        My free app is http://macrocodex.app which I've developed along with a fitness coach.

        I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.

        I am simply very excited about all this.

        Nobody cares how you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, you aren't breaking things and causing your customers headaches, you are winning.

        Even before AI, bad code existed. Not every company had 10x developers writing beautiful idiomatic Rust code.

        AI is just a tool; people who are trying to generate a whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strengths and weaknesses.
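
        For readers wondering how ping measurements can constrain an IP's location at all, here is a rough sketch of the general latency-bound idea. It is not the service described in this comment: the fiber speed factor, the probe coordinates, and the function names are illustrative assumptions. A round-trip time caps how far away the target can be, so each probe rules out every candidate location outside its radius.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_FACTOR = 2.0 / 3.0  # assumption: signals in fiber travel at roughly 2/3 of c
EARTH_RADIUS_KM = 6371.0


def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on the distance to a target given a round-trip time."""
    one_way_seconds = (rtt_ms / 1000.0) / 2.0
    return one_way_seconds * SPEED_OF_LIGHT_KM_PER_S * FIBER_FACTOR


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def feasible_cities(probes: List[Tuple[Coord, float]],
                    cities: Dict[str, Coord]) -> List[str]:
    """Keep only the cities that lie within every probe's RTT-derived radius."""
    return sorted(
        name for name, location in cities.items()
        if all(haversine_km(probe, location) <= max_distance_km(rtt_ms)
               for probe, rtt_ms in probes)
    )
```

        A single probe in Frankfurt that measures an 8 ms round trip already excludes New York as a candidate; real systems would combine many probes and account for routing detours and queuing delay, which this sketch ignores.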

        • embedding-shape2 hours ago
          > Code doesn't matter much, it's not an opensource project.

          Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise, you clearly have made up your mind about what to focus on.

          > My free app is http://macrocodex.app which I've developed along with a fitness coach.

          Crazy, this app you've run for ~1-2 months has 10K active users already, even though there is zero info about who runs it, zero reviews, and says "Download on the App Store" on the landing page even though you then ask people to use the web app, impressive.

          I don't think anyone said using AI can't produce a ton of code really quickly, and no one is finding that difficult to manage either. But most of us software engineers are trying to build long-lasting codebases with AI too, where "less === better" typically holds, so it's not about being able to spit out features as fast as possible, but about keeping the ever-growing codebase from collapsing on top of itself, and keeping each prompt as fast as on a greenfield project instead of getting slower and slower.

          Sounds like you've found the holy grail of being able to avoid that, kudos if so. Judging by you giving zero care to how the design and architecture actually is, I kind of find that hard to believe. But, if it works for you, it works for you, not up to me or others to dictate how you build stuff, hope you enjoy it, however you build stuff :)

      • nonameiguess3 hours ago
        Why even bother asking a guy with the statistical acumen to think he can make a reliable estimate of a monthly average from some span of time shorter than two months? He's probably just going to say it doesn't matter and unfortunately he's probably right. If you sound excited enough, you can convince other people and close deals, so who gives a shit if there's really a there there? We'll see how he's doing in another decade. Reminds me of my sister always trying to get into real estate and mortgage brokerage speculation, glowing whenever there's a market spike about people pulling in 200 grand a month, yet 25 years later she's still broke, doesn't own her own house, and her daughter is constantly asking me for money instead of her.
    • ceejayoz3 hours ago
      Claude does this too, with the Chrome extension.

      It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.