I have built a similar project for mobile automation [1]. The validator phase is not separate; rather, it's inherently baked into each step, since we only ask for the next step based on the current UI and previous actions.
My Planner sometimes goes, "Oh, we are still at the home screen, let's find the Uber app icon." This sort of self-correcting behaviour was never programmed; the LLM does it on its own.
1. https://github.com/BandarLabs/ClickClickClick - A framework to automate mobile use via any LLM (local/remote)
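The loop being described can be sketched roughly like this (the function names and screen strings are illustrative, not taken from the linked repo):

```python
# Minimal sketch of a planner loop where validation is implicit:
# each step re-reads the current UI, so a failed or missing action is
# visible to the next planning call and the model can self-correct.

def plan_next_step(current_ui, history):
    """Stand-in for an LLM call: given the current screen description
    and prior actions, return the next action (or None when done)."""
    if "home screen" in current_ui and "open_uber" not in history:
        return "open_uber"          # e.g. tap the Uber app icon
    if "uber" in current_ui:
        return None                 # goal reached
    return "go_home"                # recover: navigate back to home

def run(get_ui, do_action, max_steps=10):
    """Drive the device until the planner says we're done."""
    history = []
    for _ in range(max_steps):
        action = plan_next_step(get_ui(), history)
        if action is None:
            break
        do_action(action)
        history.append(action)
    return history
```

Because every planning call sees the fresh UI state, "we are still at the home screen" falls out naturally: the recovery branch fires without any explicit validator.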
This one example alone has so many branches that would require knowing what’s in my head.
On the flip side, it’s a ridiculously simple task for a human to do for themselves, so what am I truly saving?
Call me when I can ask it to check the professional reviews of X category on N websites (plus YouTube), summarize them for me, and find the cheapest source for the top 2 options in the category that will arrive in Y days or sooner.
That would be useful.
Let's say that you are a parts procurement shop and want to order 10,000 of SKU1 and 20,000 of SKU2. If you go on parts websites like finditparts.com, you'll see that there is little ambiguity when it comes to ordering specific SKUs.
We've seen cases of companies that want to automate item ordering like this on tens of different websites, and have people (usually the CEO) spending a few hours a week doing it manually.
Writing a script to do it can take ~10-20 hours (if you know how to code), but we can help you automate it in <30 minutes with Skyvern, even if you don't know how to code!
(Idk if we can trust a human either - brain farts are a thing after all, but at least humans are accountable. Machines are not - at least not at the moment.)
So my point is that while you might get some false positives, it's worth automating as long as most of the decisions are reversible or correctable.
You might not want to use this in every case, but it's still worthwhile for many. Whether a use case is worth automating depends on its acceptable rate of error.
Dynamic CVV would mean you'd have to authorize the payment. If amount seems off, decline.
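As a toy sketch of that rule (the function name and tolerance are made up for illustration, not from any real payment API):

```python
def authorize(expected_total, charged_amount, tolerance=0.01):
    """Approve a dynamic-CVV payment only if the charged amount is
    within `tolerance` (as a fraction) of what the agent expected
    to pay; otherwise decline."""
    if expected_total <= 0:
        return False
    return abs(charged_amount - expected_total) / expected_total <= tolerance
```

The point is that the human stays in the loop exactly once, at authorization time, and an off-looking amount is simply declined.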
To be clear, I don't think I'd use it, but if it could save you time (a precious commodity in our day and age) with a good signal-to-noise ratio, it's a win-win for the user, the author, and Amazon.
If you want to buy an Apple device from a trusted party, including trusted accessories, then there's apple.com. My point being: buying from there is much more secure. But even then, there is no single "1 iPhone 16"; there are variants. Many of them.
If you have an AWS account created before 2017, an Amazon ban means an AWS ban.
We're working on 2 major improvements that will get cost down at scale:

1. We're building a code generation layer under the hood that will start to memorize actions Skyvern has taken on a website, so repeated runs will be nearly free.

2. We're exploring some graph re-ranking techniques to eliminate useless elements from the HTML DOM when analyzing the page. For example, if you're looking at the product page and want to add a product to cart, the likelihood you'll need to interact with the Reviews page will be 0. No need to send that context along to the LLM.
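A crude, tag-whitelist version of that DOM pruning might look like the sketch below. (This is an assumption about the general idea, not Skyvern's actual heuristic; a real re-ranker would also score relevance, e.g. dropping the Reviews link in the add-to-cart example, while this sketch only filters by element type.)

```python
from html.parser import HTMLParser

# Elements an agent could plausibly act on; everything else
# (review text, footers, scripts, ...) is dropped before the
# page is sent to the LLM, shrinking the context.
INTERACTIVE = {"a", "button", "input", "select", "textarea", "option", "label"}

class InteractiveCollector(HTMLParser):
    """Collect a flat list of interactive elements from raw HTML."""
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            self.elements.append((tag, dict(attrs)))

def prune(html):
    parser = InteractiveCollector()
    parser.feed(html)
    return parser.elements
```

Even this naive filter can cut a product page from tens of thousands of tokens down to a short list of actionable elements.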
Computer vision is useful and very quick; however, it has been my experience that parsing the stacking context is much more useful. The problem is creating a stacking context when a news site embeds a YouTube or Bluesky post: it requires injecting a script into each frame using Playwright. (Not mine, but prior art [0].)
In my free time, I've been quietly solving a problem I encountered creating browser agents, one that didn't have a solution 2 years ago. Most webpages are really several independent global execution contexts, and I'm developing a coherent way to get them all to speak with each other. [1]
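For reference, iterating over a page's frames and evaluating a script inside each one looks roughly like this in Playwright's Python API; the stacking-context query itself is heavily simplified compared to [0], which walks the full set of CSS conditions.

```python
# Heavily simplified sketch: in every frame of a page, list elements
# that (approximately) create their own stacking context, using a
# non-'auto' z-index as a stand-in for the full CSS rules.
JS_SNIPPET = """
() => Array.from(document.querySelectorAll('*'))
         .filter(el => getComputedStyle(el).zIndex !== 'auto')
         .map(el => el.tagName)
"""

def stacking_roots_per_frame(url):
    # Imported lazily so the sketch reads standalone; requires
    # `pip install playwright && playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # page.frames covers the main frame plus every embedded iframe
        # (e.g. a YouTube or Bluesky embed), each of which is its own
        # independent JS execution context.
        results = {frame.url: frame.evaluate(JS_SNIPPET) for frame in page.frames}
        browser.close()
        return results
```

Each frame has to be queried separately because a script injected into the top frame simply cannot see into cross-origin embeds.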
> "Go to Amazon.com and add an iPhone 16, a screen protector, and a case to cart"
Are you familiar with Google Dialogflow? [2] It is a service which returns an object with an intent and parameters, which makes it easy to map to automation actions. I asked ChatGPT to help with an example of how Dialogflow might handle your request. [3]
[0] https://github.com/andreadev-it/stacking-contexts-inspector
[1] https://news.ycombinator.com/item?id=42576240
[2] https://cloud.google.com/dialogflow/es/docs/intents-overview
[3] https://chatgpt.com/share/678ae18d-5370-8004-97d4-f9949887b0...
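To make the intent-to-action mapping concrete: below is a hand-written stand-in for a Dialogflow-style response (not actual detectIntent output, which has more fields) and a trivial dispatcher from intent/parameters to browser actions.

```python
# Hand-written stand-in for a Dialogflow-style payload for the
# "add an iPhone 16, a screen protector, and a case" request.
response = {
    "intent": "add_to_cart",
    "parameters": {
        "site": "amazon.com",
        "items": ["iPhone 16", "screen protector", "case"],
    },
}

def to_actions(resp):
    """Map an intent/parameters object onto a flat list of browser
    actions that an automation layer could execute in order."""
    if resp["intent"] == "add_to_cart":
        params = resp["parameters"]
        actions = [("goto", params["site"])]
        for item in params["items"]:
            actions += [("search", item), ("click", "Add to Cart")]
        return actions
    raise ValueError(f"unhandled intent: {resp['intent']}")
```

The appeal is that the ambiguity lives entirely in the NLU step; once you have structured intent and parameters, the execution side is deterministic.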
We definitely need a new dataset with more complex tasks, like uploading files, handling multiple tabs, and handling many more steps.
I couldn't find where browser-use published their run results (I expected to see them here: https://github.com/browser-use/eval).
We went ahead and published our full run at https://eval.skyvern.com so it could be independently audited.