What I would love to see, either as something leveraging this or built into this: if you prompt Stagehand to extract data from a page, it also returns the XPath selectors you'd use to re-scrape the page without having to use an LLM for that second pass.
So basically, you can scrape never-before-seen pages with the non-deterministic LLM tool, and then when you need to re-scrape the page later, to update content for example, you can use the cheaper old-school scraping method.
Not sure how brittle this would be, both in going from the LLM version to the XPath version reliably and in falling back to the LLM version if your XPath script fails, but conceptually, being able to scrape with the smart tools while building up a library of dumb scraping scripts over time would be killer.
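To make the idea concrete, here's a rough sketch of that loop under stated assumptions: `extractWithLLM` is a made-up placeholder for the LLM-backed first pass (not an existing Stagehand API), and plain Playwright does the cheap deterministic re-scrape, falling back to the LLM when the cached XPaths stop matching.

```typescript
// Hypothetical sketch: the LLM pass returns data plus the XPaths it resolved;
// later runs replay the XPaths and only fall back to the LLM on failure.
import { chromium, type Page } from "playwright";

type Scraped = { fields: Record<string, string>; xpaths: Record<string, string> };

// Expensive, non-deterministic first pass (placeholder, not a real API).
async function extractWithLLM(page: Page): Promise<Scraped> {
  throw new Error("not implemented: LLM-backed extraction");
}

// Cheap, deterministic re-scrape using previously cached XPaths.
async function rescrape(page: Page, xpaths: Record<string, string>): Promise<Record<string, string>> {
  const fields: Record<string, string> = {};
  for (const [name, xp] of Object.entries(xpaths)) {
    fields[name] = await page.locator(`xpath=${xp}`).innerText({ timeout: 5_000 });
  }
  return fields;
}

async function scrape(page: Page, cachedXpaths?: Record<string, string>): Promise<Scraped> {
  if (cachedXpaths) {
    try {
      return { fields: await rescrape(page, cachedXpaths), xpaths: cachedXpaths };
    } catch {
      // The page changed: fall through, pay for the LLM again, refresh the cache.
    }
  }
  return extractWithLLM(page);
}

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com/listing");
const first = await scrape(page);               // first run: LLM, returns XPaths
const later = await scrape(page, first.xpaths); // later runs: deterministic, LLM only on failure
console.log(later.fields);
await browser.close();
```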
I'm still trying to figure out the boundaries of where and how I want to scale that out into other stuff -- things like when to use `page` methods directly, vs passing a function into `page.evaluate`, vs other alternatives like a browser extension or a CLI tool. And I'm still needing to work around smaller issues with the polyfill and its spec coverage (leaving me to use things like `getAttribute` more than I would otherwise). But in the meantime it's simplified a lot of ancillary issues, like handling failures on my existing workflows and scaling out to new targets, while I work on other bot detection issues.
But of course, the way it works now could also help reduce brittleness. An XPath or selector quickly breaks when the design changes or things are moved around; this approach might get past that.
So tradeoffs, I guess.
Repeatability of extract() is definitely super interesting and something we're looking into
Disclaimer: I am the author.
Personally I'd love to use this as an intermediate workflow for producing deterministic playwright code, but it looks like this is intended for running directly.
I don't think I could plausibly argue for using LLMs at runtime in our test suite at work...
Most of my test failures come down to timing issues—CPU load subtly affects execution, leading to random timeouts. This makes it difficult to run tests both quickly and consistently. While proactive load-testing of the test environment and introducing artificial random delays during test authoring can help, these steps often end up taking more time than writing the tests themselves.
It would be amazing if tools were smart enough to detect these false positives automatically. After all, if a human can spot them, shouldn’t AI be able to as well?
Basically, the goal would be to do it like screenshot regression tests: you get two execution phases:
- generate
- verify
And when verify fails in CI, you can automatically run a generate and open an MR/PR with the new script.
This lets you audit the script and do a plausibility check; you're notified of changes but spend minimal effort keeping the tests running.
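A minimal sketch of that flow, assuming a Playwright-based suite; `generateScript` and `openMergeRequest` are hypothetical placeholders for the LLM step and the repo integration, not real APIs:

```typescript
// Sketch of the generate/verify split: "verify" replays the committed script;
// if it fails, "generate" re-derives it and a human audits the diff in an MR.
import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Placeholder: ask the LLM-backed tool to (re)produce the test script.
async function generateScript(goal: string): Promise<string> {
  throw new Error(`not implemented: generate a script for "${goal}"`);
}

// Placeholder: open an MR/PR so a human can do the plausibility check.
async function openMergeRequest(branch: string, title: string): Promise<void> {
  throw new Error(`not implemented: open "${title}" from ${branch}`);
}

async function verifyOrRegenerate(): Promise<void> {
  try {
    // Verify phase: run the existing deterministic script as-is.
    execSync("npx playwright test tests/login.spec.ts", { stdio: "inherit" });
  } catch {
    // Generate phase: re-derive the script and hand it over for review
    // instead of letting the test rot or get skipped.
    const script = await generateScript("user can log in and start an assignment");
    writeFileSync("tests/login.spec.ts", script);
    await openMergeRequest("chore/regenerate-login-e2e", "Regenerate login E2E after verify failure");
  }
}

verifyOrRegenerate().catch(console.error);
```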
And I think you misunderstood my comment: I didn't describe my project, but extrapolated from the parent's desire and from my motivations for my own project.
Mine is actually pretty close to Stagehand; at the very least I could use it. It's basically a web UI to configure browser tasks like "open webpage X, iterate over item type", with LLM integration to determine what the CSS selector for that would be. On the next execution it would attempt to use the previously determined CSS selector instead of the LLM integration. On failure, it'd raise a notification with an admin task to verify new selectors/fix the script.
But it's a lot of code to put together as a generic UI, since I want these tasks to be repeatable without restarting from the beginning, etc.
Still very much in the PoC stage: no tests, barely working persistence, etc.
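For what it's worth, here's one way the per-task selector persistence and the failure branch could be modeled; everything here (`BrowserTask`, `notifyAdmin`, `resolveSelectorWithLLM`) is an illustrative placeholder, not the actual project:

```typescript
// Sketch: a persisted task with cached CSS selectors, re-resolving via the LLM
// and raising an admin task when a cached selector stops matching.
import { type Page } from "playwright";

type BrowserTask = {
  url: string;
  // Field name -> last known good CSS selector, persisted between runs.
  selectors: Record<string, string>;
};

// Placeholder for a real notification / admin-task system.
async function notifyAdmin(message: string): Promise<void> {
  console.warn(`ADMIN TASK: ${message}`);
}

// Placeholder: ask the LLM for a fresh selector for this field.
async function resolveSelectorWithLLM(page: Page, field: string): Promise<string> {
  throw new Error(`not implemented: resolve selector for "${field}"`);
}

async function runTask(page: Page, task: BrowserTask): Promise<Record<string, string>> {
  await page.goto(task.url);
  const out: Record<string, string> = {};
  for (const [field, selector] of Object.entries(task.selectors)) {
    let current = selector;
    if ((await page.locator(current).count()) === 0) {
      // Cached selector no longer matches: re-resolve and flag for human review.
      current = await resolveSelectorWithLLM(page, field);
      task.selectors[field] = current; // persist for the next run
      await notifyAdmin(`Selector for "${field}" on ${task.url} changed; please verify.`);
    }
    out[field] = await page.locator(current).first().innerText();
  }
  return out;
}
```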
We built basically this: let an LLM agent look at your web page and generate the Playwright code to test it. Running the test is just running the deterministic Playwright code.
Of course, the actual hard work is _maintaining_ end-to-end tests, so our agent can do that for you as well.
Feel free to check us out, we have a no-hassle free tier.
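For readers who haven't seen it, the output being described is just ordinary `@playwright/test` code; an invented example of what a generated login test might look like (URL, labels, and credentials are made up):

```typescript
// Illustrative deterministic output: a plain Playwright test, no LLM at runtime.
import { test, expect } from "@playwright/test";

test("user can log in and reach the dashboard", async ({ page }) => {
  await page.goto("https://app.example.com/login");
  await page.getByLabel("Email").fill("qa@example.com");
  await page.getByLabel("Password").fill("correct horse battery staple");
  await page.getByRole("button", { name: "Sign in" }).click();
  await expect(page.getByRole("heading", { name: "Dashboard" })).toBeVisible();
});
```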
Treating UI test code as some kind of static source of truth is the biggest nightmare in all of UI front-end development. Web UIs naturally have a ton of "jank" that accumulates over time, which leads to a ton of false negatives: slow API calls, random usage of websockets/SSE, invisible elements, non-idempotent endpoints, etc. And having to write "deterministic" test code around all of that is the single biggest reason why no one ever actually does it.
I don't care that the page I'm testing has a different DOM structure now, or uses a different button component with a different test ID. All I care about is "can the user still complete X workflow after my changes have been made". If the LLM wants to completely rewrite the underlying test code, I couldn't care less so long as it still achieves that result and is assuring me that my application works as intended E2E.
It is, in fact, very possible to extract value from testing methods like this, provided you take the proper care and control both the UI and the tests. It's definitely very easy to end up with a flaky suite of tests that's a net drag on productivity, but it's not inevitable.
On the other hand, I have every confidence that an LLM-based test suite would introduce more flakiness and uncertainty than it could rid me of.
And no one ever does. There is zero incentive to spend days wrangling with a flaky UI test throwing a false positive for your new feature, so the test gets skipped, everyone moves on, and it's forgotten. I have literally never seen a project where UI tests were continually added to and maintained after the initial build-out, simply because it is an immense time sink with no visible or perceived value to product, business, or users, and it requires tons of manual maintenance to keep in sync with the application.
Rather, we want Stagehand to assist people who want to build web agents. For example, I was using headless browsers earlier in 2024 to do real-time RAG on e-commerce websites that could aggregate results for vibes-based search queries. These sites might have random DOM changes over time that make it hard to write sustainable DOM selectors, or annoying pop-ups that are hard to deterministically code against.
This is the perfect use for Stagehand! If you're doing QA on your own site, then base Playwright (as you mention) is likely the better solution
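For anyone curious what that looks like in practice, here's a minimal sketch in the shape of the Stagehand README examples (natural-language `act` plus a zod-typed `extract`); exact method names and signatures may differ between versions, and the site, query, and fields are invented:

```typescript
// Hedged sketch of an agent-style scrape: natural-language actions plus a
// schema-typed extraction. Check the Stagehand docs for your version's API.
import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://shop.example.com/search?q=cozy+reading+chair");

// No brittle selector for the pop-up: describe the intent instead.
await page.act("close the newsletter pop-up if it is visible");

// A typed schema keeps the LLM's output machine-checkable.
const { products } = await page.extract({
  instruction: "extract the name and price of the first three products",
  schema: z.object({
    products: z.array(z.object({ name: z.string(), price: z.string() })),
  }),
});

console.log(products);
await stagehand.close();
```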
People in the browser automation space consistently ignore this, for whatever reason. Though, it's right on their site in black and white.
It may become a real problem for the usefulness of this style of LLM-driven browsing.
There are plugins that extend the major browser automation libraries to do this sort of thing.
And undetectable browser automation doesn't use Playwright, Puppeteer, or Selenium at all.
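For example (naming a specific stack the comments above don't), the stealth plugin from the puppeteer-extra ecosystem also works with playwright-extra and patches many of the obvious automation fingerprints; a minimal sketch, with no guarantee it defeats any particular detector:

```typescript
// Stealth setup via playwright-extra reusing puppeteer-extra-plugin-stealth.
// It hides common signals (e.g. navigator.webdriver) but is not bulletproof.
import { chromium } from "playwright-extra";
import stealth from "puppeteer-extra-plugin-stealth";

chromium.use(stealth());

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title());
await browser.close();
```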
I recently tried to implement a workflow automation using similar frameworks that were playwright or puppeteer based. My goal was to log into a bunch of vendor backends and extract values for reporting (no APIs available). What stopped me entirely were websites that implemented an invisible captcha. They can detect a playwright instance by how it interacts with the DOM. Pretty frustrating, but I can totally see this becoming a standard as crawling and scraping is getting out of control.
Do you guys ever think you'll do a similar abstraction for MCP and computer use more broadly?
We're working on a better computer use integration using Stagehand, def a lot of interesting potential there
To be honest, it took me about two minutes of playing around to get annoyed with the inaccuracies of the locally hosted model for that, so I get why you encourage the other approaches.
You might want to check out Lightpanda (https://github.com/lightpanda-io/browser). It's an open-source, lightweight headless browser built from scratch for AI and web automation. It's focused on skipping graphical rendering to make it faster and lighter than Chrome headless.
The web isn't made for agents and automation. It's made for people.
Solely depending on a VLM is indeed reminiscent of how humans interact with the web, but when a model thrives with more data, why restrict the data sent to the model?
That said, the architecture's coming together and the performance gains we're seeing make us excited about what's possible as we keep building. Feedback is very welcome, especially on what APIs you'd like to see us prioritize for specific workflows and use cases.
`npx create-browser-app --example persist-context`
The general idea of it, yes. No one will be writing selector-based tests by hand anymore in a couple of years.
Also, is there some level of deterministic behavior here, or might every test run result in a different underlying command if your wording isn't precise enough?
Skyvern: https://github.com/Skyvern-AI/skyvern
Shortest: https://github.com/anti-work/shortest
I'd love to hear what makes Stagehand different, and the pros/cons.
Of course, I have no complaints to see more competition and open source work in this space. Keep up the great work!
However, because Stagehand is an extension of Playwright, it lets you confirm each step of the underlying agent's workflow, making it the most customizable option for engineers who want/need that.
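Concretely, that means you can interleave ordinary Playwright waits and checks between the agentic steps; a small hedged sketch (API shape as in the README, names and the target app invented):

```typescript
// Sketch: agentic steps with deterministic Playwright checkpoints in between,
// so each stage of the workflow is confirmed before the next one runs.
import { Stagehand } from "@browserbasehq/stagehand";

const stagehand = new Stagehand({ env: "LOCAL" });
await stagehand.init();
const page = stagehand.page;

await page.goto("https://app.example.com");
await page.act("open the sign-in form");
// Deterministic checkpoint: the form actually appeared.
await page.getByRole("button", { name: "Sign in" }).waitFor({ state: "visible" });
await page.act("sign in as the demo user");
// Deterministic checkpoint: the agent really landed on the dashboard.
await page.waitForURL(/dashboard/);
await stagehand.close();
```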
https://playwright.dev/docs/codegen-intro
Is a chatbot an easier way to iterate on a test?
Some minimal model that could be run locally and specifically tuned for this purpose might be pretty fruitful here compared to delegating out to expensive APIs.
I've often thought E2E testing should be done with AI. What I want is to verify that the functionality works (e.g. log in, then start an assignment) without needing to change the test each time the UI changes.