If someone using your Chrome Extension visits a website that I control, I will 100% be able to find a way to drain all their accounts, probably in the background without them even knowing. Here are some ideas about how to mitigate that.
The problem with Playwright is that it requires the Chrome DevTools Protocol (CDP), which opens massive security holes in a browser that people use for banking and for managing anything that involves credit cards or sensitive accounts. At one point, I took the injected folder out of Playwright and put it into a Chrome Extension because I thought I needed its tools. However, I quickly abandoned it, as it was easy to create workflows from scratch. You get a lot of stuff immediately by using Playwright, but you will likely find it much lighter and safer to implement that functionality yourself.
The only benefit of CDP for normal use is that it lets the Chrome Extension automate any action that requires trusted events, e.g. playing sound, going fullscreen, or banking websites that require a trusted event to transfer money. In my opinion, people just want a large part of the workflow automated and don't mind being prompted to click a button when trusted events are required. Since it doesn't matter which button is clicked, you can prompt the user and inject a big button that says "Continue" or whatever is required. Trusted events are there for a reason.
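To make the "prompt the user instead of using CDP" idea concrete, here is a minimal sketch of how a handler can tell a real user click from a scripted one. It relies only on the standard `isTrusted` flag that browsers set on events. The helper name `onTrustedClick` and the `ClickLike` type are made up for illustration and are not from any extension's code.

```typescript
// Minimal shape of the event field this sketch relies on.
// A real DOM MouseEvent has the same read-only `isTrusted` property.
interface ClickLike {
  isTrusted: boolean;
}

// Hypothetical helper: wraps the next workflow step so it only runs
// when the click came from a real user gesture.
function onTrustedClick(resume: () => void): (event: ClickLike) => boolean {
  return (event) => {
    if (!event.isTrusted) {
      return false; // synthetic click dispatched by a script, ignore it
    }
    resume();
    return true;
  };
}

// Example: count how often the paused workflow is allowed to continue.
let resumed = 0;
const handler = onTrustedClick(() => {
  resumed += 1;
});

handler({ isTrusted: false }); // scripted click, workflow stays paused
handler({ isTrusted: true }); // real click, workflow continues
```

In a content script you would presumably attach `handler` to the injected "Continue" button with `addEventListener("click", handler)`.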
[0] https://github.com/andreadev-it/stacking-contexts-inspector
[0] https://chatgpt.com/c/682a2edf-e668-8004-a8ce-568d5dd0ec1c
I'm not certain (I'm supposed to be working on something else, sorry if I'm wrong here), but I believe this is the code Browser Use uses for stacking context, including piercing the shadow DOM. [1] Because they build a map with all the visible elements, they can inject different color borders around them. Here they test for the topmost elements in the viewport. [2]
[0] https://chatgpt.com/share/682a68bf-c6a0-8004-9c20-15508e6b3b...
[1] https://github.com/browser-use/browser-use/blob/55d078ed5a49...
[2] https://github.com/browser-use/browser-use/blob/55d078ed5a49...
This is half the equation. Also, a lot of the information in the markup can be used to query elements to interact with, because it keeps the link locations, which can be used to navigate or select elements. On the other hand, by using the stacking context, it is possible to query only the elements that are visible, which removes all elements that can't be interacted with.
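A rough sketch of the "only keep what is actually on top" filter, assuming the usual trick of hit-testing the center of each element with `elementFromPoint`. This is not Browser Use's code. The `ElementLike` and `DocumentLike` interfaces are stand-ins that mirror the DOM methods used, so the logic can be read (and run) without a browser, and it does not pierce shadow roots.

```typescript
interface Box {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Stand-in for the parts of a DOM Element this sketch needs.
interface ElementLike {
  getBoundingClientRect(): Box;
  contains(other: ElementLike | null): boolean;
}

// Stand-in for the parts of Document this sketch needs.
interface DocumentLike {
  elementFromPoint(x: number, y: number): ElementLike | null;
}

// True when the element (or one of its descendants) is what the browser
// would hit at the element's center point, i.e. nothing is stacked above it.
function isTopmostAtCenter(el: ElementLike, doc: DocumentLike): boolean {
  const box = el.getBoundingClientRect();
  if (box.width === 0 || box.height === 0) {
    return false; // zero-sized elements cannot be clicked
  }
  const hit = doc.elementFromPoint(
    box.left + box.width / 2,
    box.top + box.height / 2,
  );
  return hit !== null && (hit === el || el.contains(hit));
}

// Drop every candidate that is covered by something else.
function filterTopmost<T extends ElementLike>(els: T[], doc: DocumentLike): T[] {
  return els.filter((el) => isTopmostAtCenter(el, doc));
}
```

Checking a single point is a simplification: a partly covered element can pass or fail depending on where the overlap is, so a real implementation would probably sample more points.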
Related to that, I'd suggest also adding the ability to "templify" sessions, i.e. turn sessions into something like email templates, with placeholder tags or something of the like, that either ask the user for input or can be fed input from somewhere else (like an "email merge")
So for example, if I need to get certain data from 10 different websites, either have the macro/session ask me 10 times for a new website (or until I stop it), or allow me to just feed it a list
Anyway, great work! Oh also, if you want to be truly privacy-first you could add support for local LLMs via ollama
I like that suggestion. Saved prompts seem like an obvious addition, and having templating within them makes sense. I wonder how well "for each of the following websites do X" prompts would work (so have the LLM do the enumeration rather than the client). My intuition is that it won't be as robust because of the long accumulated context.
Edit: forgot to mention it does support Ollama already
So for the example above, the user might have to do: "do this for this website", then save macro, then create template, then run template with input: [list of 10 websites]
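For what it's worth, the client-side version of that "template plus list of inputs" flow is small. A minimal sketch, assuming a `{{name}}` placeholder syntax that I picked for illustration (nothing here is from the extension itself):

```typescript
// Replace every {{name}} placeholder with the matching value.
// Throws if a placeholder has no value, so a half-filled prompt never runs.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Missing value for placeholder "${key}"`);
    }
    return values[key];
  });
}

// "Email merge" style: one filled prompt per input row.
function mergeTemplate(template: string, rows: Record<string, string>[]): string[] {
  return rows.map((row) => fillTemplate(template, row));
}

// Example: the same saved macro prompt run against a list of websites.
const prompts = mergeTemplate("Open {{url}} and copy the pricing table", [
  { url: "https://example.com" },
  { url: "https://example.org" },
]);
```

Each entry in `prompts` would then be run as its own session, which keeps the context short compared with asking the LLM to loop over the whole list.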
I also started with a conversational mode and an interactive mode, but later removed the interactive mode to keep the feature set simple.
Does it send the content of the website to the LLM?
How is it “privacy-first” then if it literally sends all your shit to the LLM?
Also, the line is blurry for some people on “privacy” when it comes to LLMs. I think some people, not me, think that if you are talking directly to the LLM provider API then that’s “private” whereas talking to a service that talks to the LLM is not.
And, to be fair, some people use privacy/private/etc language for products that at least have the option of being private (Ollama).
Because it supports Ollama, which runs the LLM entirely locally on your own hardware, thus data sent to it never leaves your machine?
Edit: joshstrange beat me to the same conclusion by mere moments. :)
(At least, that's how I understand it - I have the feature turned off myself.)
Would love to explore a FF port. Right now, there are a couple of tight Chrome dependencies:
- CDP - mostly abstracted away by Playwright so perhaps not a big lift
- IndexedDB for storing memories and potentially other user data - not sure if there's a FF equivalent
It struggled with tasks I asked for (e.g. download the March and April invoices for my GitHub org "myorg") -- it got errors parsing the DOM and eventually gave up. I recommend taking a look at the browser-use approach and specifically their buildDOMTree.js script. Their strategy for turning the DOM into an LLM parsable list of interactive elements, and visually tagging them for vision models, is unreasonably effective. I don't know if they were the first to come up with it, but it's genius and extracting it for my browser-using agents has hugely increased their effectiveness.
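To give a feel for the "DOM to indexed list of interactive elements" idea, here is a toy sketch. It is not the browser-use buildDOMTree.js script and skips everything hard (visibility, shadow DOM, visual tagging). The `NodeLike` shape, the tag list, and the output format are all assumptions made for illustration.

```typescript
// Simplified stand-in for a DOM node.
interface NodeLike {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: NodeLike[];
}

// Assumed set of tags that count as interactive in this sketch.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

function isInteractive(node: NodeLike): boolean {
  return INTERACTIVE_TAGS.has(node.tag) || node.attrs?.["role"] === "button";
}

// Walk the tree depth-first and emit one numbered line per interactive
// element, so the LLM can answer with "click [1]" instead of a selector.
function listInteractive(root: NodeLike): string[] {
  const lines: string[] = [];
  const visit = (node: NodeLike): void => {
    if (isInteractive(node)) {
      const label = node.text ?? node.attrs?.["aria-label"] ?? "";
      lines.push(`[${lines.length}] <${node.tag}> ${label}`.trimEnd());
    }
    for (const child of node.children ?? []) {
      visit(child);
    }
  };
  visit(root);
  return lines;
}
```

The index doubles as the handle: the agent keeps a parallel array of the real elements and maps the number the model picks back to the element to click.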
The fact that Chrome and Gemini are, at least for now, owned by the same company raises huge privacy and consumer choice concerns for me though, and I see benefit in letting the user choose their model, where/how to store their data, etc.
Can't possibly do tool calling well enough to handle browser automation.
I read over the repo docs and was amazed at how clean and thorough it all looks. Can you share your development story for this project? How long did it take you to get here? How much did you lean on AI agents to write this?
Also, any plans for monetization? Are you taking donations? :)
I might write a short post on the development process, but in short:
- started development during Easter so roughly a month so far
- developed mostly using Cline and Claude 3.7
- inspired by and borrowed heavily from Cline, Playwright MCP, and Playwright CRX, which had already done a lot of the heavy lifting - in a sense this project is those 3 glued together
I don't plan to monetize it directly, but I've thought about an opt-in model for contributing useful memories to a central repository that other users might benefit from. My main aim with it is to promote open source AI tools.
The biggest issue is having the Ollama models hardcoded to Qwen3 and Llama 3.1. I imagine most Ollama users have their favorites, and probably vary quite a bit. My main model is usually Gemma 3 12B, which does support images.
It would be a nice feature to have a custom config on the Ollama settings page, save those to Chrome storage, and use that in the 'getAvailableModels' method, along with the hardcoded models.
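Something like this is what I have in mind for the merge step. It is only a sketch: the model tags and the `mergeModelLists` helper are my own placeholders, and I have not looked at how 'getAvailableModels' is actually written. The custom list would presumably be read from Chrome storage before calling it.

```typescript
// Assumed tags for the two hardcoded models mentioned above.
const BUILT_IN_MODELS: string[] = ["qwen3", "llama3.1"];

// Hypothetical helper: combine the hardcoded models with user-defined ones,
// trimming whitespace, dropping blank entries, and removing duplicates
// while keeping the built-in models first.
function mergeModelLists(builtIn: string[], custom: string[]): string[] {
  const cleaned = custom
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  return Array.from(new Set(builtIn.concat(cleaned)));
}

// Example: a user who added Gemma 3 12B and re-entered a built-in model.
const models = mergeModelLists(BUILT_IN_MODELS, [" gemma3:12b ", "qwen3", ""]);
```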
- The ones the user sees (like a sidepanel). These often use LLM APIs like OpenAI.
- The browser API ones. These are indeed local, but are often very limited smaller models (for Chrome this is Gemini Nano). Results from these would be lower quality, and of course with large contexts, either impossible or slower than using an API.
So most likely the LLM can choose how to "see" the page?
Some of the cheaper models have very similar performance at a fraction of the cost, or indeed you could use a local model for "free".
The core issue though is that there's just more tokens to process in a web browsing task than many other tasks we commonly use LLMs for, including coding.
I've been disappointed by the fact that Chrome doesn't have this. I don't want to give full access to my browsing to a random extension (no offense to this specific one, just general security hygiene - there are so many scammy extensions out there). Chrome (or the browser of your choice) already has that trust, good or bad. Please use the trust in a good way. It's table stakes at this point.