Hi HN, we built this. It's been running in production across 500+ websites.
We're a research group that studies online communications. We needed to scrape hundreds of sites regularly — news,
blogs, forums, policy orgs — and maintain all those scrapers. At 10 sites, individual scrapers were fine. At 200+,
we were spending more time fixing broken scrapers than doing actual research. Every redesign broke something; every
new site meant another scraper from scratch.
ScrapAI flips the cost model. You tell an AI agent "add bbc.co.uk to my news project." It analyzes the site, writes
URL patterns and extraction rules, tests on 5 pages, and saves a JSON config to a database. After that it's just
Scrapy — no AI in the loop, no per-page inference calls. ~$1-3 in tokens per website with Sonnet 4.5, not per page.
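To make that concrete, here's roughly what a saved config looks like — a simplified sketch, not the exact schema (field names and selectors here are illustrative):

```python
import json

# Illustrative site config -- simplified, not ScrapAI's actual schema.
config = {
    "site": "bbc.co.uk",
    "project": "news",
    "url_patterns": [r"https://www\.bbc\.co\.uk/news/.*"],
    "extract": {
        "title": "h1::text",                    # CSS selectors, as Scrapy uses
        "body": "article p::text",
        "published": "time::attr(datetime)",
    },
}

# Stored as JSON: it round-trips losslessly and can't carry executable code.
print(json.dumps(config, indent=2))
```

Once this is in the database, plain Scrapy does the crawling; the agent only runs again if the site changes.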
Cloudflare was the hardest part. Most tools keep a browser open for every request (~5-10s per page). We use
CloakBrowser (open source, C++ stealth patches, 0.9 reCAPTCHA v3 score) to solve the challenge once, cache the
cookies, kill the browser, and hit the site with normal HTTP. Re-solves every ~10 minutes. 1,000 pages in ~8
minutes vs 2+ hours.
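The caching loop itself is simple. A minimal sketch — the browser step is abstracted behind a `solver` callable, since CloakBrowser's exact API is beside the point here:

```python
import time
import urllib.request

COOKIE_TTL = 600  # seconds; re-solve roughly every 10 minutes


def cookies_stale(solved_at, now=None):
    """True when the cached Cloudflare cookies should be refreshed."""
    now = time.time() if now is None else now
    return now - solved_at > COOKIE_TTL


class CookieCache:
    def __init__(self, solver):
        # solver: url -> Cookie header string. In practice this launches
        # CloakBrowser, passes the challenge, grabs cookies, kills the browser.
        self.solver = solver
        self.cookie_header = None
        self.solved_at = 0.0

    def get_cookies(self, url):
        # Solve once, then reuse the cached cookies until they go stale.
        if self.cookie_header is None or cookies_stale(self.solved_at):
            self.cookie_header = self.solver(url)
            self.solved_at = time.time()
        return self.cookie_header

    def fetch(self, url):
        # Plain HTTP with cached cookies -- no browser in the loop.
        req = urllib.request.Request(url, headers={"Cookie": self.get_cookies(url)})
        return urllib.request.urlopen(req, timeout=30)
```

The real implementation sits inside a Scrapy middleware, but the shape is the same: one expensive browser solve amortized over hundreds of cheap HTTP requests.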
The agent writes JSON configs, not Python. An agent that writes and runs code can do anything an unsupervised
developer can — one prompt injection from a malicious page and you have a real problem. JSON goes through Pydantic
validation before it touches the database. Worst case is a bad config that extracts wrong fields. This also makes
it safe to use as a tool for Claws — structured web data without arbitrary code execution.
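The validation gate is just a Pydantic model sitting between the agent's output and the database. Sketched with an illustrative schema (the real one has more fields):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class SiteConfig(BaseModel):
    # Illustrative schema -- the project's real fields differ.
    site: str
    url_patterns: list[str]
    extract: dict[str, str]  # output field -> CSS selector


def validate_config(raw: dict) -> Optional[SiteConfig]:
    """Reject malformed agent output before it reaches the database.
    Past this gate the worst case is wrong selectors, never executable code."""
    try:
        return SiteConfig(**raw)
    except ValidationError:
        return None


good = {"site": "bbc.co.uk", "url_patterns": ["/news/.*"], "extract": {"title": "h1::text"}}
bad = {"site": "bbc.co.uk", "url_patterns": "not-a-list", "extract": {}}
```

Here `validate_config(good)` passes and `validate_config(bad)` is rejected: the agent can only ever hand us data that fits the schema.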
~4,000 lines of Python. Scrapy, SQLAlchemy, Alembic. Apache 2.0. We recommend Claude Code with Sonnet 4.5 but it
works with any agent that can read instructions and run shell commands. We tried GLM 4.7 and it performed
similarly, just slower.