Hi HN, we built this. It's been running in production across 500+ websites.
We're a research group that studies online communications. We needed to scrape hundreds of sites regularly — news,
blogs, forums, policy orgs — and maintain all those scrapers. At 10 sites, individual scrapers were fine. At 200+,
we were spending more time fixing broken scrapers than doing actual research. Every redesign broke something; every
new site meant another scraper from scratch.
ScrapAI flips the cost model. You tell an AI agent "add bbc.co.uk to my news project." It analyzes the site, writes
URL patterns and extraction rules, tests on 5 pages, and saves a JSON config to a database. After that it's just
Scrapy — no AI in the loop, no per-page inference calls. ~$1-3 in tokens per website with Sonnet 4.5, not per page.
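To make that concrete, here's roughly what a saved config looks like — a simplified sketch, not the exact schema (field names and selectors here are illustrative):

```python
import json

# Illustrative site config -- simplified, not ScrapAI's actual schema.
config = {
    "site": "bbc.co.uk",
    "project": "news",
    "url_patterns": [r"https://www\.bbc\.co\.uk/news/.*"],
    "extract": {
        "title": "h1::text",                    # CSS selectors, as Scrapy uses
        "body": "article p::text",
        "published": "time::attr(datetime)",
    },
}

# Stored as JSON: it round-trips losslessly and can't carry executable code.
print(json.dumps(config, indent=2))
```

Once this is in the database, plain Scrapy does the crawling; the agent only runs again if the site changes.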
Cloudflare was the hardest part. Most tools keep a browser open for every request (~5-10s per page). We use
CloakBrowser (open source, C++ stealth patches, 0.9 reCAPTCHA v3 score) to solve the challenge once, cache the
cookies, kill the browser, and hit the site with normal HTTP. Re-solves every ~10 minutes. 1,000 pages in ~8
minutes vs 2+ hours.
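The caching loop itself is simple. A minimal sketch — the browser step is abstracted behind a `solver` callable, since CloakBrowser's exact API is beside the point here:

```python
import time
import urllib.request

COOKIE_TTL = 600  # seconds; re-solve roughly every 10 minutes


def cookies_stale(solved_at, now=None):
    """True when the cached Cloudflare cookies should be refreshed."""
    now = time.time() if now is None else now
    return now - solved_at > COOKIE_TTL


class CookieCache:
    def __init__(self, solver):
        # solver: url -> Cookie header string. In practice this launches
        # CloakBrowser, passes the challenge, grabs cookies, kills the browser.
        self.solver = solver
        self.cookie_header = None
        self.solved_at = 0.0

    def get_cookies(self, url):
        # Solve once, then reuse the cached cookies until they go stale.
        if self.cookie_header is None or cookies_stale(self.solved_at):
            self.cookie_header = self.solver(url)
            self.solved_at = time.time()
        return self.cookie_header

    def fetch(self, url):
        # Plain HTTP with cached cookies -- no browser in the loop.
        req = urllib.request.Request(url, headers={"Cookie": self.get_cookies(url)})
        return urllib.request.urlopen(req, timeout=30)
```

The real implementation sits inside a Scrapy middleware, but the shape is the same: one expensive browser solve amortized over hundreds of cheap HTTP requests.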
The agent writes JSON configs, not Python. An agent that writes and runs code can do anything an unsupervised
developer can — one prompt injection from a malicious page and you have a real problem. JSON goes through Pydantic
validation before it touches the database. Worst case is a bad config that extracts wrong fields. This also makes
it safe to use as a tool for Claws — structured web data without arbitrary code execution.
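The validation gate is just a Pydantic model sitting between the agent's output and the database. Sketched with an illustrative schema (the real one has more fields):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class SiteConfig(BaseModel):
    # Illustrative schema -- the project's real fields differ.
    site: str
    url_patterns: list[str]
    extract: dict[str, str]  # output field -> CSS selector


def validate_config(raw: dict) -> Optional[SiteConfig]:
    """Reject malformed agent output before it reaches the database.
    Past this gate the worst case is wrong selectors, never executable code."""
    try:
        return SiteConfig(**raw)
    except ValidationError:
        return None


good = {"site": "bbc.co.uk", "url_patterns": ["/news/.*"], "extract": {"title": "h1::text"}}
bad = {"site": "bbc.co.uk", "url_patterns": "not-a-list", "extract": {}}
```

Here `validate_config(good)` passes and `validate_config(bad)` is rejected: the agent can only ever hand us data that fits the schema.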
~4,000 lines of Python. Scrapy, SQLAlchemy, Alembic. Apache 2.0. We recommend Claude Code with Sonnet 4.5 but it
works with any agent that can read instructions and run shell commands. We tried GLM 4.7 and it performed
similarly, just slower.