2 points by mehula 10 hours ago | 2 comments
  • mehula 10 hours ago
    Hey HN - I built this.

    I'm building infrastructure for AI agents and kept running into the same problem: before an agent fetches a URL, there's no easy way to know what's allowed. There are now 8 different standards - robots.txt, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, and others - all saying different things in different formats. No one checks all of them. Most agents check zero.
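    To show how little it takes to check even one of these signals: Python's stdlib already handles the robots.txt piece. A minimal sketch (the robots.txt content below is invented for illustration):

```python
import urllib.robotparser

# An invented robots.txt: blocks GPTBot site-wide, while the default
# group only carries a stale /admin/ rule.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

gptbot_allowed = rp.can_fetch("GPTBot", "https://example.com/article")      # False
claudebot_allowed = rp.can_fetch("ClaudeBot", "https://example.com/article")  # True
```

    That covers exactly one of the 8 signals - llms.txt, ai.txt, TDMRep, Content Signals and the rest each need their own fetch-and-parse step, which is the fragmentation this post is about.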

    So I decided to actually measure the problem. I crawled the Tranco top 1M domains over 10 days in February 2026, parsing every known AI policy signal. The crawl failure rate was 0.07% (697 domains out of 1M).

    What surprised me most:

    - 90% of domains have zero AI-specific signals. Not "they block everything" - they literally say nothing. Most robots.txt files just have generic /admin/ or /wp-login/ rules from a decade ago.

    - When sites DO block, it's almost always a blanket decision. 58,791 domains block both GPTBot and ClaudeBot. Only 9,888 block GPTBot alone. The "nuanced policy" that regulators imagine basically doesn't exist.

    - Cloudflare sites block AI at 2.3x the baseline rate. Not because their owners care more - because Cloudflare shipped a one-click toggle in July 2024. The tooling creates the behavior.

    - TDMRep adoption: 37 out of 1 million. That's the W3C protocol specifically designed for the EU Copyright Directive's TDM opt-out. Caveat: our detection covers the well-known path and HTTP headers, not HTML meta tags on subpages – actual adoption among European publishers is likely higher. We note this in the methodology.

    - The ToS gap is the finding I think matters most. We scanned 79K Terms of Service pages. 7,575 domains prohibit crawling or AI training in their ToS but have zero AI-specific robots.txt rules. YouTube, Discord, Substack, Target - an agent checking only robots.txt sees "no policy" while the site's legal terms explicitly say stop.

    - 6,317 domains contradict themselves across standards - e.g., blocking GPTBot in robots.txt but setting search=yes in Content Signals.
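    To make that contradiction concrete, here's a toy version of the cross-check. Two assumptions: Content Signals are expressed as a Content-Signal: line inside robots.txt (per Cloudflare's draft policy), and the parsing below is deliberately naive - not the official grammar for either standard.

```python
import re

# Invented robots.txt that opts in to search via Content Signals
# while blocking GPTBot outright - the self-contradiction described above.
ROBOTS = """\
Content-Signal: search=yes, ai-train=no

User-agent: GPTBot
Disallow: /
"""

def parse_signals(robots_txt: str) -> dict[str, str]:
    """Pull key=value pairs out of any Content-Signal: lines (naive)."""
    signals = {}
    for line in robots_txt.splitlines():
        m = re.match(r"(?i)content-signal:\s*(.*)", line.strip())
        if m:
            for pair in m.group(1).split(","):
                key, _, value = pair.strip().partition("=")
                signals[key.strip()] = value.strip()
    return signals

def blocks_agent(robots_txt: str, agent: str) -> bool:
    """True if robots.txt fully disallows the given user-agent (naive)."""
    current = None
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            current = value
        elif key == "disallow" and current == agent and value == "/":
            return True
    return False

signals = parse_signals(ROBOTS)
# GPTBot is blocked outright, yet search=yes says indexing is welcome.
contradiction = blocks_agent(ROBOTS, "GPTBot") and signals.get("search") == "yes"
```

    A real checker also has to decide which signal wins when they disagree, which is exactly the policy question none of the 8 standards answers.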

    This is the first public output from a project called Maango, which is building a registry and API to check any domain's AI policy across all 8 standards in one call. The report is free and the methodology is documented in full.

    Happy to answer questions about the data, methodology, or the agent compliance space generally.

  • throwawayffffas 10 hours ago
    I think most startups' policy is "We have professional indemnity insurance that covers our use of AI agents".
    • mehula 10 hours ago
      Insurance covers the lawsuit; it doesn't un-scrape the content. Haha. Better to stay compliant from day 1.