I'm not a fan of all the documentation and marketing content for this project evidently being AI-generated because I don't know which parts of it are the things you believe and designed for, and which are just LLM verbal diarrhea. For example, your GitHub threat model says this stops "AI training crawlers (GPTBot, ClaudeBot, CCBot, etc.)" - is this something you've actually confirmed, or just something that AI thinks is true? I don't know how their scrapers work; I'd assume they use headless browsers.
On the AI docs concern, fair point. To answer directly: I've confirmed the obfuscation defeats any scraper reading raw HTML via HTTP requests. Whether GPTBot or ClaudeBot use headless browsers internally, I honestly don't know. The README threat model lists headless browsers under "what it does NOT stop" for that reason.
Official OpenAI documentation: https://platform.openai.com/docs/gptbot
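(For what it's worth, that page also documents GPTBot's user-agent token and the robots.txt rule OpenAI says it honors:

```
User-agent: GPTBot
Disallow: /
```

which of course only helps against crawlers that choose to comply.)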
// Decode obfuscated text: collect elements tagged with data-o
// (their original position) and sort them back into order.
function decodeObscrd(htmlOrElement) {
  let root;
  if (typeof htmlOrElement === 'string') {
    // Raw HTML string: parse it into a detached document first.
    root = new DOMParser().parseFromString(htmlOrElement, 'text/html').body;
  } else {
    root = htmlOrElement || document;
  }
  const container = root.querySelector('[class*="obscrd-"]');
  if (!container) { return; }
  // Word-level elements carry a data-o index; restore their order.
  const words = [...container.children].filter(el => el.hasAttribute('data-o'));
  words.sort((a, b) => +a.dataset.o - +b.dataset.o);
  const result = words.map(word => {
    // Leaf character elements only (no nested data-o descendants).
    const chars = [...word.querySelectorAll('[data-o]')]
      .filter(el => el.querySelector('[data-o]') === null);
    chars.sort((a, b) => +a.dataset.o - +b.dataset.o);
    return chars.map(c => c.textContent).join('');
  }).join('');
  console.log(result);
  return result;
}

At this point, bots are better at getting data out of web pages than people are. (And have been so for at least a few years: https://www.usenix.org/conference/usenixsecurity23/presentat...)
All we're doing now is making it easier to get the data through a scraper than to browse to the web page ourselves.
- Today you don't have to be a dedicated/motivated reverse engineer; you just need Sonnet 4.6 and can let it do the work.
- You need to throw constant new gotchas at LLMs to keep them on their toes while they try to reverse engineer your website.
* Copy text
* Use a screen reader for accessibility purposes (not just on the web, but on mobile too; your 'light' obfuscation is entirely broken with TalkBack on Android: individual words/characters are read separately, and the text is not a single block)
* Use an RSS feed
* Use reader mode in their browser
If you don't want your stuff to be read, and that includes bots, don't put it online.
> Built this because I got tired of AI crawlers reading my HTML in plain text while robots.txt did nothing.
You could have spent that time working on your project, instead of actively making the web worse than it already is.
On the broader point, I hear you, but I think there's a middle ground. Not all content is public knowledge. Some of it is premium, proprietary, or behind a paywall. The people publishing it should get to decide whether it becomes free training data.
I don't follow. Are you suggesting that someone is scraping private sites that they have to log in on in order to train AI on it?
It breaks copy/paste and screen readers, but so does your idea.
Same result: screen readers and assistive software are rendered useless. It's basically a sign that says "I hate disabled people, and AI too".
Happy to have a11y experts poke at it and point out gaps.
It is better for a million AI crawlers to get through than for even one search index crawler (one that might expose the knowledge on your site to someone who needs it) to be denied.