If you've been in any big company, you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit, so now there's an acceptable level of brokenness.
Surely it's more likely that it's just cheaper to pay for the errors than to pay to fix the errors.
Why fix $10k worth of errors if the fix will cost me $100k?
Add some % if the person who gets more work from the problem isn't the same person who has to fix it. People will happily leave things in a broken state if no one calls them out on it.
Keep in mind that "fixing things" is essentially a Sisyphean task - no matter how much you do there's always more you can do. Just like adding features. You have to have some kind of guideline on when enough is enough.
I've been monitoring server logs across ~150 sites and the pattern is striking: AI crawler traffic increased roughly 8x in the last 12 months, but most site owners have no idea because it doesn't show up in analytics. The bots read everything, respect robots.txt maybe 60% of the time, and the content they index directly shapes what ChatGPT or Perplexity recommends to users.
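For anyone who wants to check their own logs, here's a minimal sketch, assuming combined-format Apache/nginx access logs; the UA substrings are my assumptions, so extend the list with whatever actually shows up in your traffic:

    # Tally requests from AI-crawler user agents in a combined-format
    # access log. The UA substrings are assumptions -- extend with
    # whatever actually shows up in your own logs.
    import re
    from collections import Counter

    AI_UAS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot",
              "Bytespider", "Amazonbot", "Applebot"]

    # In the combined log format the user agent is the last quoted field.
    UA_RE = re.compile(r'"([^"]*)"\s*$')

    counts = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = UA_RE.search(line)
            if not m:
                continue
            for bot in AI_UAS:
                if bot in m.group(1):
                    counts[bot] += 1
                    break

    for bot, n in counts.most_common():
        print(f"{bot:15} {n}")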
The irony is that robots.txt was designed for a world where crawling meant indexing for search results. Now crawling means training data and real-time retrieval for AI answers. Completely different power dynamic and most robots.txt files haven't adapted.
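Adapting mostly just means naming the new crawlers explicitly. A sketch using the UA tokens the big vendors publish (worth verifying against each vendor's docs, and per the numbers above, compliance is far from guaranteed):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /

    User-agent: CCBot
    Disallow: /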
"By accessing this file more than one time per second you agree to pay a fee of $0.1 per access plus an additional $0.1 for each previous access each day. This fee will be charged on a per access basis."
2. Run a program that logs the number of Facebook requests and prints a summary and bill (a sketch follows this list).
3. Then get a stamp and an envelope, write out a bill for the first day, call it a demand for payment, and send it to:
Facebook, Inc.
Attn: Security Department/Custodian of Records
1601 S. California Avenue
Palo Alto, CA 94304
U.S.A.
You can optionally send this by registered mail, where someone has to sign for it.
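Here's a sketch of step 2, assuming combined-format access logs and one possible reading of the quoted terms: each access costs $0.10 plus $0.10 for every access that preceded it that day, with the once-per-second trigger assumed tripped. facebookexternalhit is the UA token Facebook's link-preview fetcher sends.

    # Count facebookexternalhit requests per day and price them per one
    # reading of the robots.txt terms above. Not legal advice, obviously.
    import re
    from collections import Counter

    FB_UA = "facebookexternalhit"
    DATE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4})")  # [10/Feb/2025:... in combined logs

    per_day = Counter()
    with open("access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            if FB_UA in line:
                m = DATE_RE.search(line)
                if m:
                    per_day[m.group(1)] += 1

    total = 0.0
    for day in sorted(per_day):
        n = per_day[day]
        # the i-th access of the day costs 0.10 + 0.10 * (i - 1), so a
        # day with n accesses totals 0.10*n + 0.10*n*(n-1)/2
        fee = 0.10 * n + 0.10 * n * (n - 1) / 2
        total += fee
        print(f"{day}: {n} accesses -> ${fee:,.2f}")
    print(f"Total demanded: ${total:,.2f}")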
Corporations such as Facebook are used to getting their way in court because they can afford lawyers and you cannot. So they have gotten lazy and do not worry about what is fair or legal.
So take them to court when you have a legitimate legal issue. The courts are there to provide redress when you are aggrieved, right? Use the courts. You can file a small-claims action easily. Just make sure you 1) have a legitimate case, 2) have evidence, and 3) have sent them a demand for payment.
Do you think there's a contract created by your robots.txt comment?
https://refspecs.linuxbase.org/LSB_3.0.0/LSB-Core-generic/LS...
Rich previews are known to cause higher clickthroughs than non-rich previews (if you care about that).
Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and calculating a proper retry interval? Search engine crawlers 25-30 years ago could do this.
Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
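A sketch of that policy (content hash plus geometric backoff; the bounds are just the numbers from this comment):

    # Adaptive recrawl interval: halve the wait when a page changed since
    # the last visit, double it when it didn't.
    import hashlib

    MIN_HOURS = 12   # never more often than twice a day
    MAX_HOURS = 84   # never less often than twice a week

    def next_interval(hours: float, old_hash: str, body: bytes):
        new_hash = hashlib.sha256(body).hexdigest()
        if new_hash != old_hash:
            hours = max(MIN_HOURS, hours / 2)   # page changed: come back sooner
        else:
            hours = min(MAX_HOURS, hours * 2)   # unchanged: back off
        return hours, new_hash

And a conditional GET (If-None-Match / If-Modified-Since) makes the unchanged case nearly free on both ends; that's been in HTTP since the 90s.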
And it's quite a trivial feature at that.
How does one learn these skills? I can see them being useful in the future.
Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Millions upon millions of requests, hundreds of GBs of bandwidth. Thankfully we're using Cloudflare, so we could block all of them except real search engine crawlers, and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit.
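For anyone wanting to replicate this: in Cloudflare terms it's roughly a custom rule with action Block and an expression like the one below. cf.client.bot is Cloudflare's flag for verified crawlers (Googlebot, Bingbot, etc.); the UA substrings are my assumptions, not necessarily what we actually used.

    (http.user_agent contains "GPTBot"
      or http.user_agent contains "ClaudeBot"
      or http.user_agent contains "CCBot"
      or http.user_agent contains "Bytespider")
    and not cf.client.bot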
From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.
Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.
Ugh, such a weird design. At least in my experience, you are better off setting Apache to always run the same number of instances and tuning that number as appropriate, rather than having the instance count fluctuate under load.
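For reference, pinning the pool size under mpm_prefork is just a handful of directives; a sketch, with 50 as a placeholder you'd tune to your RAM and per-process footprint:

    # mpm_prefork: keep a fixed pool of workers instead of letting
    # Apache fork more under load. 50 is a placeholder value.
    StartServers         50
    MinSpareServers      50
    MaxSpareServers      50
    MaxRequestWorkers    50
    ServerLimit          50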
Git content is likely to have code for the bot to train on.