[1]: https://www.ty-penguin.org.uk/robots.txt
[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.
So? What duty do web site operators have to be "nice" to people scraping their websites?
That being said, the author is perhaps concerned by the growing volume of scraping, which takes a toll on his server, and thus chose to simply penalize them all.
(big green blob)
"My cat playing with his new catnip ball".
(blue mess of an image)
"Robins nesting"
Or is the internet so full of garbage nowadays that it is necessary to do that on every page?
I was very excited 20 years ago every time I got an email from them saying that the scripts and donated MX records on my website had helped catch a harvester
> Regardless of how the rest of your day goes, here's something to be happy about -- today one of your donated MXs helped to identify a previously unknown email harvester (IP: 172.180.164.102). The harvester was caught by a spam trap email address created with your donated MX:
They have the facebookexternalhit bot (it sometimes uses the default Python requests user agent) that, as they document, explicitly ignores robots.txt.
It's (as they say) used to check whether links contain malware. But if someone wanted to serve malware, the first thing they would do is serve an innocent page to Facebook's AS and their user agent.
They also re-check every URL every month to verify that it still doesn't contain malware.
The issue is as follows: some bad actors spam Facebook with URLs to expensive endpoints (like a search with random filters), and Facebook then provides a free DDoS service for your competition. They flood you with >10 r/s for days, every month.
That barely registers as a blip even if you're hosting your site on a single server.
This forced us to scale up. When handling such a bot costs more than serving the rest of the users and bots, that's an issue, especially for our customers with smaller traffic.
The request rate varied from site to site, but it ranged from half to 75% of total traffic and was basically saturating many servers for days if not blocked.
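If you want to see whether you're being hit like this, here is a rough sketch (assuming a standard nginx/Apache combined-format access log; the path and the 10 r/s threshold are just illustrative) that counts facebookexternalhit requests per second:

```python
from collections import Counter

# Hypothetical path; assumes the common "combined" log format:
#   ip - - [10/Mar/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "UA"
LOG_PATH = "/var/log/nginx/access.log"

hits_per_second = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        parts = line.split('"')
        if len(parts) < 6:
            continue
        user_agent = parts[5]
        if "facebookexternalhit" not in user_agent:
            continue
        # Timestamp sits between '[' and the space before the timezone offset.
        prefix = parts[0]
        second = prefix[prefix.find("[") + 1:].split()[0]  # e.g. 10/Mar/2025:13:55:36
        hits_per_second[second] += 1

if hits_per_second:
    peak_second, peak = hits_per_second.most_common(1)[0]
    print(f"peak: {peak} req/s at {peak_second}")
    print("seconds over 10 req/s:",
          sum(1 for c in hits_per_second.values() if c > 10))
else:
    print("no facebookexternalhit requests found")
```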
A refreshing (and amusing) attitude versus getting angry and venting on forums about aggressive crawlers.
https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
Some kind of statement piece
Firefox: Press F12, go to Network, click No Throttling > change it to GPRS
Chromium: Press F12, go to Network, click No Throttling > Custom > Add Profile > Set it to 20kbps and set the profile
https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
Terry Pratchett has one I'd like to think he'd approve of. Just a shame I'm unable to see the 8th colour, I'm sure it's in there somewhere.
https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...
Just because traffic is coming from thousands of devices on residential IPs doesn't mean it's a botnet in the classical sense. It could just as well be people signing up for a "free VPN service" — or a tool that "generates passive income" for them — where the actual cost of running the software is that you become an exit node for both other "free VPN service" users' traffic, and the traffic of users of the VPN's sibling commercial brand. (E.g. scrapers like this one.)
This scheme is known as "proxyware" — see https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...
The easiest way to deal with them is just to block them regardless, because the probability that someone who knows what to do about this software and why it's bad will be reading any particularly botnetted website is close to zero.
Proxyware is more like a crypto miner — the original kind, from back when crypto-mining was something a regular computer could feasibly do with pure CPU power. It's something users intentionally install and run and even maintain, because they see it as providing them some potential amount of value. Not a bot; just a P2P network client.
Compare/contrast: https://en.wikipedia.org/wiki/Winny / https://en.wikipedia.org/wiki/Share_(P2P) / https://en.wikipedia.org/wiki/Perfect_Dark_(P2P) — pieces of software which offer users a similar devil's bargain, but instead of "you get a VPN; we get to use your computer as a VPN", it's "you get to pirate things; we get to use your hard drive as a cache node in our distributed, encrypted-and-striped pirated media cache."
(And both of these are different still to something like BitTorrent, where the user only ever seeds what they themselves have previously leeched — which is much less questionable in terms of what sort of activity you're agreeing to play host to.)
Related: https://en.wikipedia.org/wiki/JPEG#Syntax_and_structure
That said, these seem to be heavily biased towards displaying green, so one “sanity” check would be: if your bot is suddenly scraping thousands of green images, something might be up.
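A crude version of that check, as a sketch (assuming the fetched files sit in a hypothetical scraped_images directory and are readable by Pillow; the channel-ratio and "mostly green" thresholds are arbitrary illustrations, not tuned values):

```python
from pathlib import Path
from PIL import Image, ImageStat

def looks_green(path: Path, ratio: float = 1.2) -> bool:
    """Heuristic: mean green channel noticeably above both red and blue."""
    with Image.open(path) as im:
        r, g, b = ImageStat.Stat(im.convert("RGB")).mean
    return g > r * ratio and g > b * ratio

scraped = Path("scraped_images")          # hypothetical directory of fetched images
files = list(scraped.glob("*.jpg"))
green = sum(looks_green(p) for p in files)
if files and green / len(files) > 0.5:    # arbitrary "something is up" threshold
    print(f"{green}/{len(files)} images are mostly green -- probably a spigot")
```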
On a different note, if the goal is to waste resources for the bot, one potential improvement could be to use very large images with repeating structure that compress extremely well as JPEGs for the templates, so that it takes more RAM and CPU to decode them, with relatively little CPU and RAM required to generate them and bandwidth to transfer them.
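As a rough illustration of that asymmetry (not how the spigot itself works), here is a sketch with Pillow that builds a large flat-colour image, saves it as a JPEG, and compares the on-the-wire size to the decoded size. The dimensions are arbitrary, and note this sketch still pays the full RAM cost once at generation time; only the decoder-side cost per fetch is what the comment is after.

```python
import io
from PIL import Image

# A large, flat-colour image compresses to a tiny JPEG but costs the decoder
# the full uncompressed buffer. Dimensions here are purely illustrative.
W, H = 8000, 8000                       # 64 MP ~= 183 MiB decoded as RGB
img = Image.new("RGB", (W, H), (0, 128, 0))

buf = io.BytesIO()
img.save(buf, format="JPEG", quality=50)
compressed = buf.tell()
decoded = W * H * 3

print(f"JPEG on the wire : {compressed / 1024:.0f} KiB")
print(f"decoded in RAM   : {decoded / 1024 / 1024:.0f} MiB")
print(f"ratio            : {decoded / compressed:.0f}:1")
```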
PNG doesn't have a meaningful size limitation on the image dimensions (the width and height fields are 4 bytes each). So I bet you can break at least one scraper bot with that.
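A sketch of that idea, hand-assembling a minimal PNG whose IHDR declares the maximum 2^31 - 1 dimensions with only a token IDAT behind it. Whether any given scraper actually chokes depends entirely on its decoder; libpng, for instance, enforces its own limits.

```python
import struct
import zlib

def chunk(ctype: bytes, data: bytes) -> bytes:
    """Assemble one PNG chunk: length, type, data, CRC over type+data."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

# IHDR declaring absurd dimensions (width/height are 4-byte big-endian fields,
# capped at 2**31 - 1 by the spec). 8-bit greyscale, no interlace.
width = height = 2**31 - 1
ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)

# A token IDAT: one compressed filter byte. The pixel data is wildly short,
# but a decoder that allocates its buffers from the header alone has already
# committed to width * height bytes before it finds that out.
idat = zlib.compress(b"\x00")

png = (b"\x89PNG\r\n\x1a\n"
       + chunk(b"IHDR", ihdr)
       + chunk(b"IDAT", idat)
       + chunk(b"IEND", b""))

with open("oversized.png", "wb") as fh:
    fh.write(png)
print(f"{len(png)} bytes on disk, {width} x {height} px declared")
```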
ZIP bombs rely on recursion or overlapping entries to achieve higher ratios, but the PNG format is too simple to allow such tricks (at least in the usual critical chunks that all decoders are required to support).
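You can see that ceiling directly: a single DEFLATE stream, which is all PNG's IDAT gives you, tops out at roughly 1032:1, because each back-reference reproduces at most 258 bytes. A quick sketch:

```python
import zlib

# Without the nested or overlapping entries a ZIP bomb uses, plain DEFLATE
# (what PNG's IDAT contains) caps out at roughly 1032:1.
raw = b"\x00" * (100 * 1024 * 1024)          # 100 MiB of zeros
packed = zlib.compress(raw, level=9)
print(f"{len(raw)} -> {len(packed)} bytes ({len(raw) / len(packed):.0f}:1)")
```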
Does it? Encryption increases entropy, but not sure about compression.
Total information entropy - no. The amount of information conveyed remains the same.
I don't think JPEG data is compressed enough to be indistinguishable from random.
SD VAE with some bits lopped off gets you better compression than JPEG and yet the latents don't "look" random at all.
So you might think Huffman encoded JPEG coefficients "look" random when visualized as an image but that's only because they're not intended to be visualized that way.
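One crude way to poke at that question is byte-level Shannon entropy. It's far from a real randomness test (Huffman-coded JPEG scan data also carries structure, such as 0xFF 0x00 byte stuffing, that this won't see), but the numbers do sit measurably below 8 bits/byte. A sketch, with photo.jpg as a hypothetical sample file:

```python
import math
import os
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (8.0 would match a uniform source)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

with open("photo.jpg", "rb") as fh:      # hypothetical sample file
    jpeg = fh.read()

print(f"JPEG file : {byte_entropy(jpeg):.3f} bits/byte")
print(f"os.urandom: {byte_entropy(os.urandom(len(jpeg))):.3f} bits/byte")
```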