There's also Python bindings for the fork for anyone who uses Python: https://github.com/lexiforest/curl_cffi
Clean for data ingestion usually means complicated for data creation - optimizing for the advertisers has material cash value downstream, but customers are upstream, and making it harder is material too.
If what I've seen from Cloudflare et al. is any indication, it's the exact opposite: the amount of fingerprinting and "exploitation" of implementation-defined behaviour has increased significantly in the past few months, likely in an attempt to kill off other browser engines; the incumbents do not like competition at all.
The enemy has been trying to spin it as "AI bots DDoSing" but one wonders how much of that was their own doing...
No, they're discussing increased fingerprinting / browser profiling recently and how it affects low-market-share browsers.
> The enemy has been trying to spin it as "AI bots DDoSing" but one wonders how much of that was their own doing...
I'm reading that as `enemy == fingerprinters`, `that == AI bots DDoSing`, and `their own == webmasters, hosting providers, and CDNs (i.e., the fingerprinters)`, which sounds pretty straightforwardly like the fingerprinters are responsible for the DDoSing they're receiving.
That interpretation doesn't seem to match the rest of the post though. Do you happen to have a better one?
This is entirely a web crawler 2.0 apocalypse.
I love this curl, but I worry that if a component takes on the role of deception in order to "keep up" it accumulates a legacy of hard to maintain "compatibility" baggage.
Ideally it should just say... "hey I'm curl, let me in"
The problem of course lies with a server that is picky about dress codes, and that problem in turn is caused by crooks sneaking in disguise, so it's rather a circular chicken and egg thing.
What? Ideally it should just say "GET /path/to/page".
Sending a user agent is a bad idea. That shouldn't be happening at all, from any source.
> 10.15 User-Agent
>
> The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations.
That's already the case. The trouble is that NSS (which Firefox uses) doesn't support the same cipher suites as BoringSSL (which Chrome uses).
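That cipher-suite difference is exactly what JA3-style TLS fingerprints capture. A minimal sketch of the idea (the field values below are made up for illustration; a real fingerprinter parses them out of the ClientHello):

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Hash ClientHello parameters the way JA3 does: join each field's
    values with '-', join the fields with ',', then take the MD5."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Hypothetical cipher lists: same suites, different order.
a = ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10], [29, 23], [0])
b = ja3_fingerprint(771, [4866, 4865, 4867], [0, 11, 10], [29, 23], [0])
# Merely reordering the suites changes the hash, which is why two clients
# with identical capabilities can still be told apart.
```

This also shows why a fingerprinter can simply stop being order-sensitive (sort the lists before hashing) when a browser starts randomizing its cipher order.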
(AIUI Google’s Play Store is one of the biggest TLS fingerprinting culprits.)
The companies to blame here are solely the ones employing these fingerprinting techniques, and those relying on services of these companies (which is a worryingly large chunk of the web). For example, after the Chrome change, Cloudflare just switched to a fingerprinter that doesn't check the order.[1]
Sure. And it's a tragedy. But when you look at the bot situation and the sheer magnitude of resource abuse out there, you have to see it from the other side.
FWIW, in the conversation mentioned above, we acknowledged that and moved on to talking about behavioural fingerprinting and why it makes sense not to focus on the browser/agent alone but on what gets done with it.
Let's not go blaming vulnerabilities on those exploiting them. Exploitation is also bad but being exploitable is a problem in and of itself.
Add to this that the minute you use a signal for detection, you “burn” it as adversaries will avoid using it, and you lose measurement thus the ability to know if you are fixing the problem at all.
I worked on this kind of problem for a FAANG service; whoever claims it's easy has clearly never had to deal with motivated adversaries.
There's "vulnerabilities" and there's "inherent properties of a complex protocol that is used to transfer data securely". One of the latter is that metadata may differ from client to client for various reasons, inside the bounds accepted in the standard. If you discriminate based on such metadata, you have effectively invented a new proprietary protocol that certain existing browsers just so happen to implement.
It's like the UA string, but instead of just copying a single HTTP header, new browsers now have to reverse engineer the network stack of existing ones to get an identical user experience.
It isn't necessarily a critical vulnerability. But it is a problem on some level nonetheless. To the extent possible you should not be leaking information that you did not intend to share.
A protocol that can be fingerprinted is similar to a water pipe with a pinhole leak. It still works, it isn't (necessarily) catastrophic, but it definitely would be better if it wasn't leaking.
If not, then fingerprinting could still be done to some extent at the IP layer. By default, modern Windows sets the initial TTL of outgoing packets to 128, while most other platforms start at 64. Since those other platforms have no trouble communicating over the internet, essentially everything is reachable in under 64 hops, so IP packets from modern Windows will always arrive at the remote end with a TTL at or above 64 (likely just above). If the observed TTL is below 64, it is obvious the sender is either not running modern Windows or is a modern Windows machine that has had its default TTL changed.
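The heuristic above fits in a few lines; a sketch, assuming the well-known default initial TTLs (64 for most Unix-likes, 128 for modern Windows, 255 for some network gear) and that no middlebox rewrote the TTL in transit:

```python
def infer_initial_ttl(observed_ttl):
    """Guess the sender's initial TTL: the smallest common default
    (64, 128, 255) that is >= the TTL observed on arrival."""
    for initial in (64, 128, 255):
        if observed_ttl <= initial:
            return initial, initial - observed_ttl  # (initial TTL, est. hops)
    raise ValueError("invalid TTL")

# A packet arriving with TTL 57 most likely started at 64, seven hops away,
# i.e. probably not a default-configured modern Windows sender.
print(infer_initial_ttl(57))   # (64, 7)
print(infer_initial_ttl(113))  # (128, 15)
```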
That said, it would be difficult to fingerprint at the IP layer, although it is not impossible.
Only if you're using PaaS/IaaS providers that don't give you low-level access to the TCP/IP stack. If you're running your own servers, it's trivial to fingerprint all manner of TCP/IP properties.
If everywhere is reachable in under 64 hops, then packets sent from systems that use a TTL of 128 will arrive at the destination with a TTL still over 64 (or else they'd have been discarded for all the other systems already).
If you count up from zero, then you'd also have to include in every packet how high it can go, so that a router has enough info to decide if the packet is still live. Otherwise every connection in the network would have to share the same fixed TTL, or obey the TTL set in whatever random routers it goes through. If you count down, you're always checking against zero.
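A toy simulation of why the decrement-to-zero design keeps routers simple: each hop only subtracts one and compares against zero, with no need to know what value the packet started from:

```python
def traverse(initial_ttl, path_length):
    """Forward a packet across path_length routers. Each router
    decrements the TTL and drops the packet when it reaches zero."""
    ttl = initial_ttl
    for _hop in range(path_length):
        ttl -= 1
        if ttl == 0:
            return None  # dropped in transit (would trigger ICMP Time Exceeded)
    return ttl  # TTL remaining on arrival

print(traverse(64, 10))  # arrives with TTL 54
print(traverse(3, 10))   # None: dropped at the third router
```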
Based on the fact that they are requesting the same absolutely useless and duplicative pages (like every possible combination of query params, even when it does not lead to unique content) from me hundreds of times per URL, and are able to distribute so much that I'm only getting 1-5 requests per day from each IP...
...cost does not seem to be a concern for them? Maybe they won't actually mind ~5 seconds of CPU on a proof of work either? They are really a mystery to me.
I'm currently using Cloudflare Turnstile, which incorporates proof of work along with various other signals. It's working, but I know it has false positives. I'm working on implementing a simpler JS-only proof of work (SHA-512-based), and am going to switch that in; if it works, great (because I don't want to keep out the false positives!), but if it doesn't, back to Turnstile.
The mystery distributed idiot bots were too much. (Scaling up resources -- they just scaled up their bot rates too!!!) I don't mind people scraping if they do it respectfully and reasonably; that's not what's been going on, and it's an internet-wide phenomenon of the past year.
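A minimal sketch of the kind of SHA-512 proof of work described above (the difficulty parameter and challenge format here are made up; a real deployment would bind the challenge to the visitor and expire it):

```python
import hashlib
from itertools import count

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Find a nonce such that SHA-512(challenge || nonce) has
    difficulty_bits leading zero bits."""
    target = 1 << (512 - difficulty_bits)
    for nonce in count():
        digest = hashlib.sha512(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server-side check: one hash, regardless of how hard solving was."""
    digest = hashlib.sha512(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (512 - difficulty_bits))

nonce = solve(b"per-visitor-challenge", 12)  # ~2**12 hashes on average
assert verify(b"per-visitor-challenge", nonce, 12)
```

The asymmetry (expensive to solve, one hash to verify) is the whole point; whether ~5 seconds of client CPU actually deters a distributed scraper that shrugs off bandwidth costs is, as noted above, an open question.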
Writing a browser is hard, and the incumbents are continually making it harder.
Doesn't get more fingerprintable than that. They provide an un-falsifiable certificate that "the current browser is an unmodified Chrome build, running on an unmodified Android phone with secure boot".
If they didn't want to be fingerprintable, they could just not do that and spend all that engineering time and money on something else.
[1]: https://en.wikipedia.org/wiki/Web_Environment_Integrity
Was having issues getting a module to download an installer from a vendor's site.
Played with curl/wget, but was running into the same problem, while it worked from a browser.
I ended up getting both curl and get_url to work by passing the same headers my browser sent, such as User-Agent, encoding, etc.
Examples: [missing]
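The same trick in Python terms, for anyone who wants the idea without the exact commands (the header values below are illustrative, not a known-good set; copy the real ones out of your browser's dev tools):

```python
import urllib.request

# Browser-like headers; values here are examples only.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
}

# Hypothetical vendor URL, for illustration.
req = urllib.request.Request("https://vendor.example.com/installer.exe",
                             headers=headers)
# urllib normalizes header names to Capitalized-first form internally.
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would then perform the actual download.
```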
They still want to install a bunch of programs on your computer, though. It's more or less the same stuff that used to be written as ActiveX extensions, but rewritten using modern browser APIs. :(
I, too, am saddened by this gatekeeping. IIUC, custom browsers (or user agents) written from scratch will never work on Cloudflare sites and the like until the UA has enough clout (money, users, etc.) to sway them.
There's too much lost revenue in open things for companies to embrace fully open technology anymore.
One may posit "maybe these projects should cache stuff so page loads aren't actually expensive" but these things are best-effort and not the core focus of these projects. You install some Git forge or Trac or something and it's Good Enough for your contributors to get work done. But you have to block the LLM bots because they ignore robots.txt and naively ask for the same expensive-to-render page over and over again.
The commercial impact is also not to be understated. I remember when I worked for a startup with a cloud service. It got talked about here, and suddenly every free-for-open-source CI provider IP range was signing up for free trials in a tight loop. These mechanical users had to be blocked. It made me sad, but we wanted people to use our product, not mine crypto ;)
I love that human, what a gem
If you thought the Cloudflare challenge could be bad, Imperva doesn't even want most humans through.
It’d be nice if something could support curl’s arguments but drive an actual headless chrome browser.
[0] https://github.com/explainers-by-googlers/Web-Environment-In...
An HTTP client sends a request. The server sends a response. The request and response are made of bytes. Any bytes Chrome can send, curl-impersonate could also send.
Chromium is open source. If there was some super secret handshake, anyone could copy that code to curl-impersonate. And if it's only in closed-source Chrome, someone will disassemble it and copy it over anyway.
Not if the "super secret handshake" is based on hardware-backed attestation.
GP claims the API can detect the official chrome browser, and the official chrome browser runs fine without attestation.
Not if Chrome uses homomorphic encryption to sign a challenge. It's doable today. But then you could run a real Chrome and forward the request to it.
It doesn't matter how complicated the operation is, if you have a copy of the Chrome binary, you can observe what CPU instructions it uses to sign the challenge, and replicate the operations yourself. Proxying to a real Chrome is the most blunt approach, but there's nothing stopping you from disassembling the binary and copying the code to run in your own process, independent of Chrome.
No you can't; that's the whole thing with homomorphic encryption. Ask GPT to explain why that's so.
You have no way of knowing the bounds of the code I will access from inside the homomorphically encrypted code. Depending on the challenge, I can query parts of the binary and hash that into the response. So you would need to replicate the whole binary.
Similar techniques are already used today by various copy-protection/anti-cheat game protectors. Most of them remain unbroken.
Homomorphic encryption hides data, not computation. If you've been trying to learn compsci from GPT, you might have fallen victim to hallucinations. I'd recommend starting from wikipedia instead. https://en.wikipedia.org/wiki/Homomorphic_encryption
And btw most games are cracked within a week of release. You have way too much faith in buzzwords and way too little faith in bored Eastern European teenagers.
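For what it's worth, the "hides data, not computation" distinction shows up even in a toy example: textbook RSA is multiplicatively homomorphic, so a third party can multiply two ciphertexts without the key, yet the operation being performed is completely public. (Tiny textbook parameters below, utterly insecure, illustration only.)

```python
# Classic textbook RSA parameters: p=61, q=53, so n=3233, e=17, d=2753.
n, e, d = 3233, 17, 2753

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 7, 11
# Anyone can combine the ciphertexts without knowing the private key d...
combined = (encrypt(a) * encrypt(b)) % n
# ...and only the holder of d recovers the product of the plaintexts.
assert decrypt(combined) == (a * b) % n  # 77
# The *data* (7 and 11) stayed hidden, but the *computation*
# (a multiplication) was visible to everyone involved.
```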
Data is computation.
    x = challenge_byte ^ secret_key
    if x > 64:
        y = hash_memory_range()
    else:
        y = something_else()
    return sign(y, secret_key)
You seem convinced that homomorphic encryption is some kind of magic that prevents someone from observing their own hardware, or from running Chrome under a debugger. That's just not true. And I suspect we don't share enough of a common vocabulary to have a productive discussion, so I'll end it here.
You can't use the result of that computation without first decrypting it though. And you can't decrypt it without the key. So what you describe regarding memory addresses is merely garden variety obfuscation.
Unmasking an obfuscated set of allowable address ranges for hashing given an arbitrary binary is certainly a difficult problem. However as you point out it is easily sidestepped.
You are also mistaken about anti-cheat measures. The ones that pose the most difficulty primarily rely on kernel mode drivers. Even then, without hardware attestation it's "just" an obfuscation effort that raises the bar to make breaking it more time consuming.
What you're actually witnessing there is that if a sufficient amount of effort is invested in obfuscation and those efforts carried out continuously in order to regularly change the obfuscation then you can outstrip the ability of the other party to keep up with you.
EDIT: this is the closest I could find. https://developers.google.com/chrome/verified-access/overvie... ...but it's not generic enough to lead me to the declaration you made.
> Some web services use the TLS and HTTP handshakes to fingerprint which client is accessing them, and then present different content for different clients. These methods are known as TLS fingerprinting and HTTP/2 fingerprinting respectively. Their widespread use has led to the web becoming less open, less private and much more restrictive towards specific web clients.
>
> With the modified curl in this repository, the TLS and HTTP handshakes look exactly like those of a real browser.
For example, this will get you past Cloudflare's bot detection.
And “unclean fork” is such an unnecessary and unprofessional comment.
There’s an entire industry of stealth browser technologies out there that this falls under.