Large-scale online deanonymization with LLMs (including HN users) (arxiv.org)

3 points by salkahfi 6 hours ago | 1 comment

SilverElfin 3 hours ago
> We collect 987 LinkedIn profiles linked to 995 Hacker News (HN) accounts (ground truth is established by users who posted their LinkedIn URL in their HN bio), drawn from a candidate pool of approximately 89,000 active HN users. Eight LinkedIn profiles are linked to multiple HN accounts that shared the same LinkedIn URL. We identified four additional HN alt accounts using strong evidence such as matching names and companies. We count a match as correct if any of the linked HN accounts is returned. Every query has a true match in the candidate set. The LinkedIn side represents the known identity with real professional profiles. The HN side serves as the anonymized target: as in Section 2, we remove names, URLs, and other direct identifiers from bios using an LLM to prevent trivial matching (see Appendix A for our complete anonymization procedure). The task is to match a LinkedIn profile with the corresponding LLM-anonymized HN account.
Seems like a smart way to conduct this study. But the implications are scary. Maybe platforms should automatically do things that help anonymize.
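For readers who want the scoring rule from the quote spelled out, here is a minimal sketch of "a match is correct if any of the linked HN accounts is returned". The account names, the `predictions` and `ground_truth` dictionaries, and the top-k framing are illustrative assumptions, not the paper's code or data.

```python
def is_hit(returned, linked_accounts, k=1):
    """True if any linked HN account appears among the top-k returned candidates."""
    return any(acct in linked_accounts for acct in returned[:k])


def top_k_accuracy(predictions, ground_truth, k=1):
    """Fraction of LinkedIn queries whose top-k candidates contain a linked account."""
    hits = sum(
        is_hit(predictions.get(query, []), linked, k)
        for query, linked in ground_truth.items()
    )
    return hits / len(ground_truth)


# Hypothetical ground truth: one LinkedIn profile may map to several HN accounts.
ground_truth = {
    "linkedin_a": {"hn_user_1", "hn_user_1_alt"},
    "linkedin_b": {"hn_user_2"},
}

# Hypothetical ranked candidates returned by some matcher, best guess first.
predictions = {
    "linkedin_a": ["hn_user_1_alt", "hn_user_9"],
    "linkedin_b": ["hn_user_7", "hn_user_2"],
}

top1 = top_k_accuracy(predictions, ground_truth, k=1)
top2 = top_k_accuracy(predictions, ground_truth, k=2)
print(top1, top2)
```

Returning the alt account for `linkedin_a` still counts as a hit, which is the point of the "any of the linked accounts" rule.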
- verdverm 3 hours ago
  They remove profile information from the HN dataset, relying only on comments, to prevent trivial matching, but my comments allow trivial matching without a doubt.
  I'm not sure if/how they selected for people actually trying to be anonymous versus someone like me who explicitly wants the connections to be easy and link it all over.
  Curious if I'm in the dataset, am I able to find out?
  Also, there is an old HN post that worked just on HN data, pre-LLM. Submit some text and it gives you the most likely HN users with confidence scores.
  We have a lot of "fingerprints", our writing being one. Interestingly, AI may actually be a way to anonymize your writing.
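As a rough illustration of the "submit some text, get the most likely users" idea, here is a minimal stylometry sketch using character trigram counts and cosine similarity. This is an assumed technique chosen for simplicity, not necessarily what that older HN tool did; the usernames and comment text are made up, and the scores are similarities, not calibrated confidences.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_authors(sample, corpus):
    """Return (user, similarity) pairs, most similar writing style first."""
    query = trigram_profile(sample)
    scores = {user: cosine(query, trigram_profile(text)) for user, text in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical corpus: each user's comments concatenated into one string.
corpus = {
    "user_a": "Honestly, I reckon the borrow checker is fine once you stop fighting it.",
    "user_b": "imo k8s is overkill for most startups lol, just use a single box",
}

ranking = rank_authors("I reckon the borrow checker is fine, honestly.", corpus)
for user, score in ranking:
    print(f"{user}: {score:.3f}")
```

Running a draft through a rewriter before posting would shift these trigram counts, which is one way to read the "AI as anonymizer" point above.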