31 points by tjruesch 18 hours ago | 5 comments
  • bob1029 2 hours ago
    This is interesting work. My approach so far has been to keep the PII as far away as possible from the LLM. Right now it's salted hashes if it's anything at all.

    I would be tempted to try a pseudonymous approach where inbound PII is mapped to a set of consistent, "known good" fake identities as we transition in and out of the AI layer.
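A minimal sketch of that pseudonymous mapping, assuming an in-memory table and a small, invented pool of fake identities. The class name `PseudonymMap` and the pool contents are hypothetical; a real pool would need to be far larger and the table would need the same protection as the PII itself.

```python
class PseudonymMap:
    """Maps real values to fake identities consistently, and back again."""

    def __init__(self, pool):
        self._unused = list(pool)   # fake identities not yet handed out
        self._forward = {}          # real -> fake
        self._reverse = {}          # fake -> real

    def to_fake(self, real):
        # Reuse the same fake identity every time the same real value appears.
        if real not in self._forward:
            if not self._unused:
                raise RuntimeError("fake identity pool exhausted")
            fake = self._unused.pop(0)
            self._forward[real] = fake
            self._reverse[fake] = real
        return self._forward[real]

    def to_real(self, fake):
        # Raises KeyError for identities that were never issued.
        return self._reverse[fake]


# Hypothetical pool of "known good" fake identities.
pseudonyms = PseudonymMap(["Alice Example", "Bob Example", "Carol Example"])
outbound = pseudonyms.to_fake("John Smith")   # sent into the AI layer
inbound = pseudonyms.to_real(outbound)        # restored on the way out
```

Because each fake identity is issued at most once, the reverse lookup is unambiguous, which a plain hash-into-a-pool scheme would not guarantee.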

    The key with PII is to avoid combining factors over time that produce a strong signal. This is a wide spectrum. Some scenarios will be slightly identifying just because they are rare. Zip+gender isn't a very strong signal, but zip+DOB+gender uniquely identifies a large number of people. You don't need to screw up with an email address or tax ID to leak an identity. Account balance over time might eventually be sufficient to target one person.
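One way to make the "combining factors" point concrete is to count how many records become unique under a given combination of fields. The sketch below uses invented toy records and a hypothetical helper `unique_fraction`; it illustrates the effect and is not a measurement of any real population.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    if not records:
        return 0.0
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Invented toy data, for illustration only.
records = [
    {"zip": "10001", "dob": "1980-01-01", "gender": "F"},
    {"zip": "10001", "dob": "1975-05-05", "gender": "F"},
    {"zip": "10001", "dob": "1980-01-01", "gender": "M"},
    {"zip": "10002", "dob": "1990-09-09", "gender": "M"},
    {"zip": "10002", "dob": "1991-03-03", "gender": "M"},
]

weak = unique_fraction(records, ["zip", "gender"])            # few unique rows
strong = unique_fraction(records, ["zip", "dob", "gender"])   # every row unique
```

Running the same check each time a new field is about to be exposed gives a rough early warning that a combination has become identifying.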

  • minixalpha 8 hours ago
    I'd like to know if there's a tool that can automatically replace sensitive information before I paste content into ChatGPT, and then automatically restore the sensitive information when I copy the results from ChatGPT. The logic for both "replacement" and "restoration" should be handled locally on my computer.
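A rough local sketch of that replace-then-restore round trip, assuming only email addresses are treated as sensitive and using a deliberately simple regex. The `[[EMAIL_n]]` placeholder format and the function names `mask` and `restore` are made up for this example and are not taken from the library discussed in the thread.

```python
import re

# Simplified email pattern; real-world detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask(text):
    """Replace emails with placeholders; return masked text and the local mapping."""
    seen = {}      # original -> placeholder
    mapping = {}   # placeholder -> original (stays on the local machine)

    def _substitute(match):
        original = match.group(0)
        if original not in seen:
            placeholder = f"[[EMAIL_{len(seen) + 1}]]"
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL_RE.sub(_substitute, text), mapping


def restore(text, mapping):
    """Put the original values back into text that still contains placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


masked, mapping = mask("Mail john@corp.example and jane@corp.example today.")
# `masked` is what gets pasted into the chat; `mapping` never leaves the machine.
roundtrip = restore(masked, mapping)
```

Restoration only works if the model echoes the placeholders back unchanged, so a format that is unlikely to be paraphrased or reformatted matters in practice.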
    • dsp_person 8 hours ago
      I've been thinking about playing with something like this.

      I'm curious how far you can push randomly replacing words and reversing it later.

      Even with code: say, take the structure of a big project, randomly remap the words in function names, and to some extent replace business logic with dummy code. Then use cloud LLMs for whatever purpose, and translate back.
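A naive sketch of the identifier-remapping part, assuming the rename table is supplied by hand and that the replacement names do not already occur in the source. The helpers `remap_identifiers` and `invert` are hypothetical; this is plain whole-word text substitution, not a parser-aware refactoring.

```python
import re


def remap_identifiers(source, renames):
    """Replace whole-word identifiers according to `renames` (old -> new)."""
    if not renames:
        return source
    alternatives = "|".join(re.escape(name) for name in renames)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    # Single pass, so a freshly inserted name is never renamed again.
    return pattern.sub(lambda match: renames[match.group(1)], source)


def invert(renames):
    """Build the reverse table used to translate the result back."""
    return {new: old for old, new in renames.items()}


source = "def charge_customer(invoice):\n    return invoice.total * tax_rate\n"
renames = {"charge_customer": "fn_a", "invoice": "var_b", "tax_rate": "var_c"}

obscured = remap_identifiers(source, renames)            # safe to send out
recovered = remap_identifiers(obscured, invert(renames))  # translated back
```

Since this works on raw text, it also rewrites matching words inside strings and comments; a tokenizer-based version would be needed to avoid that.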

  • welcome_dragon 9 hours ago
    Reversible as in you can re-identify? That doesn't sound secure.
    • bigiain 9 hours ago
      The post discusses that:

      Security First

      Because the “PII Map” (the link between ID:1 and John Smith) effectively is the PII, we treat it as sensitive material.

      The library includes a crypto module that forces AES-256-GCM encryption for the mapping table. The raw PII never leaves the local memory space, and the state object that persists between the masking and rehydration steps is encrypted at rest.

      I've bookmarked this as inspiration for a medium/long-term project I'm considering building. I'd like to be able to take dumps of our production database and automatically (one way) anonymize them, replacing all names with meaningless but semantically representative placeholders (gender-matched where obvious: Alice, Bob, Mallory, Eve, Trent perhaps; and gender-neutral ones like Jamie or Alex when suitable). I'd use similar techniques to rewrite email addresses (alice@example.org, bob@example.com, mallory@example.net) and addresses, place names, and whatever else can be pulled out with Named Entity Recognition. I suspect I'll generally be able to do a higher-accuracy version of this, since I'll have an understanding of the database structure and we're already in the process of adding metadata about table and column data sensitivity. I will definitely be checking out the regexes and NER models used here.
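A small sketch of how column sensitivity metadata could drive that kind of one-way anonymization. Everything here is an assumption for illustration: the `DumpAnonymizer` class, the shape of the metadata, and the placeholder lists. Gender matching and NER are left out entirely.

```python
PLACEHOLDER_NAMES = ["Alice", "Bob", "Mallory", "Eve", "Trent"]
EMAIL_DOMAINS = ["example.org", "example.com", "example.net"]


class DumpAnonymizer:
    """Rewrites sensitive columns with consistent placeholders for one run."""

    def __init__(self, column_kinds):
        # Hypothetical metadata: column name -> "person" or "email".
        self.column_kinds = column_kinds
        self._seen = {"person": {}, "email": {}}

    def _placeholder(self, kind, real):
        seen = self._seen[kind]
        if real not in seen:
            i = len(seen)
            name = PLACEHOLDER_NAMES[i % len(PLACEHOLDER_NAMES)]
            round_no = i // len(PLACEHOLDER_NAMES)
            if round_no:
                name = f"{name}{round_no + 1}"   # Alice2, Bob2, ... once the list wraps
            if kind == "email":
                domain = EMAIL_DOMAINS[i % len(EMAIL_DOMAINS)]
                seen[real] = f"{name.lower()}@{domain}"
            else:
                seen[real] = name
        return seen[real]

    def anonymize_row(self, row):
        # Non-sensitive columns and NULLs pass through unchanged.
        return {
            col: self._placeholder(self.column_kinds[col], val)
            if col in self.column_kinds and val is not None
            else val
            for col, val in row.items()
        }


anonymizer = DumpAnonymizer({"name": "person", "email": "email"})
row = anonymizer.anonymize_row(
    {"id": 1, "name": "John Smith", "email": "john@corp.test", "balance": 10}
)
```

The lookup tables live only in memory so that repeated values stay consistent within a run; discarding the object afterwards is what makes the result one way.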

      • tjruesch 3 hours ago
        That sounds interesting! I've been thinking about using representative placeholders as well, but while they have their strengths, there are also some downsides. We decided to go with an XML tag partly because it clearly marks the anonymized text as anonymized (for humans), so mixups don't happen. After reading your comment I think it would also be really interesting to be able to add custom metadata to the tags. For example, if you have a username that you want to anonymize, but your database has additional (deterministic) information like gender, we should add a callback that lets you, as the user, add this information to the tag.
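A hypothetical sketch of what such a metadata callback could look like. The tag name `PII`, the attribute names, and the function `make_tag` are invented for this example and do not reflect the library's actual tag format or API.

```python
from xml.sax.saxutils import quoteattr


def make_tag(kind, ident, original, metadata_callback=None):
    """Build a self-closing placeholder tag, optionally enriched by a user callback."""
    attrs = {"type": kind, "id": str(ident)}
    if metadata_callback is not None:
        extra = metadata_callback(kind, original) or {}
        for name, value in extra.items():
            # Built-in attributes win; the callback cannot override type or id.
            if name not in attrs:
                attrs[name] = value
    rendered = " ".join(
        f"{name}={quoteattr(str(value))}" for name, value in attrs.items()
    )
    return f"<PII {rendered}/>"


# Stand-in for a database lookup of deterministic extra information.
KNOWN_GENDER = {"jsmith": "male"}


def add_gender(kind, original):
    if kind == "USERNAME" and original in KNOWN_GENDER:
        return {"gender": KNOWN_GENDER[original]}
    return {}


tag = make_tag("USERNAME", 1, "jsmith", metadata_callback=add_gender)
```

Any attribute added this way travels with the masked text, so each one is another factor that can combine into an identifying signal.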
    • fluidcruft 9 hours ago
      My hope is it means it assigns coded identifiers and the key remains local. When the document returns, the identifiers can be restored. So the PII itself never leaves the premises.
      • tjruesch 3 hours ago
        That's exactly right. PII stays local (and the PII-Tag-Map is encrypted).
  • handfuloflight 10 hours ago
    This is an awesome share and development. Kudos!