I’ve spent a lot of time working on data pipelines, and one of the most frustrating problems is accidentally syncing PII or developer secrets (like AWS keys or SSNs) into a data warehouse or downstream system.
Most of the enterprise tools that solve this are massive Java applications, require complex Python environments, or cost $50k/year. I just wanted a lightning-fast single binary I could drop into a CI/CD pipeline (--fail-on-pii) or run locally against a Postgres DB to see my exposure. So I built pii-hound.
A few technical details on how it works under the hood:
Memory Efficiency: Scanning a 50GB CSV file shouldn't cause an OOM error. It uses a concurrent, streaming architecture with reservoir sampling, so it can sample huge datasets in a single sequential pass while keeping the sample uniformly random and the memory footprint tiny and fixed.
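If you haven't seen it, the trick is Vitter's Algorithm R: keep the first k rows, then as each new row n streams past, replace a random slot with probability k/n. Here's a minimal, self-contained Go sketch of that step (not the actual pii-hound code; the filename and sample size are made up):

    package main

    import (
        "bufio"
        "fmt"
        "math/rand"
        "os"
    )

    // reservoirSample returns a uniform random sample of up to k lines
    // from a stream of unknown length, using O(k) memory (Algorithm R).
    func reservoirSample(sc *bufio.Scanner, k int) []string {
        sample := make([]string, 0, k)
        seen := 0
        for sc.Scan() {
            seen++
            if len(sample) < k {
                sample = append(sample, sc.Text())
                continue
            }
            // Keep the new line with probability k/seen, evicting a
            // random existing entry; this preserves uniformity.
            if j := rand.Intn(seen); j < k {
                sample[j] = sc.Text()
            }
        }
        return sample
    }

    func main() {
        f, err := os.Open("huge.csv") // hypothetical input file
        if err != nil {
            panic(err)
        }
        defer f.Close()
        rows := reservoirSample(bufio.NewScanner(f), 1000)
        fmt.Printf("sampled %d of the rows seen\n", len(rows))
    }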
Speed: For the keyword and column-name heuristics, I implemented Aho-Corasick string matching, which is significantly faster than running dozens of individual regexes against every header.
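Aho-Corasick builds a single automaton over the whole keyword dictionary, so each header is scanned once, in time linear in its length, no matter how many keywords there are. pii-hound ships its own implementation; the sketch below shows the same idea using the off-the-shelf cloudflare/ahocorasick library, with an example keyword list:

    package main

    import (
        "fmt"
        "strings"

        "github.com/cloudflare/ahocorasick"
    )

    func main() {
        // One automaton for the whole dictionary; Match does a single
        // linear scan and returns the indices of matched keywords.
        keywords := []string{"ssn", "email", "dob", "passport", "card"}
        m := ahocorasick.NewStringMatcher(keywords)

        header := "Customer_Email_Address"
        for _, i := range m.Match([]byte(strings.ToLower(header))) {
            fmt.Printf("column %q matched keyword %q\n", header, keywords[i])
        }
    }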
Accuracy: To cut down on false positives, candidates like credit card numbers aren't flagged on regex alone; every regex hit is piped through a Luhn checksum validation step first.
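The check itself is tiny. Here's roughly what that validation step looks like (a sketch, not the exact code), using two classic test-card numbers as the demo:

    package main

    import "fmt"

    // luhnValid reports whether s passes the Luhn checksum: walking
    // right to left, every second digit is doubled (minus 9 if the
    // result exceeds 9), and the total must be divisible by 10.
    func luhnValid(s string) bool {
        sum, double := 0, false
        for i := len(s) - 1; i >= 0; i-- {
            if s[i] < '0' || s[i] > '9' {
                return false
            }
            d := int(s[i] - '0')
            if double {
                d *= 2
                if d > 9 {
                    d -= 9
                }
            }
            sum += d
            double = !double
        }
        return sum%10 == 0
    }

    func main() {
        fmt.Println(luhnValid("4242424242424242")) // true: valid checksum
        fmt.Println(luhnValid("4242424242424241")) // false: fails checksum
    }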
Full transparency: I originally wrote the core of this scanning engine for a larger data management platform I’m building called Saddle Data. But I realized the scanner itself is incredibly useful as a standalone utility, so I extracted it, polished the CLI, and open-sourced it under the MIT license.
It currently supports Postgres, MySQL, Snowflake, BigQuery, SQLite, S3, GCS, and local files (CSV/JSON/Parquet).
I'd love for you to point it at a local database or a messy CSV and let me know how it performs. Happy to answer any questions about the Go implementation, and PRs for new regex rules or source connectors are very welcome!
Beyond that, pii-hound supports custom rules, so you can add patterns for things only your data contains, e.g. internal identifiers or a dictionary of known names.
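To give a flavor of what a rule can express, here's a Go sketch (everything below is invented for illustration and isn't the actual rule schema):

    package main

    import (
        "fmt"
        "regexp"
    )

    // rule is a hypothetical name-plus-pattern pair, the minimal shape
    // a custom detector needs.
    type rule struct {
        name    string
        pattern *regexp.Regexp
    }

    func main() {
        // e.g. flag an internal employee-ID format the built-in
        // detectors know nothing about.
        r := rule{
            name:    "employee_id",
            pattern: regexp.MustCompile(`\bEMP-\d{5}\b`),
        }
        if r.pattern.MatchString("approved_by: EMP-00421") {
            fmt.Println("hit:", r.name)
        }
    }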
Name detection in general is the hardest gap to close with pure pattern matching, and I'm open to ideas for other ways to close it.