PAC is based on MIT's PAC Privacy framework: it hashes each privacy unit's key into a 64-bit value to create 64 sub-samples ("possible worlds"). Your aggregate runs on all 64 worlds independently; the result comes from one secret world, noised using the variance across all of them. Each query uses a different hash and a different secret world, which makes membership inference provably hard.
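To make the idea concrete, here is a rough Python sketch of the flow described above. It is not the extension's code. It assumes that bit j of the hash decides whether a privacy unit is in world j, uses BLAKE2b as a stand-in hash, and adds Gaussian noise scaled by a hypothetical `noise_scale` knob; the real noise calibration is defined in the paper. The function names are made up for illustration.

```python
import hashlib
import random
import statistics

WORLDS = 64  # one possible world per bit of the 64-bit hash


def world_mask(pu_key, query_salt):
    """Hash a privacy-unit key to 64 bits (assumption: bit j set => unit is in world j)."""
    data = f"{query_salt}:{pu_key}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()  # stand-in hash
    return int.from_bytes(digest, "little")


def world_sums(rows, query_salt):
    """rows: iterable of (pu_key, value). Returns SUM(value) for each of the 64 worlds."""
    sums = [0.0] * WORLDS
    for pu_key, value in rows:
        mask = world_mask(pu_key, query_salt)
        for j in range(WORLDS):
            if (mask >> j) & 1:
                sums[j] += value
    return sums


def noisy_sum(rows, query_salt, rng, noise_scale=1.0):
    """Release one secret world's SUM plus noise driven by the variance across worlds."""
    sums = world_sums(rows, query_salt)
    secret = rng.randrange(WORLDS)        # the secret world for this query
    spread = statistics.pvariance(sums)   # variance across all 64 worlds
    # noise_scale is a hypothetical knob, not a parameter of the extension
    noise = rng.gauss(0.0, (noise_scale * spread) ** 0.5)
    return sums[secret] + noise
```

Calling `noisy_sum(rows, "query-1", random.Random())` with a fresh salt per query mirrors the "different hash and secret world per query" part. The salt name and format are placeholders.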
Those 64 possible worlds map perfectly to 64-bit SIMD registers. We bitslice the entire computation, evaluating all the worlds in a single pass over the data using CPU vector instructions. The average overhead is only ~2x, even at large scale factors.
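For readers unfamiliar with bitslicing, here is a small Python sketch of one classic form of it: a COUNT kept for all 64 worlds at once in "vertical" counters, where word i stores bit i of every world's counter. This is only an illustration of the general technique under that assumption, not necessarily how the extension lays out its state, and `bitsliced_count` is a made-up name.

```python
MASK64 = (1 << 64) - 1


def bitsliced_count(masks, n_planes=32):
    """For each of 64 worlds, count how many masks have that world's bit set.

    planes[i] is a 64-bit word holding bit i of all 64 counters, so adding one
    row's mask is a ripple-carry add done with a few word-wide XOR/AND ops
    instead of a 64-iteration loop. Counters wrap at 2**n_planes.
    """
    planes = [0] * n_planes
    for mask in masks:
        carry = mask & MASK64  # add 1 to every world whose bit is set
        for i in range(n_planes):
            if carry == 0:
                break  # no world has a carry left to propagate
            # Right-hand side uses the old planes[i] and carry for both results.
            planes[i], carry = planes[i] ^ carry, planes[i] & carry
    # Transpose back: rebuild each world's counter from its bit in every plane.
    return [
        sum(((planes[i] >> j) & 1) << i for i in range(n_planes))
        for j in range(64)
    ]
```

The inner loop usually stops after a handful of planes, because carries die out quickly, which is what makes a single pass over the data cheap for all 64 worlds together.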
Example:
-- Generate TPC-H benchmark data
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf=1);
-- Mark customer as the privacy unit
ALTER TABLE customer ADD PAC_KEY (c_custkey);
ALTER TABLE customer SET PU;
-- Protect sensitive customer columns
ALTER PU TABLE customer ADD PROTECTED (c_custkey);
ALTER PU TABLE customer ADD PROTECTED (c_name);
ALTER PU TABLE customer ADD PROTECTED (c_address);
ALTER PU TABLE customer ADD PROTECTED (c_acctbal);
-- Define join chain: lineitem -> orders -> customer
ALTER TABLE orders ADD PAC_LINK (o_custkey) REFERENCES customer(c_custkey);
ALTER TABLE lineitem ADD PAC_LINK (l_orderkey) REFERENCES orders(o_orderkey);
-- Aggregates on linked tables are automatically noised
SELECT l_returnflag, l_linestatus, SUM(l_extendedprice)
FROM lineitem GROUP BY ALL;
Works with joins, subqueries (correlated & uncorrelated), CTEs, GROUP BY, HAVING, ORDER BY, LIMIT. Also runs in your browser via WASM (shell.duckdb.org).
Paper: https://arxiv.org/abs/2603.15023
Extension page: https://duckdb.org/community_extensions/extensions/pac
We're looking for feedback, especially on edge cases, usability and query coverage.