Zero-retention at the inference layer — processing the prompt purely in memory, logging only metadata (latency, verdict, rule hits), and discarding the payload immediately — reduces the exposure surface considerably. It doesn't solve identity linkage at the payment layer, but it means there's nothing to subpoena, breach, or misuse at the content layer.
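The metadata-only logging idea can be sketched in a few lines. This is a hypothetical illustration, not any particular vendor's implementation: the prompt is checked in memory, and only latency, verdict, and rule hits survive to be logged.

```python
import time
from typing import Callable

def zero_retention_check(prompt: str,
                         rules: dict[str, Callable[[str], bool]]) -> dict:
    """Run rule checks against a prompt entirely in memory.

    Only metadata (latency, verdict, which rules fired) is returned
    for logging; the prompt text itself is never persisted.
    Rule names and structure here are made up for illustration.
    """
    start = time.monotonic()
    hits = [name for name, rule in rules.items() if rule(prompt)]
    latency_ms = (time.monotonic() - start) * 1000
    # This record is all that reaches the log -- note it contains
    # no part of the prompt payload.
    return {
        "latency_ms": round(latency_ms, 2),
        "verdict": "flag" if hits else "pass",
        "rule_hits": hits,
    }

record = zero_retention_check(
    "wire $50,000 to account 12345",
    {"mentions_money": lambda p: "$" in p},
)
print(record["verdict"])  # prompt is now out of scope and garbage-collected
```

Once the function returns, there is nothing content-level left to subpoena or breach; the trade-off is that you also can't debug a misfire by replaying the original prompt.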
For regulated industries this distinction matters a great deal. "We never stored it" is a much stronger compliance position than "we stored it but encrypted it."
The way I'd think about it: run a local model for anything it can handle, and keep that work on your machine entirely. Then for the tasks that actually need a more powerful model, don't just dump your whole context in; strip the prompt down to just the specific thing you need answered, like "given X, what's Y," with no surrounding story. The hosted model sees only a decontextualized task, not what's actually going on.
You basically become your own privacy layer. More friction, but the provider never gets the full picture.
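The routing described above can be sketched like this. Everything here is hypothetical (the stub models, the trivial `strip_context`); the point is only to show the shape: full context stays with the local model, and the hosted model never receives more than the stripped question.

```python
from typing import Callable

def strip_context(question: str) -> str:
    """Reduce a request to a decontextualized 'given X, what's Y' task.

    In practice you'd do this by hand or with the local model; this
    placeholder just drops the surrounding story entirely.
    """
    return f"Answer briefly: {question}"

def route(full_context: str, question: str, local_can_handle: bool,
          local_model: Callable[[str], str],
          hosted_model: Callable[[str], str]) -> str:
    if local_can_handle:
        # Full context never leaves the machine.
        return local_model(full_context)
    # Hosted model only ever sees the stripped-down task.
    return hosted_model(strip_context(question))

# Stub models standing in for real inference endpoints.
seen_by_hosted: list[str] = []
def hosted(prompt: str) -> str:
    seen_by_hosted.append(prompt)
    return "hosted answer"

local = lambda prompt: "local answer"

route("my entire medical history ...", "is dose X safe with drug Y?",
      local_can_handle=False, local_model=local, hosted_model=hosted)
```

After the call, `seen_by_hosted` contains only the stripped question, which is the whole point: the provider gets a fragment it can't reassemble into your situation.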
Legal department approved
This is enough for me personally. Besides which, large-scale deanonymization is now powered by LLMs and agents. Even with a VPN and the like, you leave so many "fingerprints" that it's actually pretty easy to narrow things down to the point where someone can work out exactly who you are.