1 point by Sadam_H 7 hours ago | 1 comment
  • Sadam_H 7 hours ago
    Hey HN! I'm a student who built this over the past 5 months. Why I built this: Every project I worked on hit the same wall. I couldn't use real data because of HIPAA/GDPR, public datasets were too generic, and mocking data by hand was painful. Existing tools like Gretel or Tonic are enterprise-priced and closed-source.

    So I built an open-source alternative that does two things. Schema mode: define columns and generate up to 1M rows (no training data needed). ML mode: upload a CSV to train CTGAN/TVAE/GaussianCopula and generate high-fidelity synthetic data.
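
    To make the schema-mode idea concrete, here is a rough, stdlib-only sketch of generating rows from column definitions. It is an illustration, not the code in the repo: the Schema type, generate_rows, and the example columns are all hypothetical names.

```python
import random
from typing import Any, Callable, Dict, Iterator

# Hypothetical schema format: column name -> function that draws one value
# from a random number generator. Not the project's actual schema format.
Schema = Dict[str, Callable[[random.Random], Any]]


def generate_rows(schema: Schema, n_rows: int, seed: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield n_rows synthetic records built only from the column definitions."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    for _ in range(n_rows):
        yield {name: draw(rng) for name, draw in schema.items()}


# Illustrative columns; names and ranges are made up for this example.
patient_schema: Schema = {
    "age": lambda rng: rng.randint(18, 90),
    "blood_type": lambda rng: rng.choice(["A", "B", "AB", "O"]),
    "systolic_bp": lambda rng: round(rng.gauss(120, 15), 1),
}

rows = list(generate_rows(patient_schema, n_rows=1000))
```

    Because rows are produced lazily by a generator, the same loop can stream a large row count to a file without holding every record in memory.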

    Tech stack: Frontend: Next.js 15, TypeScript, Tailwind

    Backend: FastAPI, PostgreSQL, Redis

    ML: SDV library (CTGAN, TVAE, GaussianCopula)

    Privacy: Differential privacy with $(\epsilon, \delta)$ guarantees.

    Auth: Better Auth (self-hosted)

    Deployment: Docker Compose

    Hardest technical challenge: Getting the differential privacy parameters right. The $\epsilon$ (epsilon) budget trades privacy against utility: a smaller $\epsilon$ means stronger privacy but noisier data. Too strict makes the data useless; too loose leaks private information. I ended up exposing it as a configurable slider with sensible defaults and documentation.
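
    To show the trade-off in numbers, here is a minimal sketch using the textbook Gaussian mechanism, where the noise scale is $\sigma = \sqrt{2\ln(1.25/\delta)}\,\Delta/\epsilon$ for $0 < \epsilon < 1$. This is an assumption chosen for illustration and not necessarily the mechanism the project uses; gaussian_sigma and private_mean are hypothetical helper names.

```python
import math
import random
from typing import Sequence


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float = 1.0) -> float:
    """Noise scale of the classic Gaussian mechanism (bound holds for 0 < epsilon < 1)."""
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("this bound requires 0 < epsilon < 1 and 0 < delta < 1")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def private_mean(values: Sequence[float], lower: float, upper: float,
                 epsilon: float, delta: float, rng: random.Random) -> float:
    """Release a noisy mean of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record (dataset size fixed) moves the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return sum(clipped) / len(clipped) + rng.gauss(0, sigma)


# Smaller epsilon (stricter privacy) means proportionally more noise.
for eps in (0.1, 0.5, 0.9):
    print(f"epsilon={eps}: sigma={gaussian_sigma(eps, delta=1e-5):.2f}")
```

    Since $\sigma$ scales as $1/\epsilon$, halving the budget doubles the noise, which is why a slider position near zero can make the output useless.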

    Pricing/Openness: 100% MIT licensed (fork it, host it, modify it). Self-host: run docker-compose up and you're running. No tracking or data collection on self-hosted instances.

    Try it out: Live playground (no signup): https://www.synthdata.studio/playground

    GitHub: https://github.com/Urz1/synthetic-data-studio

    I’d love to hear your feedback on the architecture, privacy implementation, or what features would make this useful for your workflow!