Treating data, schema, and ETL as versioned first-class assets makes a lot of sense, and the AI angle only works because rollback is built in.
Curious how you handle branching/merging at scale and how this compares to Iceberg/Delta time travel. If you can nail trust, observability, and cost predictability, this could be a meaningful primitive for data engineers.
> Curious how you handle branching/merging at scale and how this compares to Iceberg/Delta time travel.
At a high level, we build on Iceberg's primitives for data versioning, which should give customers confidence in our reliability. On top of that, we separately version schema and ETL, and then tie all three together (versions of data, schema, and ETL) so that rollbacks are simple and smooth. Did I mention that we support cascading rollbacks :D
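To make that concrete, here's a rough sketch of the idea (simplified and illustrative only; the names and structure here are made up for this comment, not our actual API):

```python
from dataclasses import dataclass

# Illustrative only -- a "commit" pins one consistent state of the lake:
# a data snapshot, a schema version, and the ETL code that produced it.

@dataclass(frozen=True)
class Commit:
    data_snapshot_id: str   # e.g. an Iceberg table snapshot
    schema_version: str     # versioned schema definition
    etl_version: str        # git SHA (or similar) of the pipeline code

class Lake:
    def __init__(self) -> None:
        self.history: list[Commit] = []

    def commit(self, data_snapshot_id: str, schema_version: str, etl_version: str) -> Commit:
        c = Commit(data_snapshot_id, schema_version, etl_version)
        self.history.append(c)
        return c

    def rollback(self, to: Commit) -> None:
        # Cascading rollback: because the three versions are recorded as one
        # commit, reverting the data snapshot also reverts the schema and the
        # ETL code that produced it -- you never end up with, say, old data
        # paired with a new schema.
        self.history = self.history[: self.history.index(to) + 1]

lake = Lake()
good = lake.commit("snap-101", "schema-v7", "etl-3f2a9c")
lake.commit("snap-102", "schema-v8", "etl-91bd04")   # bad deploy
lake.rollback(to=good)                               # all three revert together
```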
On branching: yes, we support branching your whole data lake (again built on the strong primitives discussed above). A typical workflow looks like: (1) create a new branch for a new feature or a major schema refactor, (2) make changes and test them out, (3) once confident, promote the changes to prod.
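In pseudocode, that flow looks roughly like the sketch below (again, `LakeClient` and its methods are hypothetical names for this comment, not our real SDK):

```python
# Illustrative sketch of the branch -> test -> promote flow described above.

class LakeClient:
    def __init__(self) -> None:
        self.branches: dict[str, list[str]] = {"main": []}   # branch -> commits

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # A new branch starts as a cheap pointer to the parent's history.
        self.branches[name] = list(self.branches[from_branch])

    def commit(self, branch: str, change: str) -> None:
        self.branches[branch].append(change)

    def promote(self, branch: str, into: str = "main") -> None:
        # Promotion fast-forwards prod to the tested branch state.
        self.branches[into] = list(self.branches[branch])

lake = LakeClient()

# (1) branch off prod for the schema refactor
lake.create_branch("refactor-orders-schema")

# (2) make and test changes in isolation -- prod ("main") is untouched
lake.commit("refactor-orders-schema", "rename orders.amount -> orders.total")
lake.commit("refactor-orders-schema", "backfill orders.total")

# (3) once the tests pass, promote to prod
lake.promote("refactor-orders-schema")
```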
> If you can nail trust, observability, and cost predictability, this could be a meaningful primitive for data engineers.
Yes, this is exactly how we're thinking about it too, and we're actively working on features around trust, observability, and cost predictability.