A few things surprised me:
- Delta compression mostly doesn’t help across training checkpoints (the encoded deltas can come out larger than the original tensors)
- File-level deduplication (e.g. DVC) doesn’t capture most of the redundancy
- Almost all storage savings come from exact tensor identity, not partial overlap
For things like warm-start tree models and transfer learning, this ends up working really well. Curious if anyone has seen different behavior with larger models or different chunk sizes.