I work at Anam and am responsible for the preparation of our training data.
I built Metaxy to avoid re-running expensive data processing steps on irrelevant upstream changes.
It's an open-source Python framework that can track granular sample versions and is infrastructure-agnostic.
See the full blog post here: https://anam.ai/blog/metaxy.
Happy to answer questions.
P.S. While Claude Code helped me a lot, this is *not* a vibe-coded project: I treated AI code like any line could be wrong, created a massive test suite, and Metaxy has been running in our production for about 2.5 months now.