If the sharding key matches (or is a subset of) a join or group-by key, then all rows sharing a key value are local to a single shard, so each shard can be processed independently.
This is typically done at coarse granularity (e.g. one shard per MPP compute node), but there are also benefits down to the core or thread level.
Another tip: if no shard key is defined, hash the whole row as a default.
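A toy Python sketch of the idea (the "user" column, the shard count, and the aggregation are made up for illustration):

```python
import hashlib
from collections import defaultdict

def shard_of(key, n_shards):
    """Map a key (or a whole row, e.g. as a sorted tuple of items) to a shard."""
    return int(hashlib.sha1(repr(key).encode()).hexdigest(), 16) % n_shards

rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

N_SHARDS = 4
shards = defaultdict(list)
for row in rows:
    # Shard key = "user". With no shard key defined, the fallback would be
    # hashing the whole row: shard_of(tuple(sorted(row.items())), N_SHARDS).
    shards[shard_of(row["user"], N_SHARDS)].append(row)

# Since the shard key ("user") is a subset of the group-by key, every row for
# a given user lives on exactly one shard, so each shard's partial aggregate
# is already the final answer for its users -- no cross-shard merge step.
totals = defaultdict(int)
for shard_rows in shards.values():
    for row in shard_rows:
        totals[row["user"]] += row["amount"]
print(dict(totals))  # {'a': 17, 'b': 5}
```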
I'm not really familiar with how datasets are managed by them, but all of the table formats (Iceberg, Delta, and Hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets themselves.

A fun idea might be to take a table format like Iceberg and, instead of using Parquet to store the data, just store the column data with the metadata externally defined somewhere else. On each new snapshot, a set of transformations (sorting, splitting blocks, etc.) could be applied that minimizes the potential byte diff against the previous snapshot.
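A rough pyarrow sketch of that last idea (the "id" sort key, the block size, and the file name are placeholders; the same normalization applies whether the blocks end up as Parquet or as raw Arrow buffers with external metadata):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_normalized_snapshot(table: pa.Table, path: str) -> None:
    """Write a snapshot in a deterministic layout so the byte-level diff
    against the previous snapshot stays small when only a few rows change."""
    # 1. Sort by a stable key so unchanged rows land in the same order and blocks.
    table = table.sort_by([("id", "ascending")])
    # 2. Fix the block (row group) size so an insert only perturbs nearby blocks
    #    instead of reflowing the whole file.
    pq.write_table(table, path, row_group_size=64_000, compression="zstd")

snapshot = pa.table({"id": [3, 1, 2], "value": ["c", "a", "b"]})
write_normalized_snapshot(snapshot, "dump_v2.parquet")
```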
11PB on S3 would cost ~$250k per month / $3m per year.
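(Back-of-the-envelope, assuming S3 Standard at roughly $0.021–0.023/GB-month: 11 PB ≈ 11,000,000 GB × ~$0.022 ≈ $240k/month, i.e. a bit under $3M/year before request and egress costs.)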
HuggingFace has raised almost $400M.
Not saying it's nothing, but probably not a big deal to them (e.g. ~10 of their 400+ staff cost more than that).
Arrow is designed for zero-copy IPC -- its serialized layout is the same as its in-memory layout, which is what makes it mmappable.
Parquet is an on-disk format, designed to be space efficient.
So, for example, Parquet supports general-purpose compression in addition to dictionary and RLE encodings. General-purpose compression forces you to make copies, but if you're streaming from disk the extra cost of decompressing blocks is acceptable.
Arrow doesn't use general-purpose compression because it would force copies to be made, and the decompression would dominate compute costs for data that's already in memory.
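A small pyarrow sketch of the trade-off (file names are placeholders; zstd assumes a pyarrow build with zstd support, which the standard wheels have):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000)), "tag": ["x"] * 1_000_000})

# Parquet: dictionary/RLE encodings plus general-purpose compression (zstd here).
# Reading it back always decompresses into freshly allocated buffers.
pq.write_table(table, "data.parquet", compression="zstd", use_dictionary=True)
decoded = pq.read_table("data.parquet")

# Arrow IPC file: buffers are laid out exactly as they'd sit in memory, so the
# file can be memory-mapped and used with zero copies.
with pa.ipc.new_file("data.arrow", table.schema) as writer:
    writer.write_table(table)
with pa.memory_map("data.arrow", "r") as source:
    mapped = pa.ipc.open_file(source).read_all()  # columns point into the mapped pages
```

The Parquet file is much smaller on disk; the Arrow file is the one you'd mmap and scan in place.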