Half the dataset being synthetic is interesting. I wonder what that actually means. They say that Datology needed 2048 H100s to generate the synthetic data. Does that mean they were generating data using other open weight LLMs? Seems like that would undermine the integrity of a "US based" dataset.
Presumably they wouldn't be training on synthetic data produced by anything less than an open frontier model, and those are almost exclusively Chinese.