But the scarier part is that nobody needs to try: the accidental contamination is already happening. Models train on web data, their outputs end up back on the web, and the next generation trains on that. Dohmatob et al. showed that as little as 0.1% synthetic contamination is enough to cause measurable degradation. And right now, no major dataset (FineWeb, RedPajama, C4) filters for AI-generated content.
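For intuition, here's a minimal sketch of that feedback loop, with a Gaussian fit standing in for an LLM. To be clear, this is my illustration, not Dohmatob et al.'s actual experiments: `next_generation`, the sample sizes, and the synthetic shares are all made up for the demo. Each generation fits a distribution to a mix of fresh human data and the previous generation's published outputs, and its own samples feed the next round:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MU, TRUE_SIGMA = 0.0, 1.0  # stand-in for the human-written distribution

def next_generation(mu, sigma, synthetic_frac, n=50):
    """Fit generation g+1 (a max-likelihood Gaussian) to a small corpus
    mixing fresh human data with generation g's published outputs."""
    n_synth = int(n * synthetic_frac)
    human = rng.normal(TRUE_MU, TRUE_SIGMA, n - n_synth)
    synth = rng.normal(mu, sigma, n_synth)  # generation g's outputs, now on the web
    corpus = np.concatenate([human, synth])
    return corpus.mean(), corpus.std()

for frac in (0.0, 0.1, 1.0):
    mu, sigma = TRUE_MU, TRUE_SIGMA
    for _ in range(500):  # 500 train -> publish -> retrain cycles
        mu, sigma = next_generation(mu, sigma, frac)
    print(f"synthetic share {frac:6.1%}: mu={mu:+.3f}  sigma={sigma:.3f}")
```

In this toy, full recursion (100% synthetic) collapses the fitted sigma toward zero, while runs anchored by fresh human data barely drift. It demonstrates the loop's mechanism, not the 0.1% threshold: Dohmatob et al.'s point is that in actual LLM training, even tiny synthetic shares show up as measurable degradation.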
What makes this harder to think about is that data quality and model performance don't always follow "garbage in, garbage out." I wrote about a related paradox where Qwen2.5-Math, trained with deliberately wrong reward signals, still improved almost as much as with correct ones: https://ai.gopubby.com/false-rewards-make-ai-smarter-paradox...
Models are simultaneously fragile to recursive contamination and weirdly resilient to corrupted training signals. The picture is messier than either side suggests.