Training Data Pipelines for Generative AI: Deduplication, Filtering, and Mixture Design