Comprehensive guide to cleaning LLM training data, covering data sources, preprocessing, deduplication, normalization, language detection, PII removal, toxicity filtering, quality scoring, and anomaly detection. Includes Python code built on open-source tools (NLTK, spaCy, and fastText), along with reproducible pipelines, logging, metrics, and real-world examples. Covers best practices for scalable datasets, compliance, and model performance.
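
As a rough illustration of how a few of the listed stages fit together, the sketch below chains normalization, a simple form of PII removal, a length filter, and exact deduplication, and reports basic metrics. It is a minimal example under stated assumptions, not the guide's own pipeline: it uses only the Python standard library rather than NLTK, spaCy, or fastText, and the names `normalize`, `redact_emails`, `clean_corpus`, and the `min_chars` threshold are hypothetical. The email pattern is deliberately simple and would miss many real-world PII formats.

```python
import hashlib
import re
import unicodedata

# Deliberately simple email pattern (an assumption for illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def clean_corpus(docs, min_chars=20):
    """Normalize, redact, length-filter, and exactly deduplicate documents.

    Returns the kept documents and a dict of per-stage counts.
    """
    seen = set()  # SHA-256 digests of documents already kept
    kept = []
    stats = {"input": 0, "too_short": 0, "duplicate": 0, "kept": 0}
    for doc in docs:
        stats["input"] += 1
        text = redact_emails(normalize(doc))
        if len(text) < min_chars:
            stats["too_short"] += 1
            continue
        # Case-insensitive exact match; near-duplicates are not detected.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            stats["duplicate"] += 1
            continue
        seen.add(digest)
        kept.append(text)
        stats["kept"] += 1
    return kept, stats


if __name__ == "__main__":
    sample = [
        "Contact  me at jane@example.com for details.",
        "contact me at jane@example.com for details.",
        "too short",
        "A distinct document that is long enough.",
    ]
    cleaned, counts = clean_corpus(sample)
    print(cleaned)
    print(counts)
```

Hashing the lowercased, normalized text catches only exact duplicates; near-duplicate detection (for example with MinHash) and model-based stages such as language detection or toxicity filtering would slot into the same loop as additional filters.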