
LLM Training Data Cleaning Tutorial

Comprehensive guide to LLM training data cleaning: sources, preprocessing, deduplication, normalization, language detection, PII removal, toxicity filtering, quality scoring, and anomaly detection. Includes Python code and open-source tools using NLTK, spaCy, and FastText, with reproducible pipelines, logging, metrics, and real-world examples. Best practices for scalable datasets, compliance, and model performance.

2025-11-03

Pre-training data is the core foundation of an LLM's capabilities, characterized by “massive scale”, “diverse types”, and “multi-domain coverage”. The goal is for the model to master language rules, world knowledge, and logical reasoning by learning from massive amounts of text.

LLM training data comes from a wide range of sources, mainly divided into the following categories:

1. Public text collections: structured, high-quality base data such as Wikipedia, academic paper repositories, and classic book collections.

2. Web crawler data: the core source of coverage breadth. These are text collections crawled from search engines and from highly rated links on Reddit (a social news platform), including blogs, articles, and novels. Because users have already filtered this content (high ratings suggest higher quality), it became an important data source for early GPT models.

3. Books and publications: carriers of in-depth knowledge. Books (especially non-fiction) contain systematic, in-depth knowledge that is crucial for models to understand complex logic and professional fields.

4. Dialogue and interaction data: optimizes interaction capabilities. Some models incorporate dialogue data to strengthen their conversational ability and better match human interaction habits.

5. Code data: strengthens programming capabilities. Open-source projects on GitHub and content from Stack Overflow (a programmer Q&A platform) help models learn the correspondence between problems and code.

Data Preprocessing: The Key from “Raw” to “Usable”

Raw data cannot be directly used for training and needs to go through strict processing:

Data cleaning for large language models is a key step in improving training effectiveness. It mainly involves removing noise, deduplicating content, filtering low-quality text, and standardizing formats.

import re
import string
import hashlib
from typing import List, Tuple
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources (required for first run)
nltk.download('stopwords')
nltk.download('punkt')

class LLMDataCleaner:
    def __init__(self):
        """Initialize data cleaning utility and load resources like stop words"""
        self.stop_words = set(stopwords.words('english'))  # English stop words
        # Extended custom stop words (can be extended as needed)
        self.custom_stop_words = {"http", "https", "www", "com", "html", "jpg", "png"}
        self.stop_words.update(self.custom_stop_words)

    def remove_duplicates(self, texts: List[str]) -> Tuple[List[str], int]:
        """
        Remove duplicate texts
        :param texts: List of texts
        :return: Tuple of (deduplicated text list, number of duplicates removed)
        """
        # Use hash values for fast duplicate detection
        seen = set()
        unique_texts = []
        duplicates = 0

        for text in texts:
            # Normalize text before hashing (ignore case and leading/trailing spaces)
            text_normalized = text.strip().lower()
            text_hash = hashlib.md5(text_normalized.encode()).hexdigest()

            if text_hash not in seen:
                seen.add(text_hash)
                unique_texts.append(text)
            else:
                duplicates += 1

        return unique_texts, duplicates

    def filter_short_texts(self, texts: List[str], min_length: int = 10) -> Tuple[List[str], int]:
        """
        Filter out excessively short texts (usually containing noise)
        :param texts: List of texts
        :param min_length: Minimum character length threshold
        :return: Tuple of (filtered text list, number of short texts removed)
        """
        filtered = []
        removed = 0

        for text in texts:
            if len(text.strip()) >= min_length:
                filtered.append(text)
            else:
                removed += 1

        return filtered, removed

    def clean_special_characters(self, text: str) -> str:
        """
        Clean special characters, garbled code, and extra spaces
        :param text: Original text
        :return: Cleaned text
        """
        # Remove URLs
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

        # Remove HTML tags
        text = re.sub(r'<.*?>', '', text)

        # Remove special characters and garbled code (retain basic punctuation and alphanumerics)
        text = re.sub(r'[^\w\s.,!?\'\"-]', '', text)

        # Merge multiple spaces into one
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    def remove_stopwords(self, text: str, language: str = 'english') -> str:
        """
        Remove stop words (optional step, depends on model requirements)
        :param text: Original text
        :param language: Language (currently supports English)
        :return: Text with stop words removed
        """
        if language != 'english':
            return text  # Can be extended to support other languages

        words = word_tokenize(text)
        filtered_words = [word for word in words if word.lower() not in self.stop_words]
        return ' '.join(filtered_words)

    def normalize_case(self, text: str, case: str = 'lower') -> str:
        """
        Normalize text case (usually convert to lowercase to reduce vocabulary size)
        :param text: Original text
        :param case: Target case ('lower' or 'upper')
        :return: Case-normalized text
        """
        if case == 'lower':
            return text.lower()
        elif case == 'upper':
            return text.upper()
        return text

    def filter_low_quality_texts(self, texts: List[str], quality_threshold: float = 0.3) -> Tuple[List[str], int]:
        """
        Filter low-quality texts (based on proportion of non-punctuation characters)
        :param texts: List of texts
        :param quality_threshold: Threshold for proportion of non-punctuation characters
        :return: Tuple of (filtered text list, number of low-quality texts removed)
        """
        filtered = []
        removed = 0

        for text in texts:
            if not text:
                removed += 1
                continue

            # Calculate ratio of non-punctuation characters
            total_chars = len(text)
            punctuation_chars = sum(1 for c in text if c in string.punctuation)
            non_punct_ratio = (total_chars - punctuation_chars) / total_chars

            if non_punct_ratio >= quality_threshold:
                filtered.append(text)
            else:
                removed += 1

        return filtered, removed

    def detect_language(self, text: str) -> str:
        """
        Simple language detection (based on character set)
        :param text: Input text
        :return: Language code ('en'/'zh'/'other')
        """
        # Detect Chinese characters
        if re.search(r'[\u4e00-\u9fff]', text):
            return 'zh'
        # Detect English characters
        elif re.search(r'[a-zA-Z]', text):
            return 'en'
        else:
            return 'other'

    def process_batch(self, texts: List[str], min_length: int = 10, quality_threshold: float = 0.3) -> Tuple[List[str], dict]:
        """
        Complete processing pipeline for batch text cleaning
        :param texts: Original list of texts
        :param min_length: Minimum length threshold
        :param quality_threshold: Quality threshold for filtering
        :return: Tuple of (cleaned text list, processing statistics)
        """
        stats = {
            'original_count': len(texts),
            'duplicates_removed': 0,
            'short_texts_removed': 0,
            'low_quality_removed': 0,
            'other_removed': 0,
            'final_count': 0
        }

        # 1. Remove duplicate texts
        unique_texts, duplicates = self.remove_duplicates(texts)
        stats['duplicates_removed'] = duplicates

        # 2. Filter out short texts
        filtered_length, short_removed = self.filter_short_texts(unique_texts, min_length)
        stats['short_texts_removed'] = short_removed

        # 3. Clean special characters and normalize
        cleaned = []
        for text in filtered_length:
            # Clean special characters
            text_clean = self.clean_special_characters(text)
            # Normalize case (for English)
            lang = self.detect_language(text_clean)
            if lang == 'en':
                text_clean = self.normalize_case(text_clean, 'lower')
            cleaned.append(text_clean)

        # 4. Filter low quality texts
        high_quality, low_quality_removed = self.filter_low_quality_texts(cleaned, quality_threshold)
        stats['low_quality_removed'] = low_quality_removed

        # 5. Final statistics
        stats['final_count'] = len(high_quality)
        stats['other_removed'] = stats['original_count'] - stats['final_count'] - sum([
            stats['duplicates_removed'],
            stats['short_texts_removed'],
            stats['low_quality_removed']
        ])

        return high_quality, stats


# Usage example
if __name__ == "__main__":
    # Sample data (simulating raw text read from files)
    raw_texts = [
        "Hello world! This is a sample text for LLM training.   ",
        "Hello world! This is a sample text for LLM training.   ",  # Duplicate text
        "Bad text!!!???",  # Low quality text (too many punctuation)
        "Short.",  # Excessively short text
        "https://example.com - Check this website!",  # Contains URL
        "<p>HTML tagged text</p>",  # Contains HTML tags
        "中文文本示例,测试多语言处理。",  # Chinese text example
        "Another example with   multiple   spaces and special chars: @#$%"
    ]

    # Initialize cleaner
    cleaner = LLMDataCleaner()

    # Process text batch
    cleaned_texts, stats = cleaner.process_batch(
        raw_texts,
        min_length=8,
        quality_threshold=0.5
    )

    # Output results
    print("Cleaning statistics:")
    for key, value in stats.items():
        print(f"{key}: {value}")

    print("\nCleaned texts:")
    for i, text in enumerate(cleaned_texts, 1):
        print(f"{i}. {text}")

The input texts in the example are stored in the list raw_texts and then passed through the process_batch cleaning function, which applies several cleaning steps in sequence:

removing duplicates, filtering out overly short texts, stripping URLs, HTML tags, and other special characters, converting English text to lowercase, and finally filtering low-quality texts.

Output results:

(py11) D:\github\crawler_data_processing>python llm_data_processing_en.py
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yagam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yagam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
Cleaning statistics:
original_count: 8
duplicates_removed: 1
short_texts_removed: 1
low_quality_removed: 0
other_removed: 0
final_count: 6

Cleaned texts:
1. hello world! this is a sample text for llm training.
2. bad text!!!???
3. - check this website!
4. html tagged text
5. 中文文本示例测试多语言处理
6. another example with multiple spaces and special chars
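
In a real pipeline, the cleaned texts and the statistics would normally be written to disk rather than only printed, so that each cleaning run is reproducible and comparable. Below is a minimal sketch of that step; the file names cleaned.jsonl and cleaning_stats.json are only placeholders, and it reuses the cleaned_texts and stats returned by process_batch above.

import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("llm_data_cleaning")

def save_cleaning_run(cleaned_texts, stats, data_path="cleaned.jsonl", stats_path="cleaning_stats.json"):
    """Persist one cleaned text per JSONL line plus the run statistics as JSON."""
    with open(data_path, "w", encoding="utf-8") as f:
        for text in cleaned_texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    with open(stats_path, "w", encoding="utf-8") as f:
        json.dump(stats, f, ensure_ascii=False, indent=2)
    logger.info("Saved %d cleaned texts to %s (stats in %s)", len(cleaned_texts), data_path, stats_path)

# Example: save_cleaning_run(cleaned_texts, stats)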

The above example covers basic NLP operations. Next, we use some longer sentences as more advanced test samples.

The cleaning tasks this time are:

Desensitize sensitive data such as phone numbers, emails, and ID card numbers. When such data is encountered, replace it with placeholder strings to protect users’ privacy. Otherwise, if a crawler captures user A’s phone number and it ends up in the training data, the model may leak user A’s data when another user asks for that phone number.

Then perform multi-language noise cleaning. For example, Chinese text often has English fragments mixed in; keep the meaningful ones (such as common abbreviations) and remove the meaningless ones.

Finally, split overly long texts (usually those exceeding the model’s maximum input length) into multiple shorter chunks with complete semantics, to avoid the information loss or truncation problems that overly long texts cause during model training or inference.

The split_long_text method is very important in large language model data preprocessing. It ensures that the input text meets the model’s length limit while retaining the semantic coherence of the text to the greatest extent.
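
Before looking at the full implementation, here is a minimal sketch of the chunking idea measured in the model tokenizer's own tokens rather than characters. The checkpoint name bert-base-uncased is only an example, and the function assumes the sentences have already been split out; the class below does that sentence splitting with spaCy instead.

from transformers import AutoTokenizer

# Example checkpoint; substitute the tokenizer of the model you actually train.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_sentences(sentences, max_tokens=512):
    """Greedily pack whole sentences into chunks that stay within max_tokens."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(tokenizer.encode(sent, add_special_tokens=False))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

# chunk_sentences(["First sentence.", "A second, longer sentence about the same topic."])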

import re
import spacy
import fasttext
from typing import List, Tuple
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from langdetect import detect, LangDetectException

# ----------------------
# Resource Initialization
# ----------------------
# Load NLP models (download first if needed)
# spacy download en_core_web_lg
# spacy download zh_core_web_lg
try:
    nlp_en = spacy.load("en_core_web_lg")  # For English NER and parsing
    nlp_zh = spacy.load("zh_core_web_lg")  # For Chinese NER
except OSError:  # spacy.load raises OSError when a model package is missing
    print("Warning: SpaCy models not found. Sensitive info detection may be limited.")
    nlp_en = None
    nlp_zh = None

# FastText for language detection (more accurate than regex)
# Download model: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
try:
    ft_model = fasttext.load_model('lid.176.bin')
except Exception:  # model file missing or unreadable
    print("Warning: FastText model not found. Using fallback language detection.")
    ft_model = None

# Load quality assessment model (assesses text coherence/information density)
quality_model_name = "microsoft/xtremedistil-l6-h384-uncased"
quality_tokenizer = AutoTokenizer.from_pretrained(quality_model_name)
quality_model = AutoModelForSequenceClassification.from_pretrained(
    quality_model_name, 
    num_labels=2  # 0: low quality, 1: high quality (fine-tuned on custom data)
)
quality_pipeline = pipeline(
    "text-classification",
    model=quality_model,
    tokenizer=quality_tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# ----------------------
# Advanced Cleaner Class
# ----------------------
class AdvancedLLMCleaner:
    def __init__(self):
        # Sensitive pattern database (extended)
        self.sensitive_patterns = {
            "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
            "phone": r'\b(?:\+?86)?1[3-9]\d{9}\b',  # Chinese phone
            "id_card": r'\b\d{17}[\dXx]\b',  # Chinese ID
            "credit_card": r'\b(?:\d{4}[-\s]?){3}\d{4}\b'
        }
        # Harmful keywords (example categories)
        self.harmful_keywords = {"violence", "discrimination", "hate", "terrorism"}

    def detect_language_advanced(self, text: str) -> Tuple[str, float]:
        """
        Advanced language detection with confidence score
        Returns (language code, confidence)
        """
        if not text.strip():
            return ("unknown", 0.0)

        try:
            if ft_model:
                # fastText's predict() rejects strings containing newlines, so collapse them first
                predictions = ft_model.predict(text.replace("\n", " ").strip(), k=1)
                lang = predictions[0][0].replace("__label__", "")
                confidence = predictions[1][0]
                return (lang, confidence)
            else:
                # Fallback to langdetect
                lang = detect(text)
                return (lang, 0.8)  # Assume lower confidence
        except (LangDetectException, IndexError):
            return ("unknown", 0.0)

    def split_long_text(self, text: str, lang: str = "en", max_tokens: int = 512) -> List[str]:
        """
        Split long text into semantic chunks (avoid splitting sentences)
        """
        nlp = nlp_en if lang == "en" else nlp_zh if lang == "zh" else None
        if not nlp or not text.strip():
            return [text]

        doc = nlp(text)
        sentences = [sent.text for sent in doc.sents]
        chunks = []
        current_chunk = []
        current_length = 0

        for sent in sentences:
            sent_tokens = len(nlp(sent))
            if current_length + sent_tokens <= max_tokens:
                current_chunk.append(sent)
                current_length += sent_tokens
            else:
                if current_chunk:
                    chunks.append(" ".join(current_chunk))
                current_chunk = [sent]
                current_length = sent_tokens

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks

    def filter_semantic_quality(self, texts: List[str], threshold: float = 0.7) -> List[str]:
        """
        Filter texts based on semantic quality (using pre-trained classifier)
        """
        high_quality = []
        # Batch processing to improve efficiency
        for i in range(0, len(texts), 32):
            batch = texts[i:i+32]
            results = quality_pipeline(batch)
            for text, res in zip(batch, results):
                if res["label"] == "LABEL_1" and res["score"] >= threshold:
                    high_quality.append(text)
        return high_quality

    def desensitize_text(self, text: str, lang: str = "en") -> str:
        """
        Deep desensitization: replace sensitive info with placeholders
        """
        # 1. Pattern-based replacement
        for name, pattern in self.sensitive_patterns.items():
            text = re.sub(pattern, f"[{name}_REDACTED]", text)

        # 2. NER-based replacement (names, addresses, organizations)
        nlp = nlp_en if lang == "en" else nlp_zh if lang == "zh" else None
        if nlp:
            doc = nlp(text)
            for ent in doc.ents:
                # Redact personal entities (customize based on your needs)
                if ent.label_ in ["PERSON", "GPE", "ORG", "DATE"]:  # GPE: countries/cities
                    text = text.replace(ent.text, f"[{ent.label_}_REDACTED]")

        # 3. Harmful content filtering
        for keyword in self.harmful_keywords:
            if keyword in text.lower():
                return ""  # Remove entirely if harmful content is found
        return text

    def remove_cross_lang_noise(self, text: str, primary_lang: str = None) -> str:
        """
        Remove mixed-language noise (e.g., English words in Chinese text with low info value)
        """
        if not primary_lang:
            primary_lang, _ = self.detect_language_advanced(text)
            if primary_lang == "unknown":
                return text

        # For Chinese text: remove English words with low semantic value
        if primary_lang == "zh":
            # Keep meaningful English terms (e.g., "AI", "GDP") but remove noise
            english_words = re.findall(r'[A-Za-z]+', text)
            for word in english_words:
                if len(word) < 3 and word.lower() not in {"ai", "it", "gdp"}:
                    text = text.replace(word, "")
        return text

# ---------------------
# Usage Example
# ----------------------
if __name__ == "__main__":
    cleaner = AdvancedLLMCleaner()

    # Example 1: Process a single long text
    long_text = """
    Dr. John Smith ([email protected]) delivered a speech on AI in Beijing. 
    He mentioned that 80% of data scientists use Python. His phone number is 13800138000.
    这是一段包含英文单词的中文文本,其中夹杂着一些 short 英文单词。
    """
    lang, _ = cleaner.detect_language_advanced(long_text)
    desensitized = cleaner.desensitize_text(long_text, lang)
    filtered = cleaner.remove_cross_lang_noise(desensitized, lang)
    chunks = cleaner.split_long_text(filtered, lang)
    print("Processed chunks:")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}: {chunk}")

    # Example 2: Semantic quality filtering
    sample_texts = [
        "Good morning! How are you?",  # High quality
        "Asdflkj qwerpoi 12345...",    # Low quality
        "The quick brown fox jumps over the lazy dog."  # High quality
    ]
    high_quality = cleaner.filter_semantic_quality(sample_texts)
    print("\nHigh quality texts after filtering:", high_quality)

The output of running the above code (with the spaCy and FastText models installed and a fine-tuned quality classifier) is:

Chunk 1: Dr. [PERSON_REDACTED] ([email_REDACTED]) delivered a speech on AI in [GPE_REDACTED]. He mentioned that 80% of data scientists use Python. His phone number is [phone_REDACTED].

Chunk 2: 这是一段包含英文单词的中文文本,其中夹杂着一些 英文单词。

High quality texts after filtering: ['Good morning! How are you?', 'The quick brown fox jumps over the lazy dog.']

After cleaning raw natural language full of messy information, we obtain clean, tidy data that can then be vectorized and fed to the trainer, providing high-quality corpus support for the LLM.
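
As a closing sketch of that last step, the cleaned texts might be tokenized (“vectorized”) like this before being handed to a trainer. The gpt2 checkpoint is only an example; in practice you would use the tokenizer of the model you actually intend to train.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint only
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

def vectorize(cleaned_texts, max_length=512):
    """Convert cleaned texts into padded token-ID tensors ready for a training loop."""
    return tokenizer(
        cleaned_texts,
        truncation=True,
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )

# batch = vectorize(["hello world! this is a sample text for llm training."])
# batch["input_ids"].shape -> torch.Size([1, 512])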