
Text Similarity Detection in Web Scraping Data Cleaning (Part 1)

Learn practical text similarity for web scraping data cleaning. Detect near-duplicate news articles using Levenshtein distance and TF-IDF cosine similarity, with steps, pros and cons, and Python scikit-learn code. Improve deduplication, relevance, and crawl efficiency by measuring similarity between documents to filter repeats and consolidate content across sites.

2025-11-11

Text similarity detection in web scraping data cleaning plays a critical role when collecting large-scale news and content data from multiple sources. Different websites often publish the same story or rewrite the same topic using slightly different wording.

To maintain data quality, scraping systems must determine whether two articles are similar. When similarity is high, the system can group articles under one topic or discard near-duplicate content. This article introduces two widely used approaches for text similarity detection in web scraping data cleaning, explains their principles, and analyzes their limitations.


Levenshtein Distance

Levenshtein distance (edit distance) measures how different two strings are. The algorithm counts the minimum number of single-character edits required to transform one string into the other.

The algorithm supports three single-character operations:

  1. Insertion: add a character.
  2. Deletion: remove a character.
  3. Substitution: replace one character with another.

Levenshtein Distance Example

Consider the strings “kitten” and “sitting”.

  1. “kitten” → “sitten” (replace “k” with “s”)
  2. “sitten” → “sittin” (replace “e” with “i”)
  3. “sittin” → “sitting” (insert “g”)

The process needs 3 operations, so the Levenshtein distance equals 3.
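
The distance is normally computed with dynamic programming: a table where each cell holds the distance between a prefix of one string and a prefix of the other. A minimal sketch is shown below; python-Levenshtein ships an optimized C implementation, so this version is only for illustration.

def levenshtein(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete every character of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert every character of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dp[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3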


Python Library: python-Levenshtein

A popular Python implementation is available as the python-Levenshtein package.

Install it with:

pip install python-Levenshtein

Now compare two sentences and compute a normalized similarity score.

Test Inputs

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick brown fox jumps over the sleepy dog"

Example Code

from Levenshtein import distance  # install python-Levenshtein first

def text_similarity_simple(text1, text2):
    # Normalize the edit distance by the length of the longer string,
    # so the score falls between 0 (completely different) and 1 (identical)
    edit_dist = distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - edit_dist / max_len if max_len > 0 else 1.0

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The quick brown fox jumps over the sleepy dog"

print(f"Similarity between A and B: {text_similarity_simple(text_a, text_b):.2f}")  # ~0.91

The code outputs a similarity score of about 0.91, so the two sentences look very similar at the character level.


Why Levenshtein Distance Fails on Semantics

Levenshtein distance works well for spelling correction and simple string matching. However, the algorithm only looks at surface characters and ignores meaning, which causes problems with long texts, semantically complex content, and domain-specific wording.

1) The algorithm ignores semantic meaning

Levenshtein distance counts edit operations. It does not evaluate what words mean.

This behavior creates two common errors:

  1. Texts that look alike but mean different things receive a high score (a false match).
  2. Texts that mean the same thing but use different wording receive a low score (a missed match).

Example: compare the words “night” and “light”.

The two words differ by one character, so the algorithm reports high similarity (0.8). In practice, the meanings have no relationship.
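
Reusing the text_similarity_simple helper from the previous section makes both failure modes easy to reproduce; the word pairs below are arbitrary illustrations.

from Levenshtein import distance

def text_similarity_simple(text1, text2):
    edit_dist = distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - edit_dist / max_len if max_len > 0 else 1.0

# Looks similar, means something different: one substitution in five characters
print(f"{text_similarity_simple('night', 'light'):.2f}")  # 0.80

# Means roughly the same, looks different: the score collapses
print(f"{text_similarity_simple('big', 'large'):.2f}")    # 0.20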


TF-IDF + Cosine Similarity (Word Frequency Matching)

Many data cleaning pipelines also use TF-IDF + cosine similarity to compare sentences or articles.

This approach uses two steps:

  1. TF-IDF assigns weights to words based on importance.
  2. Cosine similarity compares two TF-IDF vectors and returns a score between 0 and 1.

How TF-IDF Works

TF-IDF calculates a weight for each word by multiplying two factors:

  1. TF (term frequency): how often the word appears in the current document.
  2. IDF (inverse document frequency): how rare the word is across the whole corpus.

Step 1: TF (Term Frequency)

TF measures frequency inside one document.

A simple formula looks like this:

TF(t, d) = count(t in d) / total_words(d)

Example:

Document: d = "The cat chases the mouse"
Tokens (lowercased): ["the", "cat", "chases", "the", "mouse"]

TF("the", d) = 2 / 5 = 0.4
TF("cat", d) = 1 / 5 = 0.2

Step 2: IDF (Inverse Document Frequency)

TF alone overvalues common words such as “the” or “is”. IDF reduces their impact and highlights rare words.

A common IDF formula looks like this:

IDF(t) = log( TotalDocs / (DocsContaining(t) + 1) )

Example with 1000 documents: suppose “the” appears in 999 of them, while “algorithm” appears in only 9. Using a base-10 logarithm, IDF(“the”) = log(1000 / 1000) = 0, while IDF(“algorithm”) = log(1000 / 10) = 2. The common word is pushed toward zero; the rare word keeps a large weight.
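
A minimal sketch of the same calculation; the document counts are invented, and the base-10 logarithm is an assumption since the formula above does not fix a base.

import math

total_docs = 1000
docs_containing = {"the": 999, "algorithm": 9}  # hypothetical document counts

def idf(term):
    # IDF(t) = log( TotalDocs / (DocsContaining(t) + 1) ), base 10 here
    return math.log10(total_docs / (docs_containing[term] + 1))

print(idf("the"))        # 0.0
print(idf("algorithm"))  # 2.0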

Final TF-IDF Weight

TF-IDF(t, d) = TF(t, d) × IDF(t)


How Cosine Similarity Works

Cosine similarity compares two vectors.

For high-dimensional vectors, the formula is:

cos(θ) = (A · B) / (||A|| × ||B||)
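
Applied to two toy vectors with NumPy, the calculation looks like this; the vectors below are arbitrary examples, not real TF-IDF weights.

import numpy as np

def cosine(a, b):
    # dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])
print(round(cosine(a, b), 2))  # 0.5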


Example: TF-IDF Similarity for Two Sentences

Python Test Code

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(sentences):
    # Build TF-IDF vectors for every sentence, removing English stop words
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_matrix = vectorizer.fit_transform(sentences)
    feature_names = vectorizer.get_feature_names_out()
    # Compare the first sentence against all the others
    similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:]).flatten()
    return similarities, feature_names, tfidf_matrix

sentences = [
    "I love reading books",
    "I enjoy reading novels",
]

similarities, features, tfidf_matrix = calculate_similarity(sentences)

print(f"Base sentence: {sentences[0]}\n")
for i, similarity in enumerate(similarities, 1):
    print(f"Sentence {i}: {sentences[i]}")
    print(f"Similarity score: {similarity:.4f}\n")

print("Key words from base sentence (with significant TF-IDF weights):")
base_vector = tfidf_matrix[0].toarray()[0]
top_indices = base_vector.argsort()[-5:][::-1]
for idx in top_indices:
    print(f"- {features[idx]}: {base_vector[idx]:.4f}")

Output

Base sentence: I love reading books

Sentence 1: I enjoy reading novels
Similarity score: 0.2020

Key words from base sentence (with significant TF-IDF weights):
- love: 0.6317
- books: 0.6317
- reading: 0.4494
- novels: 0.0000
- enjoy: 0.0000

The similarity score stays low because TF-IDF treats “love” and “enjoy” as unrelated words. It also treats “books” and “novels” as unrelated. Without synonym handling or semantic embeddings, TF-IDF cannot recognize the similarity in meaning.


Limitations of TF-IDF + Cosine Similarity

TF-IDF focuses on word statistics, not meaning. This design produces several weaknesses:

  1. It ignores word order, so sentences built from the same words but carrying different meanings look identical.
  2. It treats synonyms and paraphrases as unrelated words, so texts with similar meaning score low.
  3. It reacts strongly to rare terms, which can dominate the vector.

For example, these two sentences contain the same words but express opposite meaning:

  1. “The dog bites the man.”
  2. “The man bites the dog.”

TF-IDF often reports high similarity because it mostly compares word overlap, as the check below shows.
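
A quick check with the same scikit-learn tools used earlier demonstrates the effect on these two sentences.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The dog bites the man",
    "The man bites the dog",
]

# Both sentences share the same vocabulary and the same word counts,
# so their TF-IDF vectors are identical even though the meanings differ.
tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
print(f"Similarity: {score:.2f}")  # 1.00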

TF-IDF also reacts strongly to rare words. In a sports corpus, a rare player name may dominate the vector. That effect can reduce similarity even when two articles discuss the same event.


When These Methods Still Help

Despite their limitations, Levenshtein distance and TF-IDF + cosine similarity remain popular in large-scale web scraping data cleaning.

Teams choose them because they:

  1. run quickly and scale to millions of scraped documents,
  2. require no training data, labels, or specialized hardware,
  3. are simple to implement, tune, and debug.

What’s Next

In the next tutorial, we will cover more advanced text similarity detection methods that handle semantics more reliably.