In the previous article, we introduced text similarity detection techniques based on Levenshtein distance and TF-IDF with cosine similarity.
In this article, we move forward and explore semantic-level text similarity, which plays a critical role in web scraping data cleaning, deduplication, and downstream model fine-tuning.
Sentence-Transformers for Semantic Text Similarity
Sentence-transformers map sentences, paragraphs, or even long documents into a high-dimensional vector space.
In this vector space, semantically similar texts stay closer together, which allows systems to compute similarity efficiently using vector operations such as cosine similarity.
Sentence-transformers build on top of pre-trained language models like BERT and MiniLM. The framework optimizes these models specifically for sentence-level semantic understanding.

How Sentence-Transformers Encode Text
1. Text Encoding with Pre-trained Language Models
Sentence-transformers reuse language models that have already been pre-trained on massive text corpora.
These base models understand vocabulary, grammar, and general semantics before any task-specific training begins.
The encoding process works as follows:
- The tokenizer splits input text into subwords.
- The pre-trained model processes these tokens.
- The system outputs a fixed-length embedding vector.
For example, the sentence:
“I love natural language processing”
produces a vector with 384 or 768 dimensions, depending on the selected model.
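As a quick sanity check, the snippet below encodes that sentence and prints the embedding shape. It is a minimal sketch that assumes the all-MiniLM-L6-v2 model, which outputs 384-dimensional vectors; any other sentence-transformers checkpoint works the same way.
from sentence_transformers import SentenceTransformer

# Assumption: all-MiniLM-L6-v2 is chosen purely for illustration (384-dimensional output)
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("I love natural language processing")
print(embedding.shape)  # (384,) — a single fixed-length sentence vector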
2. Sentence-Level Semantic Pooling
Pre-trained models generate vectors for each token, not for the entire sentence.
Sentence-transformers combine token vectors into a single sentence embedding using pooling strategies.
Common pooling strategies include:
- Mean Pooling – averages all token vectors and balances semantic contribution
- Max Pooling – highlights the most salient features
- CLS Pooling – uses the [CLS] token representation (model-dependent)
Mean pooling remains the most widely used approach in production systems.
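To make the pooling step concrete, here is a minimal sketch of mean pooling written directly on top of the Hugging Face transformers library. The checkpoint name and the manual pooling code are illustrative assumptions; sentence-transformers performs this step internally when you call encode.
import torch
from transformers import AutoTokenizer, AutoModel

# Assumption: any BERT-style encoder works; this checkpoint is used only for illustration
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

encoded = tokenizer(["I love natural language processing"],
                    padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, tokens, dim)

# Mean pooling: average the token vectors, ignoring padding positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 384])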
3. Fine-Tuning for Semantic Relevance
Sentence-transformers improve semantic alignment through fine-tuning on labeled sentence pairs.
During training, the system:
- Pulls semantically similar sentences closer together
- Pushes unrelated sentences farther apart
Common optimization strategies include:
- Contrastive Learning
- Triplet Loss
After fine-tuning, the model handles tasks such as synonym detection, paraphrase identification, and semantic equivalence more accurately.
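The pull/push intuition behind contrastive learning can be expressed in a few lines. The following is a conceptual sketch, not the library's exact implementation: similar pairs are penalized for being far apart, and dissimilar pairs are penalized for falling inside a chosen margin (the margin value here is an arbitrary assumption).
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=0.5):
    # label = 1 for similar pairs (pulled together), 0 for dissimilar pairs (pushed apart)
    distance = 1 - F.cosine_similarity(emb1, emb2)
    pull = label * distance.pow(2)                          # penalize distance between similar pairs
    push = (1 - label) * F.relu(margin - distance).pow(2)   # penalize dissimilar pairs inside the margin
    return (pull + push).mean()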
4. Similarity Calculation
After generating embeddings, the system calculates similarity using cosine similarity or Euclidean distance.
A cosine similarity score closer to 1 indicates stronger semantic similarity.
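As a reference for how the score behaves, here is a minimal NumPy sketch of cosine similarity; the vectors are toy values, not real embeddings.
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a · b) / (||a|| * ||b||); 1 = same direction, 0 = orthogonal, -1 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.1])))  # close to 1
print(cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # 0.0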
Installing Sentence-Transformers
Install the library with pip:
pip install sentence-transformers
The first execution downloads the selected model automatically from Hugging Face.
Basic Example: Semantic Similarity Detection
Test Sentences
text_a = "Artificial intelligence is transforming modern society through automation and data analysis."
text_b = "Machine learning algorithms are changing contemporary culture by automating processes and analyzing information."
text_c = "Climate change affects global weather patterns and requires immediate environmental action."
Similarity Calculation Code
from sentence_transformers import SentenceTransformer, util

# Path to a local copy of the model; a model name such as "all-MiniLM-L6-v2"
# also works and is downloaded from Hugging Face on first use.
MODEL_PATH = "./local-models/all-MiniLM-L6-v2"

def calculate_similarity(text1, text2, model):
    embedding1 = model.encode(text1, convert_to_tensor=True)
    embedding2 = model.encode(text2, convert_to_tensor=True)
    return util.cos_sim(embedding1, embedding2).item()

def main():
    model = SentenceTransformer(MODEL_PATH)
    # text_a, text_b, text_c are defined in the Test Sentences section above
    sim_ab = calculate_similarity(text_a, text_b, model)
    sim_ac = calculate_similarity(text_a, text_c, model)
    print(f"Similarity A–B: {sim_ab:.4f}")
    print(f"Similarity A–C: {sim_ac:.4f}")

if __name__ == "__main__":
    main()
Output Interpretation
Similarity A–B: 0.6630
Similarity A–C: 0.1917
The system correctly identifies that A and B share moderate semantic similarity, while A and C are largely unrelated.
Why Fine-Tune Sentence-Transformers?
General-purpose models learn broad semantic rules, but domain-specific tasks require specialization.
Fine-tuning improves performance in scenarios such as:
- Medical document similarity
- Legal text comparison
- E-commerce review deduplication
- Customer support question matching
Empirical results often show 10%–30% performance gains after fine-tuning.
Fine-Tuning Workflow Overview
1. Data Preparation
Prepare labeled sentence pairs with similarity scores between 0 and 1.
High-quality annotations matter more than raw volume.
2. Base Model Selection
Recommended options:
- Chinese: uer/sbert-base-chinese-nli
- Multilingual: paraphrase-multilingual-MiniLM-L12-v2
- English: all-MiniLM-L6-v2
3. Loss Function Selection
- CosineSimilarityLoss – regression-style similarity
- ContrastiveLoss – binary similarity classification
- TripletLoss – anchor-positive-negative learning
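The three losses expect differently shaped training examples. The snippet below is an illustrative sketch using made-up sentences to show the expected InputExample format for each loss.
from sentence_transformers import InputExample

# CosineSimilarityLoss: sentence pair + continuous similarity score in [0, 1]
cosine_example = InputExample(texts=["AI helps doctors", "AI supports physicians"], label=0.9)

# ContrastiveLoss: sentence pair + binary label (1 = similar, 0 = dissimilar)
contrastive_example = InputExample(texts=["AI helps doctors", "Stock prices fell today"], label=0)

# TripletLoss: anchor, positive, negative — no explicit label required
triplet_example = InputExample(texts=["AI helps doctors", "AI supports physicians", "Stock prices fell today"])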
4. Training Configuration
- batch_size: 2–32
- num_epochs: 10–50
- warmup_steps: ~10% of total steps
Fine-Tuning Example Code
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import pandas as pd
import random

# Labeled sentence pairs with similarity scores in [0, 1]
data = [
    {"sentence1": "Artificial intelligence is transforming healthcare", "sentence2": "AI is revolutionizing medical services", "score": 0.91},
    {"sentence1": "Natural language processing enables chatbots", "sentence2": "NLP powers conversational AI systems", "score": 0.93},
]
df = pd.DataFrame(data)

train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["score"])
    for _, row in df.iterrows()
]
random.shuffle(train_examples)

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("all-MiniLM-L6-v2")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    warmup_steps=10,
    output_path="./fine-tuned-all-MiniLM-L6-v2",
)
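Once training finishes, the fine-tuned model can be loaded from output_path and used exactly like the base model. The sentences below are taken from the toy training data above, so the printed score is only illustrative.
from sentence_transformers import SentenceTransformer, util

fine_tuned = SentenceTransformer("./fine-tuned-all-MiniLM-L6-v2")
score = util.cos_sim(
    fine_tuned.encode("Artificial intelligence is transforming healthcare", convert_to_tensor=True),
    fine_tuned.encode("AI is revolutionizing medical services", convert_to_tensor=True),
).item()
print(f"Fine-tuned similarity: {score:.4f}")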
Final Notes
- Sentence-transformers handle semantic similarity far beyond lexical overlap
- Fine-tuning aligns models with real-world domain requirements
- GPU acceleration significantly improves training speed