Data cleaning for Python web scraping does not end when you remove duplicates or fix formats. In fact, the final step is just as important: you must choose the right storage system for the cleaned data. Otherwise, you may collect high-quality data and still struggle to query, scale, or analyze it.
In this last chapter, we focus on storage selection schemes for different scenarios. You will learn when to use SQL databases, when to use MongoDB, and when Elasticsearch becomes the better choice for text retrieval.
Method 10: Choose the Right Storage for Cleaned Scraping Data
After Python web scraping data cleaning, your data usually falls into one of these three categories:
- Structured data that needs fast, frequent queries
- Semi-structured or unstructured JSON with changing fields
- Long text that needs semantic search, fuzzy match, and ranking
Therefore, you should match the storage engine to the primary usage, not to personal preference.
10.1 Structured data for high-frequency queries (SQL)
When your crawler collects e-commerce product data—such as product names, prices, categories, sales volume, reviews, and manufacturer fields—you typically serve it through Web APIs and dashboards. In that case, you should store the cleaned data in an SQL database, such as PostgreSQL or MySQL.
Moreover, SQL databases support:
- strong indexing,
- joins across tables,
- transactions for reliable writes,
- familiar query patterns for backend development.
Python CRUD with PostgreSQL (example)
Install the driver:
pip install psycopg2
Create a table and insert records:
import psycopg2
from psycopg2 import OperationalError

def create_connection(db_name, db_user, db_password, db_host, db_port):
    try:
        return psycopg2.connect(
            database=db_name,
            user=db_user,
            password=db_password,
            host=db_host,
            port=db_port,
        )
    except OperationalError as e:
        print(f"Connection error: {e}")
        return None

def execute_query(connection, query):
    connection.autocommit = True
    with connection.cursor() as cursor:
        cursor.execute(query)

if __name__ == "__main__":
    conn = create_connection("your_database", "your_user", "your_password", "localhost", "5432")
    if not conn:
        raise SystemExit("DB connection failed")

    create_users_table = """
    CREATE TABLE IF NOT EXISTS users (
        id SERIAL PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER,
        gender TEXT,
        nationality TEXT
    );
    """
    execute_query(conn, create_users_table)

    insert_users = """
    INSERT INTO users (name, age, gender, nationality)
    VALUES
        ('James', 25, 'male', 'USA'),
        ('Leila', 32, 'female', 'France'),
        ('Brigitte', 35, 'female', 'England'),
        ('Mike', 40, 'male', 'Denmark'),
        ('Elizabeth', 21, 'female', 'Canada');
    """
    execute_query(conn, insert_users)
    conn.close()
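The heading above promises CRUD, but the snippet only covers table creation and inserts. As a minimal sketch of the read side, reusing the helpers above (the index name and query values are illustrative), you can add an index and run a parameterized SELECT:

# Reuses create_connection and execute_query from the snippet above.
conn = create_connection("your_database", "your_user", "your_password", "localhost", "5432")

# Index the column you filter on most often (index name chosen here for illustration).
execute_query(conn, "CREATE INDEX IF NOT EXISTS idx_users_age ON users (age);")

with conn.cursor() as cursor:
    # Parameterized query: psycopg2 fills in %s safely, no manual string formatting.
    cursor.execute("SELECT name, age FROM users WHERE age > %s;", (30,))
    for name, age in cursor.fetchall():
        print(name, age)

conn.close()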
When indexes are not enough: partitioning
As data grows, a single table can become too large. At that point, you can partition for performance and manageability. Generally, you can choose:
- Vertical partitioning: split by business domain or split wide tables into multiple tables
- Horizontal partitioning: split rows across shards (hash or range)
Vertical partitioning (concept)
[Single DB]
- User table
- Order table
- Product table
- Log table
[Split by domain]
[User DB] [Order DB] [Product DB] [Log DB]
Before the split, you often query across these tables with JOINs; after splitting by domain into separate databases, such cross-database JOINs have to move into application code or be avoided by duplicating data. A typical pre-split query:
SELECT u.name, o.product_name, o.amount
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE u.user_id = 1;
Horizontal partitioning (hash example)
def get_shard(user_id: int, total_shards: int = 4) -> int:
    return user_id % total_shards
You can also use range sharding, for example by user_id ranges or by time buckets, if your queries naturally filter by ranges.
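A rough sketch of both variants, with made-up boundaries and shard names purely for illustration:

from datetime import datetime

# Hypothetical shards: (exclusive upper bound on user_id, shard name).
USER_ID_RANGES = [(1_000_000, "users_0"), (2_000_000, "users_1"), (3_000_000, "users_2")]

def get_range_shard(user_id: int) -> str:
    # Route each user_id to the first range whose upper bound it falls under.
    for upper_bound, shard in USER_ID_RANGES:
        if user_id < upper_bound:
            return shard
    return "users_overflow"  # catch-all for ids beyond the last boundary

def get_time_shard(ts: datetime) -> str:
    # One bucket per month, e.g. "events_2025_06".
    return f"events_{ts.year}_{ts.month:02d}"

print(get_range_shard(1_500_000))             # users_1
print(get_time_shard(datetime(2025, 6, 15)))  # events_2025_06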
10.2 Semi-structured and dynamic JSON data (MongoDB)
If your cleaned scraping output contains nested JSON and dynamic fields, SQL schemas can slow you down. For example, logs or request parameters often vary across events:
{
  "timestamp": "2025-06-15T10:30:00",
  "user_id": 123,
  "action": "upload",
  "request_params": {
    "file_type": "jpg",
    "resolution": "1920x1080",
    "metadata": {"camera": "iPhone 15"}
  }
}
In this case, MongoDB fits well because:
- you do not need a fixed schema,
- you can store nested documents directly,
- you can query nested fields with dot notation, as shown in the sketch below.
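A minimal sketch of dot-notation queries, assuming the event documents shown above are stored in a collection named events (the database and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["mydatabase"]["events"]  # collection name assumed for this example

# Dot notation reaches into nested documents without unpacking them first.
for doc in events.find({"request_params.file_type": "jpg"}):
    print(doc["user_id"], doc["action"])

# Deeper levels chain the same way.
count = events.count_documents({"request_params.metadata.camera": "iPhone 15"})
print("iPhone 15 uploads:", count)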
Python connect/insert/query with MongoDB
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

def connect_to_mongodb():
    client = MongoClient("mongodb://localhost:27017/")
    try:
        client.admin.command("ping")
        return client["mydatabase"]
    except ConnectionFailure as e:
        raise RuntimeError(f"MongoDB connection failed: {e}")

def insert_and_query(db):
    col = db["users"]
    col.insert_one({
        "name": "Alice",
        "age": 30,
        "email": "[email protected]",
        "interests": ["reading", "swimming"]
    })
    col.insert_many([
        {"name": "Bob", "age": 25, "email": "[email protected]", "interests": ["hiking"]},
        {"name": "Charlie", "age": 35, "email": "[email protected]", "interests": ["reading", "photography"]},
    ])
    for doc in col.find({"age": {"$gt": 30}}):
        print("age>30:", doc["name"])
    for doc in col.find({"interests": "reading"}):
        print("likes reading:", doc["name"])

if __name__ == "__main__":
    db = connect_to_mongodb()
    insert_and_query(db)
Regex text search (literal matching)
MongoDB can search text with regex:
{"content": {"$regex": "How to use a proxy IP"}}
However, regex matches strings, not meaning. Therefore, if you need fuzzy matching, synonyms, and relevance ranking, you should consider a search engine database.
10.3 Full-text retrieval and fuzzy matching (Elasticsearch)
When your cleaned scraping data includes long text fields (articles, reviews, product descriptions), you often want:
- full-text search with tokenization,
- fuzzy matching for typos,
- ranking by relevance score,
- fast filtering + aggregations for analysis.
Elasticsearch solves these problems with an inverted index and a search DSL. As a result, it can return relevant content even when the stored text does not match the query exactly, using tokenization, fuzziness, and relevance scoring.
Index mapping and data insertion (Python)
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=["http://localhost:9200"])
index_name = "products"

mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "description": {"type": "text"},
            "category": {"type": "keyword"},
            "brand": {"type": "keyword"}
        }
    }
}

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=mapping)

docs = [
    {"name": "Wireless Earbuds", "description": "Wireless earbuds with noise cancellation", "category": "Audio", "brand": "SoundWave"},
    {"name": "Bluetooth Headphones", "description": "Over-ear headphones with long battery life", "category": "Audio", "brand": "AudioMax"},
    {"name": "Smartwatch Pro", "description": "Smartwatch with fitness tracking", "category": "Wearables", "brand": "TechGiant"},
    {"name": "Laptop Charger", "description": "Fast charging adapter for laptops", "category": "Accessories", "brand": "PowerPlus"},
]

for d in docs:
    es.index(index=index_name, document=d)

es.indices.refresh(index=index_name)
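For larger scraped datasets, indexing one document per request gets slow; the client's bulk helper sends many documents in a single request. A sketch using the same docs list from above:

from elasticsearch import helpers

# Wrap each document in a bulk action targeting the products index.
actions = [{"_index": index_name, "_source": d} for d in docs]
helpers.bulk(es, actions)
es.indices.refresh(index=index_name)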
Common query types (practical examples)
1) Fuzzy search (typo tolerance)
fuzzy_query = {
    "query": {
        "fuzzy": {
            "name": {
                "value": "wirless",
                "fuzziness": 1
            }
        }
    }
}
results = es.search(index=index_name, body=fuzzy_query)
2) Prefix search
prefix_query = {
    "query": {
        "prefix": {
            "brand": {
                # "brand" is a keyword field, so prefix matching is case-sensitive.
                "value": "Audio"
            }
        }
    }
}
results = es.search(index=index_name, body=prefix_query)
3) Wildcard search
wildcard_query = {
    "query": {
        "wildcard": {
            "name": {
                "value": "smart*"
            }
        }
    }
}
results = es.search(index=index_name, body=wildcard_query)
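The three queries above cover typo tolerance and pattern matching, but not the tokenized full-text search, relevance ranking, and aggregations listed at the start of this section. A minimal sketch of both, using the fields from the mapping above:

# Full-text search on the analyzed "description" field; hits come back
# ranked by relevance (_score), highest first.
match_query = {
    "query": {
        "match": {
            "description": {
                "query": "wireless noise cancellation",
                "fuzziness": "AUTO"  # also tolerates small typos
            }
        }
    }
}
results = es.search(index=index_name, body=match_query)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])

# Keyword fields support exact filters and aggregations, e.g. document counts per category.
agg_query = {"size": 0, "aggs": {"per_category": {"terms": {"field": "category"}}}}
agg_results = es.search(index=index_name, body=agg_query)
for bucket in agg_results["aggregations"]["per_category"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])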
Quick decision guide
Use this checklist after Python web scraping data cleaning:
- Choose PostgreSQL/MySQL when you need structured tables, strong constraints, and high-frequency API queries.
- Choose MongoDB when fields change often, documents are nested, and schema flexibility matters.
- Choose Elasticsearch when you need full-text search, fuzzy matching, synonyms, and relevance ranking.
Conclusion
Efficient cleaning is only half the job. After Python web scraping data cleaning, storage decides whether you can query, scale, and analyze smoothly. Therefore, pick the database based on how you plan to use the data: SQL for structured querying, MongoDB for dynamic documents, and Elasticsearch for text retrieval and fuzzy search.