
Python Web Scraping: Using Elasticsearch for Data Cleaning

Learn Python web scraping with Elasticsearch for data cleaning. Index e-commerce product pages; analyze text with tokenization, stopwords, and stemming; define mappings and analyzers; and run keyword and fuzzy searches. Normalize, deduplicate, and structure product details, then expose results via an API to power scalable search with better indexing, relevance, and ranking.

2025-11-11

When performing web scraping for e-commerce data, developers often need to store and query large volumes of unstructured text.
In particular, product descriptions usually contain rich information that traditional SQL databases struggle to search efficiently.

For example, when scraping Amazon product pages, each product includes structured fields such as name and price.
At the same time, it also contains a long free-text description that explains features and usage scenarios.

Under the “About this item” section, all content appears as plain text rather than structured data.

If you store this data in a relational database and run a query for the keyword “usb-c led light”, problems quickly appear.
Although the product description clearly relates to this concept, the exact phrase may not exist in the text.

As a result, an SQL query based on exact matching would return no results, even though the product is relevant.
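
A minimal sketch illustrates the problem, assuming a hypothetical SQLite table of scraped products; the substring query comes back empty even though the product plainly matches the search intent:

import sqlite3

# Hypothetical relational store for scraped products (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, description TEXT)")
conn.execute(
    "INSERT INTO products VALUES (?, ?)",
    ("Desk Lamp", "LED desk light with a USB Type-C charging port and touch control"),
)

# Exact substring matching: the literal phrase "usb-c led light" never appears,
# so the relevant product is not returned at all.
rows = conn.execute(
    "SELECT name FROM products WHERE description LIKE ?",
    ("%usb-c led light%",),
).fetchall()
print(rows)  # []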


Why Elasticsearch Is Needed for Web Scraping Data Cleaning

To solve this limitation, developers typically introduce a text-oriented search engine database.
In practice, Elasticsearch has become the most widely adopted solution for this purpose.

Elasticsearch is a distributed search engine built on Apache Lucene.
It supports full-text search, numeric queries, range queries, and large-scale data retrieval with high performance.

Because modern web scraping pipelines generate massive datasets, Elasticsearch plays a crucial role in data cleaning, indexing, and fast querying.


How Elasticsearch Enables Efficient Text Search

Elasticsearch achieves fast retrieval through Lucene’s inverted index structure.
This structure maps terms to the documents that contain them, enabling efficient lookups.

Specifically, the inverted index consists of two main components: a term dictionary that stores every unique term extracted from the documents, and postings lists that record, for each term, which documents contain it (along with positions and frequencies).

When a query arrives, Elasticsearch splits the query into terms, looks them up in the dictionary, and then locates matching documents.
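
As a rough illustration of the idea (not how Lucene actually implements it), a toy inverted index can be built with a plain Python dictionary:

from collections import defaultdict

# Toy corpus: document ID -> text
docs = {
    1: "usb c led desk light",
    2: "wireless charger for phone",
    3: "rechargeable led camping light",
}

# Build the inverted index: term -> set of document IDs containing that term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

# Query: split into terms, look each one up, intersect the posting sets
query_terms = "led light".split()
matches = set.intersection(*(inverted_index[t] for t in query_terms if t in inverted_index))
print(matches)  # {1, 3}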


Query Execution Workflow in Elasticsearch

When a user or application sends a search request, Elasticsearch follows a multi-stage process.

First, the request reaches a coordinating node in the cluster.
Next, the coordinating node broadcasts the query to relevant shards and replicas.

Each shard executes the query locally and returns lightweight results, typically just document IDs and relevance scores.
Finally, the coordinating node merges and sorts these partial results, fetches the complete documents for the top hits, and responds to the client.

As a result, Elasticsearch delivers high-speed searches even on large datasets.


Installing Elasticsearch with Docker

To simplify deployment, this tutorial uses Linux + Docker.
This approach reduces configuration complexity and speeds up setup.

docker pull elasticsearch:8.10.4

mkdir -p /usr/local/elasticsearch/data /usr/local/elasticsearch/plugins
chmod 777 /usr/local/elasticsearch/data

Next, start the Elasticsearch container. Because Elasticsearch 8.x enables TLS and authentication by default, this local tutorial setup disables security so the service can be reached over plain HTTP:

docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  -v /usr/local/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /usr/local/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  elasticsearch:8.10.4

After starting the container, verify the installation:

curl http://localhost:9200

If Elasticsearch returns version information in JSON format, the installation succeeded.


Tokenizers and Text Analysis in Elasticsearch

When working with non-English text, Elasticsearch requires an appropriate tokenizer.
Tokenizers split text into searchable terms and standardize them for indexing.

Common tokenizer operations include splitting text into individual tokens, lowercasing, removing stopwords, and reducing words to their stems.

Elasticsearch supports multiple tokenizer and analyzer types, including the built-in standard, whitespace, and keyword tokenizers, plus plugin analyzers such as IK (for Chinese) and Pinyin.

In e-commerce scenarios, combining IK and Pinyin tokenizers significantly improves search accuracy.
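
To see these operations in practice, the _analyze API shows exactly how a given analyzer breaks text into terms. A quick sketch using the Python client (the es connection is initialized later in this article):

# Inspect how the built-in "english" analyzer tokenizes a product phrase:
# lowercasing, stopword removal, and stemming are all applied.
resp = es.indices.analyze(
    analyzer="english",
    text="USB-C LED desk lights for reading",
)
print([token["token"] for token in resp["tokens"]])
# Roughly: ['usb', 'c', 'led', 'desk', 'light', 'read']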


Installing the IK Tokenizer

To enable Chinese text analysis, install the IK tokenizer inside the container.

docker exec -it elasticsearch /bin/bash
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.10.4/elasticsearch-analysis-ik-8.10.4.zip
exit
docker restart elasticsearch

Once installed, Elasticsearch can properly tokenize Chinese product descriptions.
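
A quick way to confirm the plugin works is to run the same _analyze call with the ik_max_word analyzer on a Chinese phrase; the exact segmentation is illustrative:

# Verify the IK plugin by analyzing a Chinese product description.
# "ik_max_word" gives the finest-grained segmentation; "ik_smart" is coarser.
resp = es.indices.analyze(
    analyzer="ik_max_word",
    text="USB充电LED台灯",  # "USB rechargeable LED desk lamp"
)
print([token["token"] for token in resp["tokens"]])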


Scraping Product Data from Amazon

Next, define a list of product URLs to scrape.

product_urls = [
    "https://www.amazon.com/dp/B07VGRJDFY",
    "https://www.amazon.com/dp/B08N5WRWNW",
    "https://www.amazon.com/dp/B09V3KXJPB"
]

Then, iterate through each URL and extract product data.

for url in product_urls:
    product_data = scrape_amazon_product(url)
    if product_data:
        import_to_es(es, product_data, index_name)

The scraper extracts structured fields such as name, price, brand, and rating.
At the same time, it also captures unstructured text like product descriptions and features.
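
The scrape_amazon_product function itself is not shown in this article, so the sketch below is one possible shape for it, using requests and BeautifulSoup; the CSS selectors and User-Agent are assumptions, and real Amazon pages usually need stronger anti-bot handling, retries, and rate limiting:

import requests
from bs4 import BeautifulSoup

def scrape_amazon_product(url):
    """Fetch one product page and extract structured fields plus free text.

    Minimal sketch: selectors such as #productTitle and #feature-bullets are
    assumptions and can change; production code needs error handling and
    respectful request pacing.
    """
    headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/1.0)"}
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code != 200:
        return None

    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.select_one("#productTitle")
    price = soup.select_one("span.a-offscreen")
    bullets = [li.get_text(strip=True) for li in soup.select("#feature-bullets li")]

    return {
        "product_id": url.rstrip("/").split("/")[-1],  # the ASIN from /dp/<ASIN>
        "name": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
        "description": " ".join(bullets),  # the "About this item" free text
    }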


Creating an Elasticsearch Index and Mapping

Before inserting data, define a proper index mapping.
This step determines which fields use full-text analysis and which require exact matching.

"name": {
  "type": "text",
  "analyzer": "english_analyzer",
  "fields": {"keyword": {"type": "keyword"}}
}

Text fields use analyzers for tokenization.
In contrast, keyword fields support exact matches and aggregations.

This design directly affects search quality and performance.
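
Pulling this together, one possible way to create the full index from Python is sketched below; the field list and the english_analyzer definition (standard tokenizer plus lowercase, stop, and porter_stem filters) are assumptions consistent with the fragment above:

index_name = "amazon_products"

index_settings = {
    "analysis": {
        "analyzer": {
            # Assumed definition of the "english_analyzer" referenced above
            "english_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop", "porter_stem"],
            }
        }
    }
}

index_mappings = {
    "properties": {
        "name": {
            "type": "text",
            "analyzer": "english_analyzer",
            "fields": {"keyword": {"type": "keyword"}},
        },
        "description": {"type": "text", "analyzer": "english_analyzer"},
        "brand": {"type": "keyword"},
        "price": {"type": "keyword"},
        "rating": {"type": "float"},
    }
}

# Create the index only if it does not already exist
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, settings=index_settings, mappings=index_mappings)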


Writing Data to Elasticsearch

To store each product, call the es.index() method. With the 8.x Python client, the document is passed via the document parameter rather than the deprecated body:

es.index(
    index=index_name,
    id=product_data['product_id'],
    document=product_data
)

By using the product ID as the document ID, the system avoids duplicate entries.
As a result, Elasticsearch maintains clean and consistent data.
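
The import_to_es helper referenced in the scraping loop is not spelled out either; a minimal sketch that wraps the call above might look like this:

def import_to_es(es, product_data, index_name):
    """Index one scraped product, keyed by its product ID.

    Re-scraping the same product overwrites the existing document instead of
    creating a duplicate, which keeps the index clean across repeated runs.
    """
    es.index(
        index=index_name,
        id=product_data["product_id"],
        document=product_data,
    )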


Initializing the Elasticsearch Client

All of the snippets above assume an Elasticsearch client; initialize the connection once, before scraping begins:

es = Elasticsearch(["http://localhost:9200"])

Once connected, Elasticsearch can perform full-text searches, fuzzy queries, and relevance ranking.
This capability makes it ideal for web scraping data cleaning and retrieval.
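
As a preview of the kind of retrieval this enables, the sketch below runs a full-text match query with fuzzy matching against the description field; the field names follow the mapping assumed earlier:

# Full-text search with fuzziness, so near-misses such as "usbc led ligth"
# can still match relevant products.
query = {
    "match": {
        "description": {
            "query": "usb-c led light",
            "fuzziness": "AUTO",
        }
    }
}

resp = es.search(index=index_name, query=query, size=5)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))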


Summary

In this tutorial, you learned how to use Elasticsearch for data cleaning in Python web scraping projects.
Specifically, you saw how Elasticsearch handles unstructured text, enables flexible keyword search, and improves retrieval accuracy.

In the next article, we will build an API service that queries product data stored in Elasticsearch and returns structured search results.