
pandas Data Cleaning for Web Scraping: From HTML Tables to Clean Datasets

Master pandas for efficient data cleaning: handle missing values (dropna(), fillna()), remove duplicates (drop_duplicates()), convert data types (astype(), to_datetime()), and clean strings (str.replace()). Learn advanced merging (merge(), concat()) and regex extraction for polished datasets.

2025-10-04

pandas data cleaning for web scraping is one of the fastest ways to turn messy crawler outputs into analysis-ready datasets. In this chapter, you will learn how to extract tables from web pages, convert text fields into numeric values, handle missing and duplicate rows, and merge multiple datasets efficiently.

Before we start, note that each tutorial in this series is independent, so you can follow this chapter on its own. If you want the full learning path, start with the earlier chapters of the series.

Outbound references (authoritative docs):

read_html docs: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

pandas docs: https://pandas.pydata.org/docs/

9. pandas data processing

pandas is one of the most practical libraries in the Python ecosystem for cleaning, transforming, and merging crawler data. Therefore, if you master pandas well, you will process data faster and more consistently than relying on spreadsheets.

Install pandas

pip install pandas

9.1 Extract HTML tables with read_html

In many scraping scenarios, the page already contains structured tables. Instead of manually parsing HTML, you can let pandas extract tables directly.

For example, you can read tables from Wikipedia:

import pandas as pd

tables = pd.read_html("https://en.wikipedia.org/wiki/Minnesota")
print(f"Total tables: {len(tables)}")

If you only want a specific table, you can match by table title text:

tables = pd.read_html(
    "https://en.wikipedia.org/wiki/Minnesota",
    match="Election results from statewide races"
)
df = tables[0]
df.head()

Next, check column types:

df.info()

9.2 Convert percentage strings to numeric values

Often, scraped numbers come with symbols such as %. As a result, direct conversion will fail:

df["GOP"].astype("float")

You can remove % first and then convert:

df["GOP"].replace({"%": ""}, regex=True).astype("float")

If you want to convert multiple columns at once, remove % globally and then apply numeric conversion:

df = df.replace({"%": ""}, regex=True)
df[["GOP", "DFL", "Others"]] = df[["GOP", "DFL", "Others"]].apply(pd.to_numeric)
df.info()

This step is typical in pandas data cleaning for web scraping, because websites often format numbers for humans rather than for analysis.
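
The same idea extends to other human-oriented formatting such as currency symbols and thousands separators. Here is a minimal sketch with a made-up price column:

import pandas as pd

df = pd.DataFrame({"price": ["$1,299.00", "$849.50"]})

# strip the currency symbol and the thousands separator, then convert to float
df["price"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(df)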


9.3 Handle missing values (NaN)

Missing values can distort metrics and aggregations. Therefore, you should detect them early.

Detect missing values: isnull()

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, 2, 3, 4],
    "C": [1, None, None, 4]
})

print(df.isnull())

Count missing values per column: isnull().sum()

print(df.isnull().sum())

Remove missing rows or columns: dropna()

df_cleaned_rows = df.dropna()        # drop rows with any NaN
df_cleaned_cols = df.dropna(axis=1)  # drop columns with any NaN

In practice, you often combine dropping with filling. For example, you may fill categorical fields with “unknown” and keep numeric fields for analysis.
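
As a minimal sketch of that pattern (the category and price columns are made up for illustration), fillna() lets you supply replacement values per column:

import pandas as pd

df = pd.DataFrame({
    "category": ["books", None, "toys"],
    "price": [12.5, 8.0, None]
})

# label missing categories instead of dropping the rows
df["category"] = df["category"].fillna("unknown")

# fill missing numeric values with a neutral statistic (here the column mean)
df["price"] = df["price"].fillna(df["price"].mean())

print(df)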


9.4 Handle duplicate values

Scraped data often contains duplicates because pages repeat items or APIs return overlapping results. So you should detect and remove duplicates.

Detect duplicates: duplicated()

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 2, 4],
    "B": [5, 6, 6, 8]
})

print(df.duplicated())

Drop duplicates: drop_duplicates()

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
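
In scraping pipelines you usually de-duplicate on a stable identifier rather than on the whole row. drop_duplicates() accepts subset and keep for that; the url column below is only an illustration:

import pandas as pd

df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "price": [10, 11, 20]   # the same page scraped twice with a changed price
})

# keep only the most recent row for each url
df_latest = df.drop_duplicates(subset="url", keep="last")
print(df_latest)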

9.5 Common string and type conversions

Crawler outputs frequently store numbers and time as strings. Therefore, you should normalize them before analysis.

Useful pandas operations include astype() for numeric conversion, to_datetime() for parsing timestamps, str.strip() and str.lower() for whitespace and case normalization, and replace() for mapping values.

Example:

import pandas as pd

df = pd.DataFrame({
    "A": ["1", "2", "3", "4"],
    "B": ["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"],
    "C": ["  hello ", " world ", " old_value ", " new_value "]
})

df["A"] = df["A"].astype(float)
df["B"] = pd.to_datetime(df["B"], errors="coerce")

df["C"] = df["C"].str.strip()
df["C"] = df["C"].str.lower()
df["C"] = df["C"].replace({"old_value": "new_value"})

print(df)

9.6 Regular expressions on columns

Regex is common in scraping, especially for extracting and standardizing text.

Extract with regex

df["phone_area"] = df["phone"].str.extract(r"\((\d{3})\)")

Vectorized string concatenation

df["name"] = df["first_name"].str.cat(df["last_name"], sep=" ")

Fuzzy matching (optional)

If you need approximate matching, you can use fuzzywuzzy:

pip install fuzzywuzzy

from fuzzywuzzy import fuzz

# 0-100 similarity score between two name columns, computed row by row
df["similarity"] = df.apply(lambda x: fuzz.ratio(x["name1"], x["name2"]), axis=1)

9.7 High-performance data merging

After pandas data cleaning for web scraping, you often need to combine several datasets into one. pandas offers a few options:

Fast merging based on index: join()

result = df1.join(df2, how="left")

Example:

import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=["K0", "K1"])
df2 = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

print(df1.join(df2))

Merge by key: merge()

result = pd.merge(df1, df2, on="key", how="outer", indicator=True)

The indicator=True flag helps you track which rows come from which table. As a result, debugging merges becomes easier.
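
For instance, in this small sketch with a made-up key column, the _merge column labels each row as left_only, right_only, or both:

import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "price": [10, 20, 30]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "stock": [5, 0, 7]})

result = pd.merge(left, right, on="key", how="outer", indicator=True)
print(result)   # inspect the _merge column to see where each row came from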

Concatenate by axis: concat()

pd.concat([df1, df2], axis=1, keys=["2024", "2025"])

concat() works well when you stack rows or align columns by index. Moreover, it supports join="inner" to keep only overlapping columns.
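
As a quick sketch of stacking two scraped batches by rows while keeping only the columns they share (the column names are made up):

import pandas as pd

batch_2024 = pd.DataFrame({"title": ["a", "b"], "price": [10, 12], "old_field": [1, 2]})
batch_2025 = pd.DataFrame({"title": ["c"], "price": [15], "new_field": ["x"]})

# join="inner" keeps only the columns present in both frames
combined = pd.concat([batch_2024, batch_2025], axis=0, join="inner", ignore_index=True)
print(combined)   # only title and price survive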


9.8 Save cleaned data into a database

Once your dataset has no obvious missing values, duplicates, or formatting issues, you can store it.

For example, you can save a DataFrame into SQL using to_sql:

df.to_sql("table_name", con)
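
Here is a minimal sketch using SQLAlchemy with a local SQLite file; the connection string, table name, and sample data are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# stand-in for your cleaned DataFrame
df = pd.DataFrame({"title": ["a", "b"], "price": [10.0, 12.5]})

# hypothetical local SQLite database; swap in your own connection string
engine = create_engine("sqlite:///scraped_data.db")

# write the DataFrame as a table, replacing it if it already exists
df.to_sql("cleaned_results", engine, if_exists="replace", index=False)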

At this point, the stored data is cleaner, more consistent, and ready for downstream queries, dashboards, or APIs.


Summary

This chapter showed a practical workflow for pandas data cleaning for web scraping: extract tables, normalize numeric fields, handle missing/duplicate values, apply string cleanup, use regex, merge datasets, and finally store results into a database. As a result, your crawler output becomes analysis-ready data instead of raw noise.