
10 Essential Data Cleaning Methods Every Python Web Scraper Should Know (Part 1)

Master essential Python data cleaning methods for web scraping: XPath parsing, pandas read_html, regex, BeautifulSoup, and proxy IP handling. Learn to extract Fortune 500 data, bypass blocks, and export to Excel/MySQL efficiently. Boost your scraping productivity with these pro tips.

2025-09-17

For web scraping engineers, data cleaning and storage are the final, and often most tedious, steps in the workflow. At companies that scrape thousands of websites, the work frequently becomes a dedicated role: the data cleaning specialist.

Here are the ten most efficient data cleaning techniques from my daily scraping practice:

1. XPath

XPath is my most frequently used HTML parsing method; mastering it solves over 90% of the data cleaning challenges you will hit while scraping.

Use Case: When scraped data is embedded in HTML code.

Example: Extracting Fortune Global 500 company data (2024 rankings):

import requests
from parsel import Selector

# Fortune Global 500 (2024) rankings page
url = "https://www.fortunechina.com/fortune500/c/2024-08/05/content_456697.htm"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
response.encoding = 'utf8'  # force UTF-8 so the Chinese text decodes correctly

selector = Selector(text=response.text)
# Drill down to the ranking table and select one node per row
companies = selector.xpath('//div[@class="hf-right word-img2"]/div[@class="word-table"]/div[@class="wt-table-wrap"]/table/tbody/tr')

for company in companies:
    # './' keeps each sub-query relative to the current <tr> row
    rank = company.xpath('./td[1]/text()').get()
    name = company.xpath('./td[2]/a/text()').get()
    revenue = company.xpath('./td[3]/text()').get()
    profit = company.xpath('./td[4]/text()').get()
    country = company.xpath('./td[5]/text()').get()
    print(rank, name, revenue, profit, country)

Key XPath Syntax:

- // selects nodes anywhere in the document; / selects direct children
- ./ makes a query relative to the current node (as in the row loop above)
- @attr filters by attribute, e.g. div[@class="word-table"]
- text() extracts a node's text content
- [n] picks the n-th match (1-indexed), e.g. ./td[1]
- contains(@class, "x") matches an attribute value by substring
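
To see these patterns in isolation, here is a small self-contained sketch (the inline HTML is invented for illustration):

from parsel import Selector

html = '<div class="word-table"><table><tr><td>1</td><td><a>Walmart</a></td></tr></table></div>'
sel = Selector(text=html)

# contains() matches the class attribute by substring
row = sel.xpath('//div[contains(@class, "word-table")]//tr')[0]
print(row.xpath('./td[1]/text()').get())    # '1' (positional index)
print(row.xpath('./td[2]/a/text()').get())  # 'Walmart' (descend into <a>)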

2. Pandas read_html

For tabular data in HTML, Pandas offers a one-line solution:

import pandas as pd
from io import StringIO
from sqlalchemy import create_engine  # needed for the MySQL export below

# read_html returns one DataFrame per <table> on the page; wrapping the
# HTML in StringIO avoids the deprecation warning raised by pandas 2.x.
# 'response' is the object fetched in the XPath example above.
df = pd.read_html(StringIO(response.text))[0]
print(df.head())

# Export options
df.to_excel('fortune500.xlsx', index=False)
engine = create_engine('mysql+pymysql://user:pass@localhost/db')
df.to_sql('fortune500', engine, if_exists='replace', index=False)

# Quick analysis
print(df['Country'].value_counts())
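
One caveat: read_html keeps the page's original column headers, which for this site will likely be Chinese, so the df['Country'] lookup assumes renamed columns. A minimal rename sketch, assuming the five-column order from the XPath example (the English names are hypothetical):

# Hypothetical English names for the five columns scraped above
df.columns = ['Rank', 'Company', 'Revenue', 'Profit', 'Country']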

Pro Tip: When facing IP blocks, integrate proxy rotation:

# The same proxy URL handles both plain HTTP and HTTPS requests
proxy = "http://user:pass@proxy_ip:port"
response = requests.get(url, proxies={'http': proxy, 'https': proxy})
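
The snippet above pins every request to a single proxy; actual rotation means picking a different proxy per request. Here is a minimal sketch, using a hypothetical pool of proxy URLs:

import random
import requests

# Hypothetical pool; in practice these URLs come from your proxy provider
PROXY_POOL = [
    "http://user:pass@proxy1_ip:port",
    "http://user:pass@proxy2_ip:port",
    "http://user:pass@proxy3_ip:port",
]

def fetch(url):
    # Choose a random proxy per request so a block on one IP
    # does not stall the whole crawl
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)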

[Continued in next part…]

These methods form the core toolkit for efficient web data extraction and transformation. The complete code examples are available on [GitHub repository].