This is the third installment in our web scraping data cleaning series, continuing from 10 Essential Data Cleaning Methods for Python Web Scraping (Part 2).
By carefully working through this tutorial series, you’ll be equipped to handle the vast majority of real-world data cleaning scenarios.
7. GNE: General News Extractor
One day, I received a task from my boss – an Excel file filled with webpage URLs. “These are all competitor press releases,” he said. “I need you to extract and summarize the content from each URL so we can compare our product’s advantages.”
Just as I settled in with my tea, ready for a relaxed day, this urgent assignment landed on my desk. Hundreds of URLs needed processing, with daily progress reports required.
Even working at maximum speed, manually copying and pasting content would take weeks. That’s when my colleague – a recent graduate and Python developer – introduced me to GNE.
After half a day studying the documentation, I automated the entire extraction process. What was estimated as a week’s work got completed in a single day.
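A job like this starts with the boss's spreadsheet of URLs. Exported to CSV, the list can be loaded and de-duplicated with the standard library before any extraction begins. This is a sketch: the `url` column name is an assumption about how the export looks.

```python
import csv

def load_urls(lines):
    """Read unique URLs from CSV rows with a 'url' column,
    preserving order (the column name is an assumption)."""
    seen, urls = set(), []
    for row in csv.DictReader(lines):
        u = row['url'].strip()
        if u and u not in seen:
            seen.add(u)
            urls.append(u)
    return urls

# Works with any iterable of lines -- e.g. open('urls.csv'):
sample = ['url\n',
          'https://example.com/a\n',
          'https://example.com/a\n',
          'https://example.com/b\n']
print(load_urls(sample))  # ['https://example.com/a', 'https://example.com/b']
```

De-duplicating up front matters when the list has hundreds of entries: every repeat you skip is one less request and one less row to reconcile in the daily report.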
GeneralNewsExtractor (GNE) is a specialized tool for extracting article content, titles, authors, publication dates, and images from news website HTML. It achieves near-perfect accuracy on hundreds of Chinese news platforms, including Toutiao, NetEase News, and Sina.
Installation:
```shell
# Choose either method:
pip install --upgrade gne
# OR
pipenv install gne
```
Implementation Example (ChinaDaily):
```python
from gne import GeneralNewsExtractor
import requests

url = "https://www.chinadaily.com.cn/a/202505/24/WS68317c10a310a04af22c1529.html"

def crawl(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    return response.text

html = crawl(url)
extractor = GeneralNewsExtractor()
result = extractor.extract(html)

print('Title:', result['title'])
print('Content:', result['content'])
print('Publish Time:', result['publish_time'])
```
Advanced Features:
- Custom title XPath: `extractor.extract(html, title_xpath='//h5/text()')`
- Noise removal: `extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])`
Note: GNE takes fully rendered HTML as input rather than fetching URLs itself; for JavaScript-heavy pages, supply the HTML produced by a headless browser. For API-based sites, consider the JSON extraction methods covered in Part 2.
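As a quick reminder of that Part 2 technique: when a site serves its articles through an API, the fields can be pulled straight out of the JSON response with the standard library, no HTML extraction needed. The payload shape below is hypothetical; real field names vary by site.

```python
import json

# Hypothetical API response -- real field names vary by site.
payload = '''{"data": {"title": "Sample headline",
                       "body": "Article text...",
                       "pub_time": "2025-05-24"}}'''

doc = json.loads(payload)['data']
print('Title:', doc['title'])
print('Publish Time:', doc['pub_time'])
```

When an API like this exists, it is usually both faster and more reliable than parsing the rendered page.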
8. Newspaper3k
While GNE specializes in content extraction, Newspaper3k offers built-in crawling capabilities and broader language support.
Installation:
```shell
pip install newspaper3k
```
Basic Usage:
```python
from newspaper import Article

url = 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)
article.download()
article.parse()

print('Authors:', article.authors)
print('Date:', article.publish_date)
print('Content:', article.text)
```
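One practical wrinkle: `article.publish_date` comes back as a `datetime` object, or `None` when the date can't be detected. A small helper (a sketch, not part of newspaper3k) keeps report output consistent either way:

```python
from datetime import datetime

def fmt_date(dt):
    """Return an ISO date string, or '' when date detection failed."""
    return dt.strftime('%Y-%m-%d') if dt else ''

print(fmt_date(datetime(2013, 11, 27)))  # 2013-11-27
print(fmt_date(None))                    # (empty string)
```

Normalizing dates this way avoids mixed `datetime`/`None` cells when the results are written out to Excel later.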
Bulk Processing:
```python
import newspaper
import pandas as pd
from newspaper.article import ArticleException

cnn_paper = newspaper.build('https://cnn.com')

articles_data = []
for article in cnn_paper.articles:
    try:
        article.download()
        article.parse()
    except ArticleException:
        continue  # skip articles that fail to download or parse
    articles_data.append({
        'url': article.url,
        'authors': article.authors,
        'date': article.publish_date,
        'content': article.text,
    })

# to_excel requires openpyxl: pip install openpyxl
pd.DataFrame(articles_data).to_excel('cnn_articles.xlsx', index=False)
```
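Bulk runs over hundreds of URLs inevitably hit transient network errors. A small retry-with-backoff wrapper (a generic sketch, not part of newspaper3k or GNE) keeps one flaky URL from sinking the whole run:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, wait and retry with a doubling delay.
    Re-raises the last exception once attempts are exhausted."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

# Usage sketch: with_retries(lambda: article.download())
```

Wrapping only the network-facing calls this way, while letting genuine parse errors surface, is usually enough to make an overnight batch job finish cleanly.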
This concludes Part 3 of our series. These tools can dramatically streamline content extraction workflows, transforming days of manual work into automated processes.