This is Part 3 of the 10 Essential Data Cleaning Methods for Python Web Scraping series.
By systematically following this tutorial series, you can confidently handle around 90% of real-world web scraping data cleaning tasks, especially those involving unstructured news and article pages.
In this chapter, we focus on news content extraction for web scraping, introducing two highly practical tools: GNE (General News Extractor) and Newspaper3k.
7. GNE: General News Extractor
One day, I received a task from my boss – an Excel file filled with webpage URLs. “These are all competitor press releases,” he said. “I need you to extract and summarize the content from each URL so we can compare our product’s advantages.”
Just as I settled in with my tea, ready for a relaxed day, this urgent assignment landed on my desk. Hundreds of URLs needed processing, with daily progress reports required.
Even working at maximum speed, manually copying and pasting content would take weeks. That’s when my colleague – a recent graduate and Python developer – introduced me to GNE.
After half a day studying the documentation, I automated the entire extraction process. What was estimated as a week’s work got completed in a single day.
GeneralNewsExtractor (GNE) is a specialized tool for extracting article content, titles, authors, publication dates, and images from news-page HTML. It achieves near-perfect accuracy on hundreds of Chinese news platforms including Toutiao, NetEase News, and Sina.
Installation:
# Choose either method:
pip install --upgrade gne
# OR
pipenv install gne
Implementation Example (ChinaDaily):
from gne import GeneralNewsExtractor
import requests
url = "https://www.chinadaily.com.cn/a/202505/24/WS68317c10a310a04af22c1529.html"
def crawl(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    return response.text
html = crawl(url)
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print('Title:', result['title'])
print('Content:', result['content'])
print('Publish Time:', result['publish_time'])
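GNE returns publish_time as a raw string whose format varies from site to site, so a normalization step usually follows extraction. Below is a minimal sketch of such a normalizer; the candidate format list and the function name are our own assumptions, not part of GNE, and you would extend the list for the sites you target:

```python
from datetime import datetime

# Formats commonly seen in scraped publish-time strings (assumed; extend as needed).
CANDIDATE_FORMATS = [
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%d %H:%M',
    '%Y/%m/%d %H:%M:%S',
    '%Y-%m-%d',
]

def normalize_publish_time(raw):
    """Try each known format in turn; return a datetime, or None if nothing matches."""
    raw = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None

print(normalize_publish_time('2025-05-24 09:15:00'))
```

Storing a real datetime instead of the raw string makes later sorting and deduplication across sources much easier.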
Advanced Features:
- Custom title XPath:
  extractor.extract(html, title_xpath='//h5/text()')
- Noise removal:
  extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
Note: GNE does not fetch pages itself; it only parses the HTML you pass in. For JavaScript-rendered pages, supply the fully rendered HTML (e.g., from a headless browser). For API-based sites, consider the JSON extraction methods covered in Part 2.
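To make the noise_node_list behavior concrete, here is a stand-alone illustration of the underlying idea: drop every node matched by the XPath before extracting text. This sketch uses lxml directly rather than GNE's internals, and the HTML snippet and class names are made-up examples:

```python
from lxml import html as lxml_html

# Made-up page: one article div plus one comment div we treat as noise.
page = lxml_html.fromstring('''
<html><body>
  <div class="article"><p>Real article text.</p></div>
  <div class="comment-list"><p>Spammy comment.</p></div>
</body></html>
''')

# Remove every subtree matched by the noise XPath before reading the text.
for node in page.xpath('//div[@class="comment-list"]'):
    node.drop_tree()

text = page.text_content()
print('Spammy' in text)        # the comment text is gone
print('Real article' in text)  # the article text survives
```

This is exactly the kind of pre-extraction pruning that noise_node_list performs for you inside GNE.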
8. Newspaper3k
While GNE specializes in content extraction, Newspaper3k offers built-in crawling capabilities and broader language support.
Installation:
pip install newspaper3k
Basic Usage:
from newspaper import Article
url = 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)
article.download()
article.parse()
print('Authors:', article.authors)
print('Date:', article.publish_date)
print('Content:', article.text)
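Extracted article text often arrives with trailing spaces and runs of blank lines. Since this is a data cleaning series, here is a small stdlib helper for tidying it; the function name is ours, not part of Newspaper3k:

```python
import re

def clean_article_text(text):
    """Strip stray whitespace from each line and collapse runs of blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    joined = '\n'.join(lines)
    # Collapse 2+ consecutive newlines into a single paragraph break.
    return re.sub(r'\n{2,}', '\n\n', joined).strip()

messy = "  First paragraph.  \n\n\n\nSecond paragraph.\n"
print(clean_article_text(messy))
```

Applied to article.text, this keeps paragraph breaks intact while removing the layout noise that tends to survive extraction.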
Bulk Processing:
import newspaper
import pandas as pd
cnn_paper = newspaper.build('https://cnn.com')
articles_data = []
for article in cnn_paper.articles:
    try:
        article.download()
        article.parse()
    except Exception as e:
        # Skip articles that fail to download or parse instead of aborting the run.
        print(f'Skipping {article.url}: {e}')
        continue
    articles_data.append({
        'url': article.url,
        'authors': article.authors,
        'date': article.publish_date,
        'content': article.text,
    })
pd.DataFrame(articles_data).to_excel('cnn_articles.xlsx', index=False)
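In bulk runs, many download failures are transient network hiccups rather than dead links, so retrying once or twice recovers a surprising number of articles. A small generic retry helper (the name and defaults are our own) that you could wrap around article.download:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, retry up to `attempts` times with a fixed delay."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Demo with a flaky function that fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient failure')
    return 'ok'

print(with_retries(flaky, attempts=3, delay=0))  # prints 'ok'
```

In the bulk loop above, replacing article.download() with with_retries(article.download) gives each article a few chances before it is skipped.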
This concludes Part 3 of our series. These tools can dramatically streamline content extraction workflows, transforming days of manual work into automated processes.