This is Part 3 of the 10 Essential Data Cleaning Methods for Python Web Scraping series.
By systematically following this tutorial series, you can confidently handle around 90% of real-world web scraping data cleaning tasks, especially those involving unstructured news and article pages.
In this chapter, we focus on news content extraction for web scraping, introducing two highly practical tools: GNE (General News Extractor) and Newspaper3k.
7. GNE: General News Extractor
One day, I received a task from my boss – an Excel file filled with webpage URLs. “These are all competitor press releases,” he said. “I need you to extract and summarize the content from each URL so we can compare our product’s advantages.”
Just as I settled in with my tea, ready for a relaxed day, this urgent assignment landed on my desk. Hundreds of URLs needed processing, with daily progress reports required.
Even working at maximum speed, manually copying and pasting content would take weeks. That’s when my colleague – a recent graduate and Python developer – introduced me to GNE.
After half a day studying the documentation, I automated the entire extraction process. What was estimated as a week’s work got completed in a single day.
GeneralNewsExtractor (GNE) is a specialized tool for extracting article content, titles, authors, publication dates, and images from news-page HTML. It achieves near-perfect accuracy on hundreds of Chinese news platforms including Toutiao, NetEase News, and Sina.
Installation:
# Choose either method:
pip install --upgrade gne
# OR
pipenv install gne
Implementation Example (ChinaDaily):
from gne import GeneralNewsExtractor
import requests
url = "https://www.chinadaily.com.cn/a/202505/24/WS68317c10a310a04af22c1529.html"
def crawl(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    return response.text
html = crawl(url)
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print('Title:', result['title'])
print('Content:', result['content'])
print('Publish Time:', result['publish_time'])
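GNE returns publish_time as a raw string whose format varies from site to site, so a normalization step usually follows extraction. Below is a minimal sketch of such a normalizer; the candidate format list and the function name are our own assumptions, not part of GNE, and you would extend the list for the sites you target:

```python
from datetime import datetime

# Formats commonly seen in scraped publish-time strings (assumed; extend as needed).
CANDIDATE_FORMATS = [
    '%Y-%m-%d %H:%M:%S',
    '%Y-%m-%d %H:%M',
    '%Y/%m/%d %H:%M:%S',
    '%Y-%m-%d',
]

def normalize_publish_time(raw):
    """Try each known format in turn; return a datetime, or None if nothing matches."""
    raw = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None

print(normalize_publish_time('2025-05-24 09:15:00'))
```

Storing a real datetime instead of the raw string makes later sorting and deduplication across sources much easier.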
Advanced Features:
- Custom title XPath:
  extractor.extract(html, title_xpath='//h5/text()')
- Noise removal:
  extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
Note: GNE does not fetch pages itself; it only parses the HTML you pass in. For JavaScript-rendered pages, supply the fully rendered HTML (e.g., from a headless browser). For API-based sites, consider the JSON extraction methods covered in Part 2.
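To make the noise_node_list behavior concrete, here is a stand-alone illustration of the underlying idea: drop every node matched by the XPath before extracting text. This sketch uses lxml directly rather than GNE's internals, and the HTML snippet and class names are made-up examples:

```python
from lxml import html as lxml_html

# Made-up page: one article div plus one comment div we treat as noise.
page = lxml_html.fromstring('''
<html><body>
  <div class="article"><p>Real article text.</p></div>
  <div class="comment-list"><p>Spammy comment.</p></div>
</body></html>
''')

# Remove every subtree matched by the noise XPath before reading the text.
for node in page.xpath('//div[@class="comment-list"]'):
    node.drop_tree()

text = page.text_content()
print('Spammy' in text)        # the comment text is gone
print('Real article' in text)  # the article text survives
```

This is exactly the kind of pre-extraction pruning that noise_node_list performs for you inside GNE.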
8. Newspaper3k
While GNE specializes in content extraction, Newspaper3k offers built-in crawling capabilities and broader language support.
Installation:
pip install newspaper3k
Basic Usage:
from newspaper import Article
url = 'http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/'
article = Article(url)
article.download()
article.parse()
print('Authors:', article.authors)
print('Date:', article.publish_date)
print('Content:', article.text)
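Extracted article text often arrives with trailing spaces and runs of blank lines. Since this is a data cleaning series, here is a small stdlib helper for tidying it; the function name is ours, not part of Newspaper3k:

```python
import re

def clean_article_text(text):
    """Strip stray whitespace from each line and collapse runs of blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    joined = '\n'.join(lines)
    # Collapse 2+ consecutive newlines into a single paragraph break.
    return re.sub(r'\n{2,}', '\n\n', joined).strip()

messy = "  First paragraph.  \n\n\n\nSecond paragraph.\n"
print(clean_article_text(messy))
```

Applied to article.text, this keeps paragraph breaks intact while removing the layout noise that tends to survive extraction.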
Bulk Processing:
import newspaper
import pandas as pd
cnn_paper = newspaper.build('https://cnn.com')
articles_data = []
for article in cnn_paper.articles:
    try:
        article.download()
        article.parse()
    except Exception as e:
        # Skip articles that fail to download or parse instead of aborting the run.
        print(f'Skipping {article.url}: {e}')
        continue
    articles_data.append({
        'url': article.url,
        'authors': article.authors,
        'date': article.publish_date,
        'content': article.text,
    })
pd.DataFrame(articles_data).to_excel('cnn_articles.xlsx', index=False)
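In bulk runs, many download failures are transient network hiccups rather than dead links, so retrying once or twice recovers a surprising number of articles. A small generic retry helper (the name and defaults are our own) that you could wrap around article.download:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, retry up to `attempts` times with a fixed delay."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Demo with a flaky function that fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient failure')
    return 'ok'

print(with_retries(flaky, attempts=3, delay=0))  # prints 'ok'
```

In the bulk loop above, replacing article.download() with with_retries(article.download) gives each article a few chances before it is skipped.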
This concludes Part 3 of our series. These tools can dramatically streamline content extraction workflows, transforming days of manual work into automated processes.