Continuing from our previous article:
10 Essential Python Data Cleaning Techniques for Web Scraping
3. JSON Data Cleaning
Most modern websites use JSON format for API responses. Let’s take the Star Wars character API as an example:
API URL: https://swapi.dev/api/people/
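import requests

url = "https://swapi.dev/api/people/"
# verify=False skips TLS certificate verification; avoid it outside of testing
response = requests.get(url, verify=False, timeout=10)
response.raise_for_status()  # Fail early on HTTP errors
json_data = response.json()
print(json_data['results'])  # Access the character data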
If non-English text (such as Chinese) comes out garbled, the server has probably declared the wrong charset. Set the encoding explicitly before parsing:
response.encoding = 'utf-8'
json_data = response.json()
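Once the JSON is parsed, the individual fields usually still need normalizing. The sketch below is one possible approach, not part of the original example: it assumes records shaped like the ones this API returns, where numbers arrive as strings (sometimes with thousands separators) and missing values are the string "unknown". The sample record and the helper names to_number and clean_record are made up for illustration.

```python
def to_number(value):
    """Convert a numeric string such as "1,358" to int or float; return None if not numeric."""
    if value is None:
        return None
    text = str(value).replace(",", "").strip()
    try:
        number = float(text)
    except ValueError:
        return None  # e.g. "unknown" or "n/a"
    return int(number) if number.is_integer() else number


def clean_record(record, numeric_fields=("height", "mass")):
    """Return a copy of record with numeric fields converted and "unknown" replaced by None."""
    cleaned = {}
    for key, value in record.items():
        if key in numeric_fields:
            cleaned[key] = to_number(value)
        elif value == "unknown":
            cleaned[key] = None
        else:
            cleaned[key] = value
    return cleaned


# Hypothetical sample shaped like one entry of json_data['results']
sample = {
    "name": "Example Character",
    "height": "172",
    "mass": "1,358",
    "birth_year": "unknown",
    "films": ["https://swapi.dev/api/films/1/"],
}
print(clean_record(sample))
```

Nested values such as the films list pass through unchanged, so the cleaned record can still be stored as a single document.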
4. Storing Data in MongoDB (NoSQL Database)
When dealing with nested JSON data (like the films array in our Star Wars example), MongoDB can store each record as-is as a single document, whereas a relational database would need the nested arrays split into separate tables.
Installation:
pip install pymongo
Sample code for MongoDB insertion:
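import pymongo
from pymongo.errors import BulkWriteError

# Connection setup (user, password, host, and port must be defined beforehand)
client = pymongo.MongoClient(f'mongodb://{user}:{password}@{host}:{port}')
db = client['db_spider']
collection = db['wars_star']

# A unique index on "name" makes MongoDB reject duplicate characters
collection.create_index("name", unique=True)
try:
    # ordered=False keeps inserting the remaining documents after one is rejected
    collection.insert_many(json_data['results'], ordered=False)
except BulkWriteError as exc:
    # Raised after the batch finishes if any document was rejected (e.g. a duplicate name)
    print(len(exc.details['writeErrors']), 'documents rejected')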
Query examples (run in the MongoDB shell, not in Python):
// Find names containing "Le"
db.getCollection('wars_star').find({'name':/Le/})
// Find characters appearing in specific film
db.getCollection('wars_star').find({films: { $in: ['https://swapi.dev/api/films/1/']}})
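The same queries can be issued from Python. As a rough sketch, the filters are plain dictionaries passed to pymongo's find method; the helper names below are invented for illustration, and the commented-out call assumes a connected collection object like the one in the insertion code.

```python
import re


def name_contains(substring):
    """Filter matching documents whose name contains the given text (case-sensitive)."""
    # re.escape keeps characters like '.' or '(' from being read as regex syntax
    return {"name": {"$regex": re.escape(substring)}}


def appears_in(film_url):
    """Filter matching documents whose films array contains film_url."""
    return {"films": {"$in": [film_url]}}


print(name_contains("Le"))
print(appears_in("https://swapi.dev/api/films/1/"))

# With a live connection (assumed, not run here):
# for doc in collection.find(name_contains("Le")):
#     print(doc["name"])
```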
5. Handling JavaScript Object Data (JSONP)
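Some websites return data as a JavaScript assignment or callback rather than pure JSON (East Money’s stock data, for example). Strip the wrapper first, then parse the object with demjson, which accepts JavaScript-style syntax such as unquoted keys: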
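import demjson

# response is the reply from the page, with text shaped like: var name = {...};
# Keep only the object literal between the first '=' and the last ';'
js_data = response.text[response.text.find('=') + 1: response.text.rfind(';')].strip()
# Parse with demjson
raw_data = demjson.decode(js_data)
# Each entry in 'datas' is a comma-separated string; split it into fields
rank_list = [item.split(',') for item in raw_data['datas']]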
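When the wrapped payload happens to be strict JSON (double-quoted keys and strings), the standard library may be enough and no third-party parser is needed. The following is a minimal sketch under that assumption; the function name extract_json_payload and the sample strings are made up for illustration and are not an actual East Money response.

```python
import json


def extract_json_payload(text):
    """Parse the JSON found between the first '{' or '[' and the last matching closer."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found")
    start = min(starts)
    closer = "}" if text[start] == "{" else "]"
    end = text.rfind(closer)
    if end < start:
        raise ValueError("unterminated JSON payload")
    return json.loads(text[start:end + 1])


# Hypothetical wrappers: a variable assignment and a JSONP callback
assignment = 'var rankData = {"datas": ["1,AAA,10.5", "2,BBB,9.8"]};'
callback = 'callback({"count": 2});'

data = extract_json_payload(assignment)
rows = [item.split(",") for item in data["datas"]]
print(rows)
print(extract_json_payload(callback))
```

If json.loads raises an error on the extracted text, the payload is probably not strict JSON, and a lenient parser is the better choice.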
6. Regular Expressions (The Universal Solution)
When no structured parser fits the data, regular expressions can pull values straight out of raw text:
Single match (re.search):
import re

# Extract follower count from HTML
html = '<div class="q-text">6,526 followers</div>'
match = re.search(r'>(.*?) followers<', html)
followers = int(match.group(1).replace(',', ''))  # Returns 6526
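Keep in mind that re.search returns None when nothing matches, so calling group() on the result fails on pages that lack the expected markup. A defensive variant might look like this; parse_followers is a hypothetical helper name, and the pattern assumes the count consists only of digits and commas.

```python
import re

# A digit followed by any mix of digits and commas, then the word "followers"
FOLLOWERS_RE = re.compile(r">\s*(\d[\d,]*)\s+followers<")


def parse_followers(html):
    """Return the follower count as an int, or None if the pattern is absent."""
    match = FOLLOWERS_RE.search(html)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_followers('<div class="q-text">6,526 followers</div>'))
print(parse_followers('<div class="q-text">no data</div>'))
```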
Multiple matches (re.findall):
text = """Phone numbers: 18767543212 and 19767443218"""
phones = re.findall(r'\d{11}', text) # Returns ['18767543212', '19767443218']
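Note that the pattern \d{11} also matches the first 11 digits of any longer digit run, such as an order number. If that matters, lookarounds can require that the match is not surrounded by other digits. A small sketch, with made-up sample text:

```python
import re

# Exactly 11 digits, with no digit immediately before or after
PHONE_RE = re.compile(r"(?<!\d)\d{11}(?!\d)")

text = "Phone: 18767543212, order id: 202401011234567, backup: 19767443218"

loose = re.findall(r"\d{11}", text)   # also picks up part of the order id
strict = PHONE_RE.findall(text)       # only the standalone 11-digit numbers

print(loose)
print(strict)
```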
These techniques cover common web scraping scenarios. For more on actual data collection methods, see:
Crawling HTML Pages: Python Web Scraping Tutorial
Stay tuned for more advanced data cleaning methods in our next installment!