Python web scraping for image data cleaning and storage is a practical workflow when you need images at scale for training and fine-tuning models. In many projects, you must not only scrape image URLs, but also download the files, convert formats (such as WebP to JPEG), standardize sizes, normalize pixel values, enhance quality, and finally store the images reliably for later use.
In this tutorial, we use a Stable Diffusion gallery page as an example to demonstrate an end-to-end pipeline: scrape → download → clean/process → store.
What you will build
By the end, you will have:
- A scraper that extracts image src URLs from a gallery page
- A downloader that saves images locally
- A processing toolkit for:
  - WebP → JPEG conversion
  - resizing and compression
  - fixed-size normalization for YOLO-style input
  - pixel normalization strategies
  - denoising and sharpening
- Two MongoDB storage approaches:
  - GridFS (store binary data, suited for large files)
  - Path + metadata (recommended at scale)
Step 1: Find the image download URL (src)
Open the Stable Diffusion gallery in your browser and press F12 to open Developer Tools. Then, inspect the image element and locate the src attribute. That src is the download URL you want to extract.
Because page layouts change, always confirm the current HTML structure (tags and class names) before writing your XPath.
Step 2: Parse image src URLs with XPath (Parsel + Requests)
Below is a minimal example to request the page HTML and parse image URLs using XPath.
from parsel import Selector
import requests

url = "https://stabledifffusion.com/gallery"
payload = {}
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'cookie': '_ga=GA1.1.258999226.1754806446; _ga_C4QP4FPRFF=GS2.1.s1754806445$o1$g1$t1754807302$j44$l0$h0',
    'pragma': 'no-cache',
    'referer': 'https://stabledifffusion.com/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'
}

def get_stable_diffusion_images():
    response = requests.request("GET", url, headers=headers, data=payload)
    text = response.text
    resp = Selector(text=text)
    image_urls = resp.xpath('//div[@class="grid grid-cols-1 md:grid-cols-3 gap-4"]/div[@class="max-w-sm"]/img/@src').getall()
    return image_urls
Example output:
[
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-1.webp",
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-2.webp",
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-3.webp",
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-4.webp",
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-5.webp",
    "https://cdn.jsdelivr.net/gh/boringcdn/sd/sd-generate-6.webp"
]
Step 3: Download images locally
Next, write a download function. For reliability, keep it simple first, then add retries if your project needs it.
import requests

def download_image(image_url, filename):
    response = requests.get(image_url)
    if response.status_code == 200:
        with open(filename, 'wb') as file:
            file.write(response.content)
        print(f"Image {filename} downloaded successfully.")
    else:
        print(f"Failed to download image {filename}. Status code: {response.status_code}")
Step 4: Convert WebP to JPEG with Pillow
Stable Diffusion gallery images are often in WebP format. Fortunately, Pillow can read WebP directly in most environments.
from PIL import Image

# open WebP format image
with Image.open("sd-generate-1.webp") as img:
    # display image info
    print(f"format: {img.format}")
    print(f"size: {img.size}")
    # JPEG has no alpha channel, so convert to RGB before saving
    img.convert("RGB").save("image.jpg", "JPEG")
This step ensures compatibility with tools that prefer JPG/PNG.
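If you have a whole folder of downloaded WebP files, the same conversion can run in a loop. A minimal sketch, assuming the folder names below (they are placeholders, not required paths):

import os
from PIL import Image

input_dir = "raw_images"    # folder with downloaded .webp files (placeholder name)
output_dir = "jpeg_images"  # folder for converted .jpg files (placeholder name)
os.makedirs(output_dir, exist_ok=True)

for name in os.listdir(input_dir):
    if not name.lower().endswith(".webp"):
        continue
    src_path = os.path.join(input_dir, name)
    dst_path = os.path.join(output_dir, os.path.splitext(name)[0] + ".jpg")
    with Image.open(src_path) as img:
        # JPEG has no alpha channel, so always convert to RGB first
        img.convert("RGB").save(dst_path, "JPEG", quality=90)
    print(f"Converted {src_path} -> {dst_path}")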
Step 5: Resize and compress to reduce storage
Resizing is common because it saves storage and improves training throughput. Additionally, high-quality downsampling like LANCZOS reduces aliasing artifacts.
from PIL import Image

# Open WebP image
with Image.open("image.webp") as img:
    print(f"Original format: {img.format}")
    print(f"Original size: {img.size}")  # (width, height)

    new_width = img.size[0] // 2
    new_height = img.size[1] // 2
    new_size = (new_width, new_height)

    resized_img = img.resize(new_size, Image.Resampling.LANCZOS)
    print(f"New size: {resized_img.size}")
    # JPEG has no alpha channel, so convert to RGB before saving
    resized_img.convert("RGB").save("resized_image.jpg", "JPEG")
Resize to fixed input size (YOLO-style)
For YOLO training, images usually must be fixed-size (for example 640×640). Therefore, size normalization is required.
resized_img = img.resize((640, 640), Image.Resampling.LANCZOS)
resized_img.save("resized_image.jpg", "jpeg")
Step 6: Normalize pixel values (4 common methods)
Besides resizing, pixel normalization is another standard step for YOLO and general computer vision pipelines.
Supported methods:
- 0-1: map pixel values from [0,255] to [0,1]
- -0.5-0.5: map pixel values to [-0.5,0.5]
- z-score: standardize to mean 0 and std 1 (useful when lighting varies)
- uint8: denormalize back to [0,255] integers for saving/display
import cv2
import numpy as np
from PIL import Image

def normalize_pixel_values(image, method='0-1'):
    """
    Image pixel value normalization function

    Parameters:
        image: Input image, can be a PIL Image or NumPy array
        method: Normalization method
            '0-1': Normalize to [0, 1] range
            '-0.5-0.5': Normalize to [-0.5, 0.5] range
            'z-score': Z-score standardization
            'uint8': Convert to 0-255 integers (denormalization)

    Returns:
        Normalized image
    """
    if isinstance(image, Image.Image):
        image = np.array(image)
    normalized = image.copy().astype(np.float32)
    if method == '0-1':
        if normalized.max() > 0:
            normalized = normalized / 255.0
    elif method == '-0.5-0.5':
        normalized = (normalized / 255.0) - 0.5
    elif method == 'z-score':
        mean = np.mean(normalized)
        std = np.std(normalized)
        if std > 0:
            normalized = (normalized - mean) / std
        else:
            normalized = normalized - mean
    elif method == 'uint8':
        # denormalization: if the image is in [0, 1], scale back to [0, 255] first
        if normalized.max() <= 1.0:
            normalized = normalized * 255.0
        normalized = np.clip(normalized, 0, 255).astype(np.uint8)
    else:
        raise ValueError(f"Unsupported normalization method: {method}")
    return normalized

if __name__ == "__main__":
    image_path = "input_image.jpg"
    cv_image = cv2.imread(image_path)
    cv_image_rgb = cv2.cvtColor(cv_image, cv2.COLOR_BGR2RGB)
    pil_image = Image.open(image_path)

    methods = ['0-1', '-0.5-0.5', 'z-score']
    for method in methods:
        normalized_cv = normalize_pixel_values(cv_image_rgb, method)
        print(f"Method: {method}, OpenCV image - Pixel range: [{normalized_cv.min():.4f}, {normalized_cv.max():.4f}]")
        normalized_pil = normalize_pixel_values(pil_image, method)
        print(f"Method: {method}, PIL image - Pixel range: [{normalized_pil.min():.4f}, {normalized_pil.max():.4f}]")

    normalized = normalize_pixel_values(cv_image_rgb, '0-1')
    denormalized = normalize_pixel_values(normalized, 'uint8')
    print(f"Denormalization - Pixel range: [{denormalized.min()}, {denormalized.max()}], Data type: {denormalized.dtype}")
    cv2.imwrite("denormalized_image.jpg", cv2.cvtColor(denormalized, cv2.COLOR_RGB2BGR))
Step 7: Image quality optimization (denoise and sharpen)
Sometimes scraped images are noisy, low-contrast, or slightly blurry. In that case, you can enhance feature clarity before training.
Common processing options:
- obvious noise → non-local means denoising / bilateral filtering
- low contrast → CLAHE adaptive histogram equalization
- slight blur → Laplacian sharpening
- severe blur/low-res → super-resolution (Real-ESRGAN)
A basic OpenCV example for denoising:
import cv2
import numpy as np
from matplotlib import pyplot as plt

def denoise_image(image_path, method='non_local_means'):
    """
    Image denoising processing
    :param image_path: Input image path
    :param method: Denoising method
        'gaussian': Gaussian filtering
        'median': Median filtering
        'bilateral': Bilateral filtering
        'non_local_means': Non-local means denoising
    :return: Denoised image
    """
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError("Unable to read image, please check the path")
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    if method == 'gaussian':
        denoised = cv2.GaussianBlur(img, (5, 5), 0)
    elif method == 'median':
        denoised = cv2.medianBlur(img, 5)
    elif method == 'bilateral':
        denoised = cv2.bilateralFilter(img, 9, 75, 75)
    elif method == 'non_local_means':
        denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
    else:
        raise ValueError(f"Unsupported denoising method: {method}")
    denoised_rgb = cv2.cvtColor(denoised, cv2.COLOR_BGR2RGB)

    plt.figure(figsize=(10, 5))
    plt.subplot(121), plt.imshow(img_rgb), plt.title('Original Image')
    plt.subplot(122), plt.imshow(denoised_rgb), plt.title(f'Denoised ({method})')
    plt.show()
    return denoised

if __name__ == "__main__":
    image_path = "lena.jpg"
    denoised = denoise_image(image_path, method='non_local_means')
    cv2.imwrite("denoised_lena.jpg", denoised)
Run:
python image_process.py
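The same OpenCV toolbox covers the other options listed above. For example, Laplacian-style sharpening and CLAHE contrast enhancement can be sketched as follows; the kernel values, CLAHE parameters, and output filenames are typical illustrative choices, not tuned settings.

import cv2
import numpy as np

def sharpen_image(img):
    """Simple Laplacian-style sharpening using a 3x3 kernel."""
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(img, -1, kernel)

def enhance_contrast_clahe(img, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Apply CLAHE on the L channel of the LAB color space to boost low contrast."""
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)

if __name__ == "__main__":
    img = cv2.imread("denoised_lena.jpg")  # output of the denoising step above
    if img is not None:
        cv2.imwrite("sharpened_lena.jpg", sharpen_image(img))
        cv2.imwrite("clahe_lena.jpg", enhance_contrast_clahe(img))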
Step 8: Image storage strategies
For scalable storage, you can use Amazon S3 object storage or build your own storage engine. MongoDB is also a common choice, and there are two typical approaches:
- Store binary image data (suited for small images; large files use GridFS)
- Store image file path + metadata (recommended at scale and for high concurrency)
Method 1: Store image binary data in MongoDB (GridFS)
For files larger than 16MB (the BSON document size limit), MongoDB recommends GridFS. GridFS splits a file into chunks (255KB by default), which works well for large images and videos.
from pymongo import MongoClient
from gridfs import GridFS

class MongoDBImageStorage:
    def __init__(self, db_name="image_database"):
        self.client = MongoClient('mongodb://localhost:27017/')
        self.db = self.client[db_name]
        self.fs = GridFS(self.db)

    def store_image(self, image_path, metadata=None):
        try:
            with open(image_path, 'rb') as f:
                image_data = f.read()
            filename = image_path.split('/')[-1]
            file_id = self.fs.put(
                image_data,
                filename=filename,
                content_type=f'image/{filename.split(".")[-1]}',
                metadata=metadata or {}  # store custom fields under "metadata" so file.metadata can return them later
            )
            print(f"Image stored successfully. File ID: {file_id}")
            return file_id
        except Exception as e:
            print(f"Error storing image: {str(e)}")
            return None

    def retrieve_image(self, file_id, output_path):
        try:
            file = self.fs.get(file_id)
            image_data = file.read()
            with open(output_path, 'wb') as f:
                f.write(image_data)
            print(f"Image retrieved successfully. Saved to: {output_path}")
            return True
        except Exception as e:
            print(f"Error retrieving image: {str(e)}")
            return False

    def get_image_metadata(self, file_id):
        try:
            file = self.fs.get(file_id)
            return {
                "filename": file.filename,
                "content_type": file.content_type,
                "upload_date": file.upload_date,
                "length": file.length,
                "metadata": file.metadata
            }
        except Exception as e:
            print(f"Error getting metadata: {str(e)}")
            return None

    def delete_image(self, file_id):
        try:
            self.fs.delete(file_id)
            print(f"Image with ID {file_id} deleted successfully")
            return True
        except Exception as e:
            print(f"Error deleting image: {str(e)}")
            return False

if __name__ == "__main__":
    storage = MongoDBImageStorage()
    metadata = {"category": "nature", "resolution": "1920x1080"}
    file_id = storage.store_image("test_image.jpg", metadata)
    if file_id:
        print("Image metadata:", storage.get_image_metadata(file_id))
        storage.retrieve_image(file_id, "retrieved_image.jpg")
        # storage.delete_image(file_id)
Method 2: Store image paths + metadata (recommended at scale)
For large images or high-concurrency usage, it is more efficient to store images in a filesystem (local disk, NAS, or cloud storage) and only save the path/URL and metadata in MongoDB.
In practice, this approach simplifies CDN delivery, reduces database load, and improves read performance. Moreover, it is easier to scale as a distributed crawler system.
You can then expose image access via an API using FastAPI or Express, and with a domain name, you can build an S3-like storage service.
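A minimal sketch of the path-plus-metadata approach with pymongo follows; the database, collection, and field names are illustrative, not a fixed schema.

import hashlib
import os
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
images = client["image_database"]["images"]

def register_image(file_path, source_url):
    """Store only the file path plus metadata; the binary stays on disk or in object storage."""
    with open(file_path, "rb") as f:
        checksum = hashlib.md5(f.read()).hexdigest()
    doc = {
        "path": os.path.abspath(file_path),  # or an S3/CDN URL
        "source_url": source_url,
        "format": os.path.splitext(file_path)[1].lstrip("."),
        "size_bytes": os.path.getsize(file_path),
        "checksum": checksum,
        "crawled_at": datetime.now(timezone.utc),
    }
    return images.insert_one(doc).inserted_id

# Example
# register_image("jpeg_images/sd-generate-1.jpg", "https://stabledifffusion.com/gallery")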
Practical checklist for production pipelines
To keep your pipeline stable:
- Validate URLs before downloading
- Deduplicate images by hash (optional, but useful; see the sketch after this list)
- Standardize output naming (include source + timestamp)
- Store metadata: source page, crawl time, format, size, checksum
- Separate raw images vs processed images folders
- Keep a failure log for retries
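For the deduplication item above, a content hash makes a cheap duplicate key: hash each downloaded file and skip or delete exact copies. A minimal sketch, where the folder name and choice of SHA-256 are illustrative:

import hashlib
import os

def find_duplicates(folder):
    """Group files by content hash; any hash with more than one path is an exact duplicate."""
    seen = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        seen.setdefault(digest, []).append(path)
    return {h: paths for h, paths in seen.items() if len(paths) > 1}

# Example
# for digest, paths in find_duplicates("raw_images").items():
#     print(f"Duplicate set {digest[:8]}: {paths}")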
FAQ
What is python web scraping for image data cleaning and storage?
It is a pipeline that extracts image URLs from web pages, downloads images, converts and processes them (resize, normalize, enhance), and stores them in a database or object storage for later model training.
Why convert WebP images to JPEG?
Some training or annotation tools prefer JPEG/PNG. Converting also improves compatibility across environments.
Should I store image binaries in MongoDB?
You can, especially with GridFS for large files. However, at scale, storing paths (and using object storage) is usually more efficient.
What image size should I use for YOLO training?
Common sizes include 640×640 or 416×416. The correct choice depends on your model and dataset, but fixed-size input is a standard requirement.
Conclusion
Python web scraping for image data cleaning and storage becomes much easier when you treat it as a structured pipeline: scrape URLs, download images, clean and normalize them, enhance quality when necessary, and store them with an approach that scales. With MongoDB GridFS or a path-based strategy plus cloud storage, your crawler can grow into a distributed system that centralizes image data for training and fine-tuning.