
Web Crawling & Data Collection Basics Guide

This beginner guide explains web crawling basics, including workflows, tools, storage, and compliance. Learn how to collect public web data efficiently using Python, Scrapy, and modern crawling practices.

2026-01-04

Web crawling basics explain how automated systems collect public data from websites at scale. For a complete overview, see our web scraping API guide. In the digital age, data has become a core production factor: business analysis, academic research, and product development all depend on high-quality data. This guide introduces web crawling and data collection basics and shows how automated collection turns massive numbers of public web pages into structured datasets for analysis, with practical examples and compliance tips along the way. To deepen your understanding of crawler design and its systemic challenges, see our detailed exploration of crawler technology principles and architecture, which covers how industry-grade crawlers manage scale and robustness.


I. Core Concepts: What are Web Crawling and Data Collection?

1.1 Web Crawler

A web crawler (also called a web spider or web robot) is a program or script that automatically browses the World Wide Web (WWW) and crawls web page data according to preset rules. Its core job is to simulate human browsing behavior and batch obtain content such as text, images, links, and tables—without manual, page-by-page work.

Common crawler use cases include search engine indexing, price and competitor monitoring, market and academic research, and aggregating public content such as news or job listings.

1.2 Data Collection

Data collection means obtaining raw data through different approaches. Web crawling is one important method, but not the only one. Data can also come from manual entry, API calls, surveys, sensors, database exports, and more.

Compared with other methods, crawling is especially useful when you need public web data at scale and no official API exists.

A core principle applies to every data source: legality, compliance, and respect for rights and interests. No matter which method you use, follow relevant laws and the data source’s rules.


II. Web Crawling Basics: Core Workflow

A web crawler usually follows this loop:

discover links → visit pages → extract data → store data → iterate

Below are the core steps.

2.1 Initialization: Determine Seed URLs

The crawler starts from seed URLs—the initial list of pages. Seed URLs decide the crawling scope. For example, if you want mobile phone product data on an e-commerce site, the seed URL can be that category page. The crawler first puts seed URLs into a queue.

2.2 Sending Requests: Obtaining Web Page Responses

The crawler takes a URL from the queue and sends an HTTP/HTTPS request. A request typically includes the target URL, the request method (usually GET or POST), and request headers such as User-Agent, Cookie, and Referer.

The server returns a response that includes the HTML source code and a status code (for example: 200 success, 404 not found, 500 server error).
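
How a crawler reacts to different status codes is easiest to see in code. Below is a minimal sketch using the Requests library; the URL is a placeholder and the branches simply print what a real crawler would log or act on:

import requests

url = "https://example.com/mobile"  # placeholder URL for illustration
headers = {"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"}

try:
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        print("Success, HTML length:", len(response.text))
    elif response.status_code == 404:
        print("Page not found, skip this URL")
    elif response.status_code >= 500:
        print("Server error, retry later")
    else:
        print("Other status code:", response.status_code)
except requests.RequestException as e:
    print("Request failed:", e)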

Introduction to HTTP requests: HTTP Request Methods

Detailed introduction to status codes: HTTP Status Codes Explained: Meanings and Real-World Use

Understanding how HTTP works under the hood helps crawler requests behave more politely and reliably; see an authoritative overview of HTTP protocol fundamentals to solidify these concepts.

Note: set a realistic User-Agent, add a reasonable delay between requests, and handle timeouts and error status codes gracefully so the crawler does not overload the target server.

2.3 Parsing Responses: Extracting Target Data and New URLs

HTML is structured or semi-structured. Parsing extracts two types of information:

  1. Target data (product name, price, comments, etc.)
  2. New URLs (pagination links, detail pages)

Common parsing methods (compared in a short sketch after this list):

  1. Regular expressions (simple fixed patterns)
  2. HTML parsing libraries (Beautiful Soup, lxml)
  3. JSON parsing (AJAX endpoints often return JSON directly)
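
The sketch below applies all three approaches to a tiny, made-up HTML fragment and a made-up JSON response; the tag and class names mirror those used in the Section 2.6 example:

import re
import json
from bs4 import BeautifulSoup

# Made-up HTML fragment used only to compare the three approaches
html = '<div class="product-item"><h3 class="product-name">Phone A</h3><span class="product-price">$199</span></div>'

# 1. Regular expression: fine for simple, fixed patterns
price = re.search(r'<span class="product-price">(.*?)</span>', html).group(1)
print(price)  # $199

# 2. HTML parsing library: more robust when the markup changes
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h3", class_="product-name").get_text(strip=True))  # Phone A

# 3. JSON parsing: many AJAX endpoints return structured data directly
api_body = '{"products": [{"name": "Phone A", "price": "$199"}]}'
data = json.loads(api_body)
print(data["products"][0]["name"])  # Phone A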

2.4 Storing Data: Persistent Saving

You should persist extracted data to avoid loss. Common storage options include flat files (CSV, JSON, Excel), relational databases (MySQL, PostgreSQL, SQLite), and NoSQL stores (MongoDB, Redis); choose based on data volume and how the data will be used downstream.

2.5 Loop Iteration: Updating the URL Queue

The crawler adds newly discovered URLs into the queue and repeats steps 2.2–2.4 until stop conditions are met (queue empty, page limit reached, time limit reached, etc.). This loop lets a crawler “go deeper” automatically. For a practical Python implementation that follows this workflow step by step with real data extraction, refer to our Python web crawler tutorial, which shows how these stages translate into runnable code.

2.6 Basic Example Code of Web Crawler

The following is a minimal Python 3 crawler example built with Requests and Beautiful Soup:

import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin  # Used to splice relative URLs into absolute URLs

# ---------------------- 1. Initialization: Determine seed URLs and queue to be crawled ----------------------
seed_url = "https://example.com/mobile"  # Replace with an actual accessible public URL
to_crawl = [seed_url]  # Queue of URLs to be crawled
crawled = set()  # Set of crawled URLs (to avoid repeated crawling)
target_data = []  # Store extracted target data

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# ---------------------- 2. Loop iteration: Execute crawling process ----------------------
max_crawl_count = 2
while to_crawl and len(crawled) < max_crawl_count:
    current_url = to_crawl.pop(0)
    if current_url in crawled:
        continue
    
    try:
        # ---------------------- 3. Send request: Obtain web page response ----------------------
        response = requests.get(current_url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        crawled.add(current_url)
        print(f"Crawling: {current_url}")

        # ---------------------- 4. Parse response: Extract target data and new URLs ----------------------
        soup = BeautifulSoup(response.text, "html.parser")
        
        product_items = soup.find_all("div", class_="product-item")
        for item in product_items:
            product_name = item.find("h3", class_="product-name")
            product_price = item.find("span", class_="product-price")
            if product_name and product_price:
                target_data.append({
                    "name": product_name.get_text(strip=True),
                    "price": product_price.get_text(strip=True)
                })
        
        next_page = soup.find("a", class_="next-page")
        if next_page:
            next_page_url = urljoin(current_url, next_page.get("href"))
            if next_page_url not in crawled and next_page_url not in to_crawl:
                to_crawl.append(next_page_url)
    
    except Exception as e:
        print(f"Failed to crawl {current_url}: {str(e)}")
        continue

# ---------------------- 5. Store data: Persistently save to CSV ----------------------
with open("mobile_products.csv", "w", newline="", encoding="utf-8") as f:
    fieldnames = ["name", "price"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(target_data)

print(f"Crawling completed! A total of {len(crawled)} pages crawled, {len(target_data)} product data extracted, saved to mobile_products.csv")

Even though this code is concise, it already includes key crawler modules. Next, let’s look at the system components more clearly.


III. Core Components in Web Crawling Basics

A complete crawler usually includes the following components:

3.1 URL Manager

The URL manager maintains two collections: the queue of URLs waiting to be crawled and the set of URLs that have already been crawled.

Its goal is to avoid duplicate crawling and circular crawling (A → B → A). Small crawlers often store URLs in memory, while large crawlers store URLs in databases to support resumable crawling.
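
As a rough illustration of how a URL manager could persist its state for resumable crawling, here is a minimal sketch backed by SQLite; the class name, table layout, and database file name are illustrative, not part of any particular framework:

import sqlite3

class SQLiteURLManager:
    """Minimal persistent URL manager: survives restarts and avoids duplicates."""

    def __init__(self, db_path="crawler_urls.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, crawled INTEGER DEFAULT 0)"
        )
        self.conn.commit()

    def add_url(self, url):
        # PRIMARY KEY + OR IGNORE makes re-adding a known URL a no-op,
        # which also blocks circular crawling (A -> B -> A)
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def get_next_url(self):
        row = self.conn.execute("SELECT url FROM urls WHERE crawled = 0 LIMIT 1").fetchone()
        return row[0] if row else None

    def mark_crawled(self, url):
        self.conn.execute("UPDATE urls SET crawled = 1 WHERE url = ?", (url,))
        self.conn.commit()

Because the state lives on disk, a crawler using this manager can be stopped and restarted without re-crawling pages it has already finished.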

3.2 Request Module (Downloader)

The downloader sends requests and receives responses while simulating browser behavior. Common tooling includes Requests and urllib for static pages, aiohttp for asynchronous fetching, and Selenium or Playwright for pages rendered by JavaScript.
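
To sketch what a slightly more robust downloader might look like with Requests, the example below adds automatic retries on transient errors and a fixed delay between requests; the retry settings, delay, and User-Agent string are illustrative choices, not fixed recommendations:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session with automatic retries on transient failures (values are illustrative)
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))
session.mount("http://", HTTPAdapter(max_retries=retry))
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)"})

def download(url, delay=2):
    """Fetch a page politely: fixed delay before each request plus retry on failure."""
    time.sleep(delay)  # simple rate limiting
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text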

3.3 Parsing Module (Parser)

The parser extracts target fields and new links. Common options include regex, Beautiful Soup, lxml/XPath, and JSON parsing libraries. Parsing choices affect both accuracy and performance.

3.4 Storage Module (Storage)

The storage module persists cleaned data into files or databases. Choose storage based on volume and downstream requirements.
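
For instance, a minimal sketch of persisting the same name/price records into SQLite instead of CSV might look like this; the database file and table names are assumptions made for illustration:

import sqlite3

def save_products(records, db_path="products.db"):
    """Persist a list of {'name': ..., 'price': ...} dicts into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")
    conn.executemany(
        "INSERT INTO products (name, price) VALUES (:name, :price)",
        records,
    )
    conn.commit()
    conn.close()

save_products([{"name": "Phone A", "price": "$199"}])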

3.5 Scheduler

The scheduler coordinates URL management, downloading, parsing, and storage. It also controls crawl rhythm (delays), concurrency, retries, and exception handling.

To explore how modern crawlers handle dynamic content, proxy rotation, and modular scheduling in practice, see related guides on dynamic rendering with Playwright and proxy strategies for scraping.


IV. Entry-Level Crawler Tools and Technology Selection

4.1 Programming Language Selection

4.2 Recommended Entry-Level Tools

4.2.1 Basic Libraries

For simple tasks, combine Requests (sending HTTP requests) with Beautiful Soup or lxml (parsing HTML), as in the example in Section 2.6.

4.2.2 Crawler Frameworks

Scrapy is a popular Python crawling framework with built-in scheduling, concurrency, retry logic, and extensibility via middleware.

4.2.3 No-Code/Low-Code Tools

If you do not need complex custom logic, you can use no-code tools (e.g., Bazhuayu Collector, Houyi Collector). These tools let you define collection rules visually (by clicking and dragging) instead of writing code.

4.3 Basic Usage of Scrapy Crawler Framework

4.3.1 What is Scrapy?

Scrapy is an open-source crawling framework in Python for efficient and structured data extraction. It provides a complete pipeline (request scheduling → parsing → storage) and suits larger data collection tasks.

4.3.2 Installation of Scrapy

pip install scrapy
scrapy startproject basic_scrapy_crawler
cd basic_scrapy_crawler
scrapy genspider product_crawler example.com

4.3.3 Writing a Basic Scrapy Crawler

Edit the generated spider file basic_scrapy_crawler/spiders/product_crawler.py:

import scrapy
from basic_scrapy_crawler.items import BasicScrapyCrawlerItem 

class ProductCrawlerSpider(scrapy.Spider):
    name = "product_crawler"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/mobile"]

    def parse(self, response):
        product_items = response.xpath('//div[@class="product-item"]')
        for item in product_items:
            product_item = BasicScrapyCrawlerItem()
            product_item["name"] = item.xpath('.//h3[@class="product-name"]/text()').get(default="").strip()
            product_item["price"] = item.xpath('.//span[@class="product-price"]/text()').get(default="").strip()
            if product_item["name"] and product_item["price"]:
                yield product_item

        next_page = response.xpath('//a[@class="next-page"]/@href').get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_url, callback=self.parse)  # Scrapy's built-in dedup filter skips URLs already requested
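
The spider imports BasicScrapyCrawlerItem from basic_scrapy_crawler.items; a matching definition in basic_scrapy_crawler/items.py, covering only the two fields used above, would look roughly like this:

import scrapy

class BasicScrapyCrawlerItem(scrapy.Item):
    # Only the fields the spider above actually fills in
    name = scrapy.Field()
    price = scrapy.Field()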

Example project settings (in basic_scrapy_crawler/settings.py):

ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'basic_scrapy_crawler.pipelines.BasicScrapyCrawlerPipeline': 300,
}

FEEDS = {
    'mobile_products_scrapy.csv': {
        'format': 'csv',
        'fields': ['name', 'price'],
        'overwrite': True,
        'encoding': 'utf-8-sig'
    }
}

Run the crawler:

scrapy crawl product_crawler

If you are considering which framework or library to use at scale, a comparison of entry-level crawling tools vs production frameworks can clarify when to adopt each.


V. Key Processes and Precautions for Data Collection

5.1 Core Processes of Data Collection

  1. Requirement analysis (what data, from where, how much, for what)
  2. Target website analysis (structure, data loading, robots.txt)
  3. Crawler development (tools, headers, delays, storage)
  4. Testing and debugging (small-scale validation)
  5. Batch crawling (monitoring, stability checks)
  6. Data cleaning (deduplication, normalization, missing values; see the sketch after this list)
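
As a small illustration of the cleaning step, the sketch below uses pandas on the CSV produced in Section 2.6; the specific rules (what counts as a duplicate, how prices are normalized) are assumptions you would adapt to your own data:

import pandas as pd

# Load the CSV produced by the Section 2.6 example
df = pd.read_csv("mobile_products.csv")

# Deduplication: drop rows with identical name and price
df = df.drop_duplicates(subset=["name", "price"])

# Missing values: drop rows without a name
df = df.dropna(subset=["name"])

# Normalization: strip currency symbols and convert the price to a number
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

df.to_csv("mobile_products_clean.csv", index=False)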

5.2 Core Precautions: Compliance and Anti-Crawling Response

5.2.1 Principles of Compliance

5.2.2 Common Anti-Crawling Mechanisms and Countermeasures

Typical anti-crawling mechanisms include IP-based rate limiting, User-Agent and header checks, CAPTCHAs, login walls, and JavaScript-rendered content. Common countermeasures, within the compliance limits above, are adding request delays, rotating proxy IPs and User-Agent strings, sending realistic headers, and using headless browsers for dynamic pages.
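
A minimal sketch of two of these countermeasures with Requests is shown below; the proxy addresses and User-Agent strings are placeholders and must be replaced with proxies and identifiers you are authorized to use:

import random
import requests

# Placeholder proxies and User-Agents -- replace with real, authorized values
PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def polite_get(url):
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # Requests routes traffic through the proxy given per scheme
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)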

Related proxy details:

OCR material: Train Your Own OCR Model from Scratch with PaddleOCR


VI. Summary and Advanced Directions

6.1 Web Crawling Basics: Summary

Web crawling and data collection basics focus on automatically collecting legal and compliant public data. The standard workflow is:

seed URLs → requests → parsing → storage → iteration

Core components include URL manager, downloader, parser, storage, and scheduler. Beginners can start with Python + Requests + Beautiful Soup, then move to Scrapy for stronger scalability and maintainability.

6.2 Advanced Directions

  1. Distributed crawlers (Redis/message queues for multi-node crawling)
  2. Dynamic rendering (Selenium/Playwright)
  3. Anti-bot and counter-strategies (fingerprints, proxies, rate control)
  4. Data visualization and analytics (Matplotlib, Tableau)

When moving beyond custom crawlers, consider how web scraping APIs can offload anti-bot handling and offer scalable, production-grade data access.

Related Guides

  1. Web Crawler Technology: Principles, Architecture, Applications, and Risks
  2. Crawling HTML Pages: Python Web Crawler Tutorial
  3. Build a High-Performance Crawler with Rust Async
  4. Analysis of Rust Async Principles
  5. FRP Intranet Penetration for Web Crawling: Expose Internal Services Safely
  6. ZeroTier Intranet Penetration for Web Crawling: No Public IP Required (Part 2)
  7. Tailscale Intranet Penetration for Web Crawling: Zero-Config Remote Access (Part 3)
  8. HTTP Protocol: The Invisible Foundation of the Internet
  9. HTTP Request Methods
  10. HTTP Status Codes Explained: Meanings and Real-World Use