Web crawling basics explain how automated systems collect public data from websites at scale (for a complete overview, see our web scraping API guide). In the digital age, data has become a core production factor: business analysis, academic research, and product development all depend on high-quality data. Web crawling automates the collection process, letting you gather public information from massive numbers of web pages efficiently and turn it into structured datasets for analysis. This guide covers web crawling and data collection basics with practical examples and compliance tips. To deepen your understanding of crawler design and its systemic challenges, see our detailed exploration of crawler technology principles and architecture, which explains how industry-grade crawlers manage scale and robustness.
I. Core Concepts: What are Web Crawling and Data Collection?
1.1 Web Crawler
A web crawler (also called a web spider or web robot) is a program or script that automatically browses the World Wide Web (WWW) and crawls web page data according to preset rules. Its core job is to simulate human browsing behavior and obtain content such as text, images, links, and tables in bulk, without manual page-by-page work.
Common crawler use cases include:
- Search engines (Google, Baidu) crawl pages and build indexes for search.
- E-commerce teams monitor competitor prices and inventory.
- Academic researchers collect literature data for analysis.
- Enterprises track industry news, user reviews, and market signals for decision-making.
1.2 Data Collection
Data collection means obtaining raw data through different approaches. Web crawling is one important method, but not the only one. Data can also come from manual entry, API calls, surveys, sensors, database exports, and more.
Compared with other methods, crawling is especially useful when you need public web data at scale and no official API exists.
A core principle applies to every data source: legality, compliance, and respect for rights and interests. No matter which method you use, follow relevant laws and the data source’s rules.
II. Web Crawling Basics: Core Workflow
A web crawler usually follows this loop:
discover links → visit pages → extract data → store data → iterate
Below are the core steps.
2.1 Initialization: Determine Seed URLs
The crawler starts from seed URLs—the initial list of pages. Seed URLs decide the crawling scope. For example, if you want mobile phone product data on an e-commerce site, the seed URL can be that category page. The crawler first puts seed URLs into a queue.
2.2 Sending Requests: Obtaining Web Page Responses
The crawler takes a URL from the queue and sends an HTTP/HTTPS request. A request typically includes:
- method (GET/POST)
- headers (User-Agent, Cookie)
- parameters
The server returns a response that includes the HTML source code and a status code (for example: 200 success, 404 not found, 500 server error).
Introduction to HTTP requests: HTTP Request Methods
Detailed introduction to status codes: HTTP Status Codes Explained: Meanings and Real-World Use
Understanding how HTTP works under the hood helps crawler requests behave more politely and reliably; see an authoritative overview of HTTP protocol fundamentals to solidify these concepts.
Note:
- User-Agent identifies the client. If you omit it or use an unrealistic value, servers may flag you as a bot.
- Cookie maintains login state and helps crawl content that requires authentication.
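For instance, a minimal request with the Python Requests library might look like the sketch below; the URL is a placeholder and the header mirrors the note above:

import requests

url = "https://example.com/mobile"  # placeholder URL for illustration
headers = {
    # A realistic User-Agent reduces the chance of being flagged as a bot
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)                    # e.g., 200, 404, 500
print(response.headers.get("Content-Type"))    # response metadata
print(response.text[:200])                     # first 200 characters of the HTML source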
2.3 Parsing Responses: Extracting Target Data and New URLs
HTML is structured or semi-structured. Parsing extracts two types of information:
- Target data (product name, price, comments, etc.)
- New URLs (pagination links, detail pages)
Common parsing methods:
- Regular expressions (simple fixed patterns)
- HTML parsing libraries (Beautiful Soup, lxml)
- JSON parsing (AJAX endpoints often return JSON directly)
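As a rough illustration of these options, the sketch below parses a small hand-written HTML fragment with Beautiful Soup and a JSON string with the standard library; the class names and values are made up for the example:

import json
from bs4 import BeautifulSoup

html = (
    '<div class="product-item">'
    '<h3 class="product-name">Phone X</h3>'
    '<span class="product-price">$399</span>'
    '<a href="/mobile?page=2">next</a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
name = soup.find("h3", class_="product-name").get_text(strip=True)      # target data
price = soup.find("span", class_="product-price").get_text(strip=True)  # target data
new_url = soup.find("a")["href"]                                        # new URL to enqueue
print(name, price, new_url)

# Many AJAX endpoints return JSON directly, which skips HTML parsing entirely
api_body = '{"products": [{"name": "Phone X", "price": "$399"}]}'
data = json.loads(api_body)
print(data["products"][0]["name"])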
2.4 Storing Data: Persistent Saving
You should persist extracted data to avoid loss. Common storage options:
- TXT/CSV (simple structured data)
- Databases (MySQL for structured data; MongoDB for semi-structured or large volumes)
- Excel (small batches for manual viewing)
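Beyond CSV (used in the full example later), a lightweight database such as SQLite is often enough for small projects. The following is a minimal sketch with hypothetical field names:

import sqlite3

rows = [("Phone X", "$399"), ("Phone Y", "$499")]  # hypothetical extracted rows

conn = sqlite3.connect("products.db")
conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT)")
conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)", rows)
conn.commit()
conn.close()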
2.5 Loop Iteration: Updating the URL Queue
The crawler adds newly discovered URLs into the queue and repeats steps 2.2–2.4 until stop conditions are met (queue empty, page limit reached, time limit reached, etc.). This loop lets a crawler “go deeper” automatically. For a practical Python implementation that follows this workflow step by step with real data extraction, refer to our Python web crawler tutorial, which shows how these stages translate into runnable code.
2.6 Basic Example Code of Web Crawler
The following is a minimal crawler example in Python 3, built only on Requests and Beautiful Soup:
import requests
from bs4 import BeautifulSoup
import csv
from urllib.parse import urljoin  # Used to join relative URLs into absolute URLs

# ---------------------- 1. Initialization: Determine seed URLs and queue to be crawled ----------------------
seed_url = "https://example.com/mobile"  # Replace with an actual accessible public URL
to_crawl = [seed_url]  # Queue of URLs to be crawled
crawled = set()        # Set of crawled URLs (to avoid repeated crawling)
target_data = []       # Store extracted target data
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# ---------------------- 2. Loop iteration: Execute crawling process ----------------------
max_crawl_count = 2
while to_crawl and len(crawled) < max_crawl_count:
    current_url = to_crawl.pop(0)
    if current_url in crawled:
        continue
    try:
        # ---------------------- 3. Send request: Obtain web page response ----------------------
        response = requests.get(current_url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        crawled.add(current_url)
        print(f"Crawling: {current_url}")

        # ---------------------- 4. Parse response: Extract target data and new URLs ----------------------
        soup = BeautifulSoup(response.text, "html.parser")
        product_items = soup.find_all("div", class_="product-item")
        for item in product_items:
            product_name = item.find("h3", class_="product-name")
            product_price = item.find("span", class_="product-price")
            if product_name and product_price:
                target_data.append({
                    "name": product_name.get_text(strip=True),
                    "price": product_price.get_text(strip=True)
                })
        next_page = soup.find("a", class_="next-page")
        if next_page:
            next_page_url = urljoin(current_url, next_page.get("href"))
            if next_page_url not in crawled and next_page_url not in to_crawl:
                to_crawl.append(next_page_url)
    except Exception as e:
        print(f"Failed to crawl {current_url}: {e}")
        continue

# ---------------------- 5. Store data: Persistently save to CSV ----------------------
with open("mobile_products.csv", "w", newline="", encoding="utf-8") as f:
    fieldnames = ["name", "price"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(target_data)

print(f"Crawling completed! {len(crawled)} pages crawled, {len(target_data)} product records extracted, saved to mobile_products.csv")
Even though this code is concise, it already contains the key crawler modules. Next, let’s look at these components more systematically.
III. Core Components in Web Crawling Basics
A complete crawler usually includes the following components:
3.1 URL Manager
The URL manager maintains:
- the queue of URLs to crawl
- the set of crawled URLs
Its goal is to avoid duplicate crawling and circular crawling (A → B → A). Small crawlers often store URLs in memory, while large crawlers store URLs in databases to support resumable crawling.
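A minimal in-memory URL manager can be sketched with a queue plus a set; this is an illustrative design rather than a production component:

from collections import deque

class URLManager:
    """Tracks URLs to crawl and URLs already crawled, avoiding duplicates and loops."""

    def __init__(self, seed_urls):
        self.to_crawl = deque(seed_urls)
        self.crawled = set()

    def add(self, url):
        # Skip URLs already visited or already queued (prevents A -> B -> A loops)
        if url not in self.crawled and url not in self.to_crawl:
            self.to_crawl.append(url)

    def next_url(self):
        url = self.to_crawl.popleft()
        self.crawled.add(url)
        return url

    def has_pending(self):
        return bool(self.to_crawl)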
3.2 Request Module (Downloader)
The downloader sends requests and receives responses while simulating browser behavior. Common tooling includes:
- Python Requests (simple, good for small crawlers)
- Scrapy downloader (concurrency, proxy support, cookies, retries)
- Java HttpClient, etc.
3.3 Parsing Module (Parser)
The parser extracts target fields and new links. Common options include regex, Beautiful Soup, lxml/XPath, and JSON parsing libraries. Parsing choices affect both accuracy and performance.
3.4 Storage Module (Storage)
The storage module persists cleaned data into files or databases. Choose storage based on volume and downstream requirements.
3.5 Scheduler
The scheduler coordinates URL management, downloading, parsing, and storage. It also controls crawl rhythm (delays), concurrency, retries, and exception handling.
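At its simplest, the scheduler is a loop that ties the other components together and enforces a polite delay between requests. A rough sketch, assuming a URLManager like the one above and placeholder download/parse/store functions, could look like this:

import time

def run_crawler(manager, download, parse, store, delay=2.0, max_pages=100):
    """Coordinates URL management, downloading, parsing, and storage with a fixed delay."""
    pages = 0
    while manager.has_pending() and pages < max_pages:
        url = manager.next_url()
        try:
            html = download(url)           # request module
            data, new_urls = parse(html)   # parsing module
            store(data)                    # storage module
            for new_url in new_urls:
                manager.add(new_url)
        except Exception as exc:
            print(f"Failed on {url}: {exc}")
        pages += 1
        time.sleep(delay)                  # crawl rhythm: be polite to the server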
To explore how modern crawlers handle dynamic content, proxy rotation, and modular scheduling in practice, see related guides on dynamic rendering with Playwright and proxy strategies for scraping.
IV. Entry-Level Crawler Tools and Technology Selection
4.1 Programming Language Selection
- Python (recommended): rich ecosystem, low learning cost, strong flexibility.
- Java/Go (alternatives): Java suits enterprise-grade stability; Go suits high concurrency; both have a steeper learning curve for beginners.
4.2 Recommended Entry-Level Tools
4.2.1 Basic Libraries
- Requests: send HTTP requests with headers/cookies
- Beautiful Soup: beginner-friendly HTML parsing
- lxml: high-performance parsing + XPath
- pandas: cleaning + export to CSV/Excel
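For example, a few lines of pandas are usually enough to deduplicate collected records and export them; the column names below are illustrative:

import pandas as pd

records = [
    {"name": "Phone X", "price": "$399"},
    {"name": "Phone X", "price": "$399"},   # duplicate to be removed
    {"name": "Phone Y", "price": "$499"},
]

df = pd.DataFrame(records).drop_duplicates()
df.to_csv("products_clean.csv", index=False, encoding="utf-8-sig")
print(df.head())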
4.2.2 Crawler Frameworks
Scrapy is a popular Python crawling framework with built-in scheduling, concurrency, retry logic, and extensibility via middleware.
4.2.3 No-Code/Low-Code Tools
If you do not need complex custom logic, you can use no-code tools (e.g., Bazhuayu Collector, Houyi Collector). These tools let you define collection rules visually (by dragging and clicking) without writing code.
4.3 Basic Usage of Scrapy Crawler Framework
4.3.1 What is Scrapy?
Scrapy is an open-source crawling framework in Python for efficient and structured data extraction. It provides a complete pipeline (request scheduling → parsing → storage) and suits larger data collection tasks.
4.3.2 Installation of Scrapy
pip install scrapy                             # Install Scrapy
scrapy startproject basic_scrapy_crawler       # Create a new Scrapy project
cd basic_scrapy_crawler
scrapy genspider product_crawler example.com   # Generate a spider skeleton
4.3.3 Writing a Basic Scrapy Crawler
Edit the generated spider file under basic_scrapy_crawler/spiders/ (product_crawler.py):
import scrapy
from basic_scrapy_crawler.items import BasicScrapyCrawlerItem

class ProductCrawlerSpider(scrapy.Spider):
    name = "product_crawler"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/mobile"]

    def parse(self, response):
        product_items = response.xpath('//div[@class="product-item"]')
        for item in product_items:
            product_item = BasicScrapyCrawlerItem()
            product_item["name"] = item.xpath('.//h3[@class="product-name"]/text()').get(default="").strip()
            product_item["price"] = item.xpath('.//span[@class="product-price"]/text()').get(default="").strip()
            if product_item["name"] and product_item["price"]:
                yield product_item

        next_page = response.xpath('//a[@class="next-page"]/@href').get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(url=next_page_url, callback=self.parse, dont_filter=True)
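The spider imports BasicScrapyCrawlerItem from the project’s items.py; a minimal item definition matching the fields used above might look like this:

# basic_scrapy_crawler/items.py
import scrapy

class BasicScrapyCrawlerItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()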
Example project settings (settings.py):
ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
    'basic_scrapy_crawler.pipelines.BasicScrapyCrawlerPipeline': 300,
}
FEEDS = {
    'mobile_products_scrapy.csv': {
        'format': 'csv',
        'fields': ['name', 'price'],
        'overwrite': True,
        'encoding': 'utf-8-sig',
    },
}
Run the crawler:
scrapy crawl product_crawler
If you are considering which framework or library to use at scale, a comparison of entry-level crawling tools vs production frameworks can clarify when to adopt each.
V. Key Processes and Precautions for Data Collection
5.1 Core Processes of Data Collection
- Requirement analysis (what data, from where, how much, for what)
- Target website analysis (structure, data loading, robots.txt)
- Crawler development (tools, headers, delays, storage)
- Testing and debugging (small-scale validation)
- Batch crawling (monitoring, stability checks)
- Data cleaning (deduplication, normalization, missing values)
5.2 Core Precautions: Compliance and Anti-Crawling Response
5.2.1 Principles of Compliance
- Follow robots.txt (e.g., https://www.yahoo.com/robots.txt)
- Respect terms of service
- Protect IP/copyright/privacy
- Control request frequency to avoid harming servers
For a legal and ethical overview of when web crawling is permitted, refer to relevant guidelines such as the robots.txt standard and privacy-focused regulations.
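Python’s standard library ships urllib.robotparser, which makes the first principle easy to automate; here is a minimal sketch using the Yahoo robots.txt mentioned above (the crawler name is hypothetical):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.yahoo.com/robots.txt")
rp.read()

user_agent = "MyCrawler/1.0"              # hypothetical crawler identifier
target = "https://www.yahoo.com/news"
if rp.can_fetch(user_agent, target):
    print("Allowed by robots.txt, proceed politely")
else:
    print("Disallowed by robots.txt, skip this URL")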
5.2.2 Common Anti-Crawling Mechanisms and Countermeasures
- User-Agent verification → set a realistic UA
- IP blocking → rotate proxies + control per-IP rate
- Cookie verification → maintain session cookies
- Dynamic loading (AJAX) → call JSON endpoints or use Selenium/Playwright
- CAPTCHA → OCR for simple cases; complex CAPTCHAs may need manual input or services
Related reading on proxies and OCR:
- Understanding SOCKS5 Proxies
- Building a Proxy Pool
- Train Your Own OCR Model from Scratch with PaddleOCR
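As a rough illustration of the “rotate proxies + control per-IP rate” idea above, the sketch below cycles through a placeholder proxy list with Requests; the proxy addresses are assumptions, not working endpoints:

import itertools
import time
import requests

# Placeholder proxies for illustration only
proxies_pool = [
    {"http": "http://127.0.0.1:8001", "https": "http://127.0.0.1:8001"},
    {"http": "http://127.0.0.1:8002", "https": "http://127.0.0.1:8002"},
]
proxy_cycle = itertools.cycle(proxies_pool)

urls = ["https://example.com/mobile?page=1", "https://example.com/mobile?page=2"]
for url in urls:
    proxy = next(proxy_cycle)              # rotate to the next proxy
    try:
        resp = requests.get(url, proxies=proxy, timeout=10)
        print(url, resp.status_code)
    except requests.RequestException as exc:
        print(f"Request via {proxy['http']} failed: {exc}")
    time.sleep(2)                          # keep the per-IP request rate low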
VI. Summary and Advanced Directions
6.1 Web Crawling Basics: Summary
Web crawling and data collection basics come down to collecting public data automatically while staying legal and compliant. The standard workflow is:
seed URLs → requests → parsing → storage → iteration
Core components include URL manager, downloader, parser, storage, and scheduler. Beginners can start with Python + Requests + Beautiful Soup, then move to Scrapy for stronger scalability and maintainability.
6.2 Advanced Directions
- Distributed crawlers (Redis/message queues for multi-node crawling)
- Dynamic rendering (Selenium/Playwright)
- Anti-bot and counter-strategies (fingerprints, proxies, rate control)
- Data visualization and analytics (Matplotlib, Tableau)
When moving beyond custom crawlers, consider how web scraping APIs can offload anti-bot handling and offer scalable, production-grade data access.
Related Guides
- Web Crawler Technology: Principles, Architecture, Applications, and Risks
- Crawling HTML Pages: Python Web Crawler Tutorial
- Build a high-performance crawler with Rust async
- Analysis of Rust Async Principles
- FRP Intranet Penetration for Web Crawling: Expose Internal Services Safely
- ZeroTier Intranet Penetration for Web Crawling: No Public IP Required (Part 2)
- Tailscale Intranet Penetration for Web Crawling: Zero-Config Remote Access (Part 3)
- HTTP Protocol: The Invisible Foundation of the Internet
- HTTP Request Methods
- HTTP Status Codes Explained: Meanings and Real-World Use