In the previous three articles (Asynchronous Programming with async, [Deep Dive into Rust’s async Mechanism](https://xx/Rust async Technical Deep Dive), and Practical Applications of async), we progressively explored asynchronous programming—from its fundamental concepts to Rust’s async implementation and real-world applications. In this article, we continue with the theme of asynchronous programming by applying the discussed concepts to a practical project: building a high-performance web crawler using Rust async.
Why Rust?
Rust is a modern systems programming language focused on safety, speed, and concurrency. Its performance rivals C/C++, and its ownership-based concurrency model, combined with Tokio's mature async runtime, comfortably supports enormous numbers of concurrent tasks. Rust also guarantees memory and thread safety at compile time, with no garbage collector, ruling out GC pauses and entire classes of crashes and data races that plague long-running crawler systems. This makes Rust exceptionally well-suited for developing high-performance crawlers.
However, these advantages also come with a learning curve. If you only need to scrape a few pages without high performance demands, Rust might not be the best choice—Python or Go could meet your needs more quickly.
The following dependencies are recommended for building a crawler system in Rust:
| Requirement | Recommended Crate |
|---|---|
| Async Runtime | `tokio` |
| HTTP Client | `reqwest` |
| HTML Parsing | `scraper` |
| URL Deduplication | `dashmap` |
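These can be declared in `Cargo.toml` roughly as follows; the version numbers below are illustrative, not prescriptive, and `async-trait` is included because later modules define async methods on traits:

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }  # async runtime
reqwest = "0.12"                                # HTTP client (pools connections via hyper)
scraper = "0.20"                                # HTML parsing with CSS selectors
dashmap = "6"                                   # concurrent set/map for URL deduplication
async-trait = "0.1"                             # async methods in traits
```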
Architecture Design
To construct a high-performance async crawler system, we modularize its design into the following components:
- Requester: Handles HTTP request construction, including custom User-Agent and other request-related configurations.
- Fetcher: Executes HTTP requests, manages concurrency, and implements delayed requests.
- Parser: Processes HTML responses, extracts target data, and handles new links.
- Pipeline: Asynchronously stores structured data.
- Scheduler: Manages task scheduling for the crawler.
Module Implementation
Requester
The Requester module constructs standardized HTTP requests using `reqwest::RequestBuilder` to set headers, the User-Agent, and other attributes.
Define a `Requester` struct with fields such as `user_agent` and `delay_ms` (the per-request delay). Example code:
```rust
use reqwest::{Client, RequestBuilder};
use std::time::Duration;

#[derive(Clone)]
pub struct Requester {
    client: Client,
    pub user_agent: String,
    pub delay_ms: u64,
}

impl Requester {
    pub fn new(user_agent: &str, delay_ms: u64) -> Self {
        // Build one Client up front so connections are pooled and reused.
        let client = Client::builder()
            .user_agent(user_agent)
            .timeout(Duration::from_secs(30))
            .build()
            .expect("failed to build HTTP client");
        Self {
            client,
            user_agent: user_agent.to_string(),
            delay_ms,
        }
    }

    // Return a builder so callers can attach extra headers before sending.
    pub fn build_request(&self, url: &str) -> RequestBuilder {
        self.client.get(url)
    }
}
```
Fetcher
The Fetcher asynchronously executes HTTP requests using tokio
and reqwest
.
Define a Fetcher
struct that utilizes Requester
to perform requests. Example code:
```rust
use crate::requester::Requester;
use tokio::time::{sleep, Duration};

pub struct Fetcher;

impl Fetcher {
    pub async fn fetch(requester: &Requester, url: &str) -> Option<String> {
        // Politeness delay before each request.
        sleep(Duration::from_millis(requester.delay_ms)).await;
        let request = requester.build_request(url);
        match request.send().await {
            Ok(resp) => match resp.text().await {
                Ok(body) => Some(body),
                Err(e) => {
                    eprintln!("failed to read response: {e:?}");
                    None
                }
            },
            Err(e) => {
                eprintln!("failed to request: {e:?}");
                None
            }
        }
    }
}
```
Parser
The Parser processes HTML responses to extract target data and new links, leveraging scraper
for parsing.
To accommodate diverse websites, abstract the Parser
as a trait
and implement specific parsers for different tasks. Example code:
```rust
use async_trait::async_trait;

#[async_trait]
pub trait Parser: Send + Sync {
    async fn parse(&self, html: &str) -> ParseResult;
}

pub struct ParseResult {
    pub data: Vec<String>,
    pub new_links: Vec<String>,
}
```
Pipeline
The Pipeline asynchronously stores structured data, such as saving it to a database.
Given the variety of storage options, define Pipeliner
as a trait
. Example code:
```rust
use async_trait::async_trait;

// Generic over the item type so different storage backends can reuse the trait.
#[async_trait]
pub trait Pipeline<T: Send + 'static>: Send + Sync {
    async fn process(&self, data: Vec<T>);
}
```
Scheduler
The Scheduler manages crawler task scheduling, serving as the core module. It uses tokio::sync::mpsc
for task queues and dashmap
for URL deduplication.
```rust
use dashmap::DashSet;
use std::sync::Arc;
use tokio::sync::mpsc::Sender;

pub struct Scheduler {
    seen: Arc<DashSet<String>>,
    sender: Sender<String>,
}

impl Scheduler {
    pub fn new(sender: Sender<String>) -> Self {
        Self {
            seen: Arc::new(DashSet::new()),
            sender,
        }
    }

    pub fn try_enqueue(&self, url: String) {
        // insert() returns true only for URLs not seen before,
        // so duplicates are silently dropped.
        if self.seen.insert(url.clone()) {
            let _ = self.sender.try_send(url);
        }
    }
}
```
Robustness Enhancements
While the above modules form the foundation, ensuring long-term stability and efficiency requires addressing the following aspects:
- Retry Mechanism: Automatically retry requests that fail due to network instability or invalid links, while logging failures.
- Task Isolation: Use `tokio::spawn` to isolate tasks, preventing interference between them.
- URL Validation: Validate URLs before crawling to avoid unnecessary retries and ensure link integrity.
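A retry mechanism is usually paired with exponential backoff so repeated failures do not hammer a struggling server. Below is a small backoff helper plus, in comments, how a retry wrapper around the `Fetcher` above might look; the base delay, cap, and function names are choices made for this sketch:

```rust
use std::time::Duration;

// Exponential backoff: base * 2^attempt, capped at 30s (an arbitrary
// cap for this sketch) to avoid unbounded sleeps.
pub fn backoff(base_ms: u64, attempt: u32) -> Duration {
    let ms = base_ms
        .saturating_mul(2u64.saturating_pow(attempt))
        .min(30_000);
    Duration::from_millis(ms)
}

// Sketch of a retry wrapper around Fetcher::fetch (names follow the modules above):
//
// pub async fn fetch_with_retry(requester: &Requester, url: &str, max_retries: u32) -> Option<String> {
//     for attempt in 0..=max_retries {
//         if let Some(body) = Fetcher::fetch(requester, url).await {
//             return Some(body);
//         }
//         eprintln!("attempt {attempt} failed for {url}");
//         tokio::time::sleep(backoff(500, attempt)).await;
//     }
//     None
// }
```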
Performance Optimization
As the crawler scales, performance can be enhanced through:
- Connection Reuse: Leverage `reqwest`'s built-in `hyper` connection pooling for TCP reuse.
- Multi-Task Scheduling: Split task queues across modules to improve throughput.
- Distributed Deployment: Scale horizontally with multi-instance deployment, distributing URLs via message queues.
- Incremental Crawling: Save crawl progress to resume efficiently after failures.
Conclusion
Rust’s focus on safety, speed, and concurrency provides system-level performance and memory safety guarantees. Coupled with its vibrant ecosystem, Rust empowers developers to build high-performance, stable, and scalable crawler systems.
This article dissected a crawler system into modular components, addressed robustness challenges, and proposed optimizations to meet evolving demands.