In the previous article, [Big Data Computing: Batch Processing](https://xx/Big Data Computing:Batch Processing), we explored the principles, architecture, frameworks, and application scenarios of batch processing, along with its limitations.
While batch processing excels in high throughput, large-scale data handling, and accuracy, it suffers from high latency.
For scenarios requiring second- or minute-level statistics, batch processing falls short. To meet latency-sensitive business needs, real-time processing emerged — and is evolving rapidly.
This article focuses on real-time computing, examining its architecture and mainstream frameworks.
What Is Real-Time Processing?
Real-time processing refers to a computation model for continuous data streams.
Instead of waiting for data to accumulate in batches (as in batch processing), real-time systems process data immediately after it is generated, producing results within a defined short time window.
Core concepts include:
- Continuous Data Generation – As long as the source system runs, data keeps flowing; for example, application logs or sensor readings from IoT devices.
- Real-Time Data Transmission – Newly generated data is transmitted immediately to the processing system via WebSocket, HTTP APIs, or messaging systems.
- Immediate Data Processing – Upon arrival, data is parsed, filtered, aggregated, or joined in real time.
- Instant Output – Processed results can be sent to another real-time system or stored in a database for consumption (a minimal consumer sketch follows this list).
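To make these steps concrete, here is a minimal sketch of the generate, transmit, process, output loop, assuming a local Kafka broker and the kafka-python client; the topic name `app-logs` and the record fields are hypothetical stand-ins, not part of any real system described above.

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Subscribe to a hypothetical topic that receives application logs as they are produced.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# The loop yields each record as soon as it arrives (real-time transmission),
# processes it immediately (a simple filter here), and emits the result right away.
for record in consumer:
    event = record.value
    if event.get("level") == "ERROR":   # immediate processing: filter
        print(event)                    # instant output: stand-in for a real sink
```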
Key characteristics:
- Unbounded Data – Data streams are continuous and theoretically infinite, often referred to as “unbounded datasets.”
- Low Latency – The time from data generation to result consumption is minimal, often seconds or milliseconds. Since data value decays over time, minimizing latency is critical.
- Continuous Computation – With infinite input, the system must process data continuously.
- Incremental Processing – Because streams are unbounded, systems typically process data incrementally instead of storing the entire dataset (see the sketch after this list).
- High Availability – Downtime can cause data loss and inaccurate results, so real-time systems must be resilient and able to recover automatically from failures.
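As a plain-Python illustration of incremental processing, the sketch below keeps only a running aggregate per key rather than the full stream; the event field `user_id` is a hypothetical example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def incremental_counts(events: Iterable[Dict]) -> Iterator[Tuple[str, int]]:
    """Emit a running count per user without materializing the whole stream."""
    counts: Dict[str, int] = defaultdict(int)   # the only state kept in memory
    for event in events:                         # 'events' may be unbounded
        key = event["user_id"]                   # hypothetical field name
        counts[key] += 1                         # incremental state update
        yield key, counts[key]                   # result available immediately

# Usage: feed it any (possibly infinite) iterator of events.
sample = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]
for key, count in incremental_counts(sample):
    print(key, count)   # -> u1 1, u2 1, u1 2
```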
Real-Time Processing Architecture
A typical real-time computing architecture consists of the following layers:
- Data Source – Generates raw data, such as databases or application logs.
- Data Ingestion – Collects data in real time from business systems, logging services, or databases (e.g., Flume, Canal).
- Message Queue – Buffers and transports data streams (e.g., Kafka, Pulsar); a minimal producer sketch feeding the queue follows this list.
- Compute Engine – The core processing component, supporting windowed computation, state management, etc. (e.g., Storm, Flink).
- Resource Management & Scheduling – Manages cluster resources and schedules tasks in distributed environments (commonly YARN).
- Result Storage – Persists output (e.g., Redis, ClickHouse).
- Monitoring & Visualization – Tracks key metrics to quickly detect and resolve anomalies.
Real-Time Processing Frameworks
- Spark Structured Streaming – Built on Spark’s in-memory engine, it executes streams as a series of micro-batches, achieving second-level latency (see the windowed-count sketch after this list).
- Storm – A standalone distributed computation framework that organizes work as topologies. It is not part of the Hadoop ecosystem, and its ecosystem growth has slowed.
- Flink – A native stream processing engine designed for real-time workloads, supporting event time, stateful computation, and windowing, with millisecond-level latency.
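As a small PySpark sketch of the micro-batch model, the job below counts Kafka records per one-minute window; the broker address and the topic name `events` are hypothetical, and it assumes the Spark Kafka connector package is available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("StreamingCounts").getOrCreate()

# Read an unbounded stream from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# The Kafka source exposes a 'timestamp' column; count records per 1-minute tumbling window.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

# Each micro-batch refreshes the result; print it to the console for illustration.
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```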
Application Scenarios
Real-time computing has many use cases, driving its rapid adoption:
- Real-Time Recommendation – Personalizes recommendations instantly based on a user’s current behavior, improving accuracy and relevance.
- Fraud Detection & Risk Control – Monitors sensitive behaviors or transactions in real time to detect anomalies early and minimize potential losses.
- Real-Time Analytics – Tracks metrics such as sales volume or online user counts in real time, enabling instant decision-making and maximizing data value.
Limitations & Challenges
While mature in many respects, real-time computing still faces challenges:
- Out-of-Order Data – Network delays can cause events to arrive out of sequence, requiring mechanisms such as event time and watermarks to produce correct results despite late data (see the watermark sketch after this list).
- Complex State Management – Long-running jobs must maintain state over time and restore it during recovery to ensure accuracy.
- System Complexity – Real-time pipelines are often long and latency-sensitive, increasing operational complexity.
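As one common way to handle out-of-order events, Structured Streaming’s watermark API declares how late data may arrive before a window is finalized. The sketch below uses the built-in `rate` source (which itself emits in-order rows) purely to keep the example self-contained; in practice the input would be a real stream with late arrivals.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("WatermarkExample").getOrCreate()

# The built-in 'rate' source emits rows with 'timestamp' and 'value' columns.
events = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Accept events up to 5 minutes late; older windows are finalized and their state dropped.
counts = (
    events.withWatermark("timestamp", "5 minutes")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

# 'append' mode emits each window once, after the watermark passes its end.
query = (
    counts.writeStream.outputMode("append")
    .format("console")
    .start()
)
query.awaitTermination()
```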
Conclusion
Real-time processing continuously analyzes data streams with low latency, making it an essential complement to batch processing.
It plays an irreplaceable role in latency-sensitive domains such as real-time recommendation and fraud detection.
In the next article, we will compare batch processing and real-time processing to deepen understanding of their respective strengths and trade-offs.