In the previous article, [Big Data Computing: Batch Processing](https://xx/Big Data Computing:Batch Processing), we explored the principles, architecture, frameworks, and application scenarios of batch processing, along with its limitations.
While batch processing excels in high throughput, large-scale data handling, and accuracy, it suffers from high latency.
For scenarios requiring second- or minute-level statistics, batch processing falls short. To meet latency-sensitive business needs, real-time processing emerged — and is evolving rapidly.
This article focuses on real-time computing, examining its architecture and mainstream frameworks.
What Is Real-Time Processing?
Real-time processing refers to a computation model for continuous data streams.
Instead of waiting for data to accumulate in batches (as in batch processing), real-time systems process data immediately after it is generated, producing results within a defined short time window.
Core concepts include:
- Continuous Data Generation – As long as the source system runs, data keeps flowing; for example, application logs or sensor readings from IoT devices.
- Real-Time Data Transmission – Newly generated data is transmitted immediately to the processing system via WebSocket, HTTP APIs, or messaging systems.
- Immediate Data Processing – Upon arrival, data is parsed, filtered, aggregated, or joined in real time.
- Instant Output – Processed results can be sent to another real-time system or stored in a database for consumption (a minimal consumer sketch follows this list).
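To make these steps concrete, here is a minimal sketch of the generate, transmit, process, output loop, assuming a local Kafka broker and the kafka-python client; the topic name `app-logs` and the record fields are hypothetical stand-ins, not part of any real system described above.

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Subscribe to a hypothetical topic that receives application logs as they are produced.
consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# The loop yields each record as soon as it arrives (real-time transmission),
# processes it immediately (a simple filter here), and emits the result right away.
for record in consumer:
    event = record.value
    if event.get("level") == "ERROR":   # immediate processing: filter
        print(event)                    # instant output: stand-in for a real sink
```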
Key characteristics:
- Unbounded Data – Data streams are continuous and theoretically infinite, often referred to as “unbounded datasets.”
- Low Latency – The time from data generation to result consumption is minimal, often seconds or milliseconds. Since data value decays over time, minimizing latency is critical.
- Continuous Computation – With infinite input, the system must process data continuously.
- Incremental Processing – Because streams are unbounded, systems typically process data incrementally instead of storing the entire dataset (see the sketch after this list).
- High Availability – Downtime can cause data loss and inaccurate results, so real-time systems must be resilient and able to recover automatically from failures.
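As a plain-Python illustration of incremental processing, the sketch below keeps only a running aggregate per key rather than the full stream; the event field `user_id` is a hypothetical example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def incremental_counts(events: Iterable[Dict]) -> Iterator[Tuple[str, int]]:
    """Emit a running count per user without materializing the whole stream."""
    counts: Dict[str, int] = defaultdict(int)   # the only state kept in memory
    for event in events:                         # 'events' may be unbounded
        key = event["user_id"]                   # hypothetical field name
        counts[key] += 1                         # incremental state update
        yield key, counts[key]                   # result available immediately

# Usage: feed it any (possibly infinite) iterator of events.
sample = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]
for key, count in incremental_counts(sample):
    print(key, count)   # -> u1 1, u2 1, u1 2
```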
Real-Time Processing Architecture
A typical real-time computing architecture consists of the following layers:
- Data Source – Generates raw data, such as databases or application logs.
- Data Ingestion – Collects data in real time from business systems, logging services, or databases (e.g., Flume, Canal).
- Message Queue – Buffers and transports data streams (e.g., Kafka, Pulsar); a minimal producer sketch feeding the queue follows this list.
- Compute Engine – The core processing component, supporting windowed computation, state management, etc. (e.g., Storm, Flink).
- Resource Management & Scheduling – Manages cluster resources and schedules tasks in distributed environments (commonly YARN).
- Result Storage – Persists output (e.g., Redis, ClickHouse).
- Monitoring & Visualization – Tracks key metrics to quickly detect and resolve anomalies.
Real-Time Processing Frameworks
- Spark Structured Streaming – Built on Spark’s in-memory engine, it executes streams as a series of micro-batches, achieving second-level latency (see the windowed-count sketch after this list).
- Storm – A standalone distributed computation framework that organizes work as topologies. It is not part of the Hadoop ecosystem, and its ecosystem growth has slowed.
- Flink – A native stream processing engine designed for real-time workloads, supporting event time, stateful computation, and windowing, with millisecond-level latency.
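As a small PySpark sketch of the micro-batch model, the job below counts Kafka records per one-minute window; the broker address and the topic name `events` are hypothetical, and it assumes the Spark Kafka connector package is available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("StreamingCounts").getOrCreate()

# Read an unbounded stream from a hypothetical Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# The Kafka source exposes a 'timestamp' column; count records per 1-minute tumbling window.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

# Each micro-batch refreshes the result; print it to the console for illustration.
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```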
Application Scenarios
Real-time computing has many use cases, driving its rapid adoption:
- Real-Time Recommendation – Personalizes recommendations instantly based on a user’s current behavior, improving accuracy and relevance.
- Fraud Detection & Risk Control – Monitors sensitive behaviors or transactions in real time to detect anomalies early and minimize potential losses.
- Real-Time Analytics – Tracks metrics such as sales volume or online user counts in real time, enabling instant decision-making and maximizing data value.
Limitations & Challenges
While mature in many respects, real-time computing still faces challenges:
- Out-of-Order Data – Network delays can cause events to arrive out of sequence, requiring mechanisms such as event time and watermarks to produce correct results despite late data (see the watermark sketch after this list).
- Complex State Management – Long-running jobs must maintain state over time and restore it during recovery to ensure accuracy.
- System Complexity – Real-time pipelines are often long and latency-sensitive, increasing operational complexity.
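As one common way to handle out-of-order events, Structured Streaming’s watermark API declares how late data may arrive before a window is finalized. The sketch below uses the built-in `rate` source (which itself emits in-order rows) purely to keep the example self-contained; in practice the input would be a real stream with late arrivals.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("WatermarkExample").getOrCreate()

# The built-in 'rate' source emits rows with 'timestamp' and 'value' columns.
events = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Accept events up to 5 minutes late; older windows are finalized and their state dropped.
counts = (
    events.withWatermark("timestamp", "5 minutes")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

# 'append' mode emits each window once, after the watermark passes its end.
query = (
    counts.writeStream.outputMode("append")
    .format("console")
    .start()
)
query.awaitTermination()
```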
Conclusion
Real-time processing continuously analyzes data streams with low latency, making it an essential complement to batch processing.
It plays an irreplaceable role in latency-sensitive domains such as real-time recommendation and fraud detection.
In the next article, we will compare batch processing and real-time processing to deepen understanding of their respective strengths and trade-offs.