In the previous article, [A Closer Look at the Evolution of Databases](https://xx/A Close Look at the Evolution of Databases), we described how databases have evolved in response to growing demands and data volumes. In today’s digital era, however, data is growing exponentially, and traditional database technologies can no longer meet storage and query requirements at this scale. Big Data technologies emerged to fill that gap.
Big Data is not a single system or tool, but rather a technology ecosystem that encompasses storage, computing, and querying. These three components work together to enable effective use of massive datasets, making Big Data a key driver for social progress and enterprise innovation.
Characteristics of Big Data
With the rapid growth of mobile devices and IoT, data is generated 24/7, leading to exponential growth. Such data often has the following characteristics:
- Volume: From terabytes (TB) to petabytes (PB), and even exabytes (EB).
- Variety: A mix of structured, semi-structured, and unstructured data.
- Velocity: Requires fast ingestion and real-time response with ever-lower latency.
- Cost: Storage and computing must scale at controllable hardware costs.
These characteristics exceed what traditional databases can handle and call for a new set of solutions. Below, we break those solutions down into storage, computing, and querying.
Big Data Storage
The first priority of Big Data is storage. Unlike traditional single-node databases, Big Data storage must be distributed, scalable, and fault-tolerant. The common approach is to split data into chunks and replicate each chunk across multiple machines. This supports larger datasets and keeps data available even if some nodes fail.
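The chunk-and-replicate idea can be illustrated with a small sketch. The node names, chunk size, and round-robin placement rule below are assumptions chosen for illustration; real systems use more elaborate placement policies (for example, rack awareness).

```python
# Minimal sketch: split a payload into fixed-size chunks and assign each chunk
# to several nodes. Names, sizes, and the round-robin rule are hypothetical.

def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut the payload into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each chunk to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def readable_chunks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Chunks that still have at least one replica on a healthy node."""
    return {
        chunk_id
        for chunk_id, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


data = b"x" * 1000
chunks = split_into_chunks(data, chunk_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(chunks), nodes, replication=3)

# With three replicas per chunk, losing any two nodes leaves every chunk readable.
survivors = readable_chunks(placement, failed={"node-a", "node-b"})
```

In this sketch a replication factor of 3 tolerates two simultaneous node failures, at the cost of storing every chunk three times.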
Typical storage technologies include:
- HDFS (Hadoop Distributed File System): The classic Big Data storage system, featuring redundancy and horizontal scalability, providing high throughput, and serving as a cornerstone of the ecosystem.
- Object Storage: Such as Amazon S3, Alibaba Cloud OSS, and Apache Ozone. These are mainstream choices in cloud-native environments, supporting massive unstructured data with elastic scaling; the cloud-hosted services also offer pay-as-you-go billing.
- NoSQL Databases: Such as HBase and Cassandra, mainly used for structured and semi-structured data.
In essence, the mission of Big Data storage is: store more, store longer, store reliably.
Big Data Computing
Stored data has no value until it is processed, and that value diminishes over time, so processing must be timely. Traditional single-node processing is too slow for massive datasets. Thus, the core challenge of Big Data computing is: how to process huge datasets quickly and efficiently.
Representative computing technologies include:
- Batch Processing: Represented by MapReduce and suited to historical data processing (e.g., T+1 reports). Spark later largely replaced MapReduce for these workloads by keeping intermediate data in memory where possible.
- Stream Processing: Represented by Flink, supporting real-time analytics with latencies as low as milliseconds.
- Unified Batch and Stream Processing: Combining both paradigms to serve multiple scenarios. For example, Flink unifies batch and stream, simplifying architecture and lowering maintenance costs.
Big Data computing unlocks the value of data, with the core goals being: processable, fast, and accurate.
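To make the difference between the batch and stream paradigms concrete, here is a single-process sketch. It is not how MapReduce, Spark, or Flink are implemented; the event data, function names, and the 10-second tumbling window are assumptions for illustration, and the streaming function assumes events arrive in timestamp order.

```python
from collections import defaultdict

# Hypothetical click events: (timestamp in seconds, page).
EVENTS = [(1, "home"), (3, "cart"), (4, "home"), (11, "home"), (12, "cart"), (25, "home")]


def batch_count(events):
    """MapReduce-style job over a complete, bounded dataset."""
    # Map: emit a (key, 1) pair for every record.
    mapped = [(page, 1) for _, page in events]
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each group into a single result.
    return {key: sum(values) for key, values in groups.items()}


def stream_count(events, window_seconds):
    """Streaming-style job: consume events one by one and emit a result
    whenever a tumbling window closes (assumes in-order timestamps)."""
    current_window = None
    counts = defaultdict(int)
    for ts, page in events:
        window = ts // window_seconds
        if current_window is not None and window != current_window:
            yield current_window * window_seconds, dict(counts)
            counts = defaultdict(int)
        current_window = window
        counts[page] += 1
    if current_window is not None:
        # Flush the last, still-open window when the input ends.
        yield current_window * window_seconds, dict(counts)


batch_result = batch_count(EVENTS)
stream_result = list(stream_count(EVENTS, window_seconds=10))
```

The batch job produces one answer after seeing all the data, while the streaming job produces one partial answer per window as events arrive. A unified engine exposes both behaviors through one programming model.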
Big Data Querying
While computing answers how data is processed, end users care more about how data can be accessed and used. This is where querying comes into play.
Typical Big Data query technologies include:
- SQL on Hadoop: e.g., Hive, which translates SQL into MapReduce or Spark jobs, providing database-like usability.
- Distributed Query Engines: e.g., Presto, Trino, Impala—supporting low-latency, interactive queries, commonly used in BI analytics.
- MPP Databases & OLAP Engines: e.g., ClickHouse, Doris, Kylin—suitable for multidimensional analysis and real-time reporting.
- Lakehouse Querying: Open table formats on data lake storage (e.g., Iceberg, Delta Lake), queried through engines such as Presto or Spark, supporting both real-time queries and offline analysis.
If storage is the foundation and computing is the engine, then querying is the user-facing window. Its mission is: usable, fast, and convenient.
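From the user's side, querying usually means writing declarative SQL and letting the engine decide how to execute it. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is a single-node embedded database, not a Big Data engine, and the `orders` table and its rows are made up for illustration. In a distributed engine, a query of the same shape would be planned into tasks that run across many machines.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a distributed SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("east", "book", 12.0),
        ("east", "pen", 3.0),
        ("west", "book", 20.0),
        ("west", "book", 5.0),
    ],
)

# The user states *what* they want; the engine decides *how* to compute it.
rows = conn.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```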
Storage, Computing, and Querying
These three pillars form a tightly integrated Big Data ecosystem:
- Storage is the foundation
  Without scalable and reliable storage, there’s no data to compute or query. Moreover, storage design directly impacts performance in computing and querying.
- Computing is the bridge
  Computing transforms raw data into valuable assets, enabling efficient querying.
- Querying is the window
  Querying makes data accessible and actionable, ensuring that the results of storage and computation can be applied in practice.
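One common way storage design affects query performance is partitioning: if data is laid out by a key that queries filter on, the engine can skip whole partitions instead of scanning everything. The sketch below assumes a hypothetical layout partitioned by date; the partition names, records, and function are illustrative only.

```python
# Hypothetical layout: records grouped into partitions keyed by ISO date.
PARTITIONS = {
    "2024-01-01": [{"user": "a", "amount": 10}, {"user": "b", "amount": 5}],
    "2024-01-02": [{"user": "a", "amount": 7}],
    "2024-01-03": [{"user": "c", "amount": 2}, {"user": "a", "amount": 1}],
}


def total_amount(partitions, start_date, end_date):
    """Sum amounts in a date range, reading only the partitions that can match."""
    scanned = []
    total = 0
    for date, records in partitions.items():
        # ISO dates compare correctly as strings, so this prunes by partition key.
        if not (start_date <= date <= end_date):
            continue  # partition skipped without reading its records
        scanned.append(date)
        total += sum(record["amount"] for record in records)
    return total, scanned


total, scanned = total_amount(PARTITIONS, "2024-01-02", "2024-01-03")
```

Here the query touches two of the three partitions. On real datasets with thousands of partitions, the same idea can reduce the data read by orders of magnitude, provided the partition key matches how the data is actually queried.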
Conclusion
This article deconstructed Big Data into three major components: storage, computing, and querying. Storage ensures reliable preservation of massive datasets, computing extracts value from them, and querying delivers that value directly to users. Together, they form a robust ecosystem that empowers Big Data technologies to play a vital role across industries.