In the previous article,
we broke down big data technology into three core components: storage, computing, and querying.
Among these, big data storage forms the foundation of the entire architecture. It must provide distributed, scalable, and fault-tolerant capabilities to support massive data volumes reliably.
HDFS (Hadoop Distributed File System) plays exactly this role.
As the core storage component of the Hadoop ecosystem, HDFS acts as the distributed data backbone for large-scale batch processing systems.
In this article, we examine HDFS in depth by covering its design principles, architecture, data storage model, performance optimizations, and limitations, revealing how HDFS works and why it became a cornerstone of big data storage.
Design Principles of HDFS
HDFS was designed to solve the problem of reliable storage and efficient access to massive datasets. At the same time, it serves as the underlying data layer for MapReduce and other big data computing engines.
To achieve these goals, HDFS follows several clear design principles:
- Batch-Oriented Processing: HDFS optimizes for write-once, read-many workloads, which aligns well with offline analytics and batch computing.
- Large File Preference: HDFS targets large files (typically gigabytes or larger). Consequently, it does not perform well when handling large numbers of small files.
- Sequential Data Access: The system favors high-throughput sequential reads and writes. In contrast, it performs poorly for frequent random access.
- High Fault Tolerance: HDFS replicates each block multiple times and distributes the replicas across different nodes and racks, which significantly reduces the risk of data loss caused by node or rack failures.
- Horizontal Scalability: HDFS adopts a master–slave architecture. Therefore, engineers can expand storage capacity simply by adding more DataNodes.
- Data Locality: Instead of moving large datasets across the network, HDFS moves computation closer to where the data resides, reducing network overhead and improving performance.
HDFS Architecture
HDFS uses a classic master–slave architecture, which separates metadata management from actual data storage.
Core Components
- NameNode: The NameNode acts as the master node. It manages the file system namespace and tracks metadata, such as file names, directory structures, block locations, and replication information. Importantly, the NameNode does not store actual file data.
- DataNode: DataNodes act as worker nodes. They store file blocks on local disks and periodically send heartbeats and block reports to the NameNode.
- Client: Client applications interact with HDFS through APIs or CLI commands. They communicate with the NameNode for metadata operations and directly with DataNodes for data transfer (a minimal client sketch follows this list).
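To make the client role concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem Java API. The NameNode address (hdfs://namenode-host:8020) and the /data path are placeholders, and the code assumes the Hadoop client libraries are on the classpath. Listing a directory is a pure metadata operation served by the NameNode; actual file contents would stream directly between the client and DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Placeholder NameNode endpoint; replace with a real cluster address.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf)) {
            // listStatus is a metadata-only call answered by the NameNode.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.printf("%s\t%d bytes\treplication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
        }
    }
}
```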
High Availability Evolution
As data volume grows, metadata stored in the NameNode also increases. This growth can create a single point of failure and a scalability bottleneck.
Initially, Hadoop introduced the Secondary NameNode to periodically merge the namespace image (fsimage) with the edit log into a fresh checkpoint.
However, the Secondary NameNode does not serve as a hot standby. Therefore, NameNode failure would still bring down the entire cluster.
To solve this issue, modern HDFS supports Active–Standby NameNode configurations. In this setup:
- One NameNode runs in active mode
- Another runs in standby mode
- If the active NameNode fails, the standby takes over automatically (failover is typically coordinated through ZooKeeper)
As a result, HDFS achieves true high availability.
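As an illustration, the sketch below shows how a client can be pointed at an HA nameservice rather than a single NameNode host. The nameservice name (mycluster), the NameNode IDs (nn1, nn2), and the host names are placeholders; in practice these properties usually live in hdfs-site.xml rather than being set in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Address the cluster through a logical nameservice instead of one NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1-host:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2-host:8020");
        // The client-side proxy provider figures out which NameNode is active
        // and retries against the other one during a failover.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
            System.out.println("/ exists: " + fs.exists(new Path("/")));
        }
    }
}
```

Because application code only sees the logical URI hdfs://mycluster, a failover from nn1 to nn2 is transparent to it.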
HDFS Data Storage Model
HDFS stores data on DataNodes using a block-based storage model.
Key Storage Mechanisms
- Block Splitting: HDFS divides large files into fixed-size blocks (128 MB by default) and distributes them across different DataNodes.
- Replica Placement Strategy: The default replication factor is 3. Typically:
  - Two replicas reside on different nodes within the same rack
  - One replica resides on a node in a different rack
  This design balances reliability and network efficiency.
- Write Workflow: The client requests a write → the NameNode returns a list of target DataNodes → the client streams data to the first DataNode → the data is pipelined to the remaining replicas.
- Read Workflow: The client requests block metadata from the NameNode → receives block locations → reads data from the nearest DataNode. (A short end-to-end sketch follows this list.)
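The sketch below walks through both workflows with the Java API: it writes a small file, asks the NameNode where its blocks landed, and reads the data back. The NameNode address and file path are placeholders. The same pattern applies to large files, except that, for example, a 1 GB file would be split into eight 128 MB blocks, each replicated three times by default.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/tmp/hdfs-demo.txt"); // placeholder path

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf)) {
            // Write: the NameNode picks target DataNodes; bytes are pipelined to the replicas.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Inspect block placement: which DataNodes hold each block.
            FileStatus status = fs.getFileStatus(file);
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        loc.getOffset(), loc.getLength(), Arrays.toString(loc.getHosts()));
            }

            // Read: the client fetches block locations, then streams from a nearby DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[(int) status.getLen()];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8));
            }
        }
    }
}
```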
HDFS Performance Optimizations
To handle massive data volumes efficiently, HDFS implements several performance optimizations:
- Data Locality Optimization: Compute tasks run on nodes that already store the required data, which minimizes network transfer (a toy locality sketch follows this list).
- Sequential I/O Optimization: Large block sizes and streaming access maximize disk throughput.
- Pipeline Replication: Replicas are written in a pipeline fashion, improving write efficiency.
- Short-Circuit Local Reads: When possible, applications read local data directly from disk, bypassing the DataNode network stack to reduce latency.
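To illustrate the data-locality idea, the toy sketch below queries the block locations of a file and picks, for each block, a worker that already holds a replica. This is a simplification of what schedulers such as YARN actually do; the worker names, NameNode address, and file path are all hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LocalitySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical pool of worker hosts available to a compute framework.
        Set<String> workers = new HashSet<>(Arrays.asList("worker-1", "worker-2", "worker-3"));

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf)) {
            FileStatus status = fs.getFileStatus(new Path("/data/events.log")); // placeholder file
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                String[] replicaHosts = block.getHosts();
                // Prefer a worker that already stores a replica of this block (node-local);
                // a real scheduler would fall back to rack-local before going off-rack.
                String chosen = Arrays.stream(replicaHosts)
                        .filter(workers::contains)
                        .findFirst()
                        .orElse(workers.iterator().next());
                System.out.printf("block@%d -> run task on %s (replicas on %s)%n",
                        block.getOffset(), chosen, Arrays.toString(replicaHosts));
            }
        }
    }
}
```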
Limitations and Challenges of HDFS
Despite its strengths, HDFS has notable limitations:
- Poor Small File Support: Metadata for every file, directory, and block resides in NameNode memory. Consequently, too many small files consume excessive memory while underutilizing disk space (a rough estimate follows this list).
- No Low-Latency Random Access: HDFS targets batch workloads, not real-time or interactive queries.
- NameNode Scalability Constraints: Since metadata resides in memory, scaling the NameNode remains challenging. HDFS Federation partially addresses this issue but increases architectural complexity.
- Evolving Cloud Architecture: In cloud-native environments, object storage systems increasingly replace HDFS for certain use cases, especially where separation of compute and storage is preferred.
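To see why small files hurt, here is a rough back-of-the-envelope estimate. It uses the commonly cited rule of thumb that each file, directory, or block object costs on the order of 150 bytes of NameNode heap; the exact figure varies by version and configuration, so treat the result as an order-of-magnitude illustration.

```java
public class SmallFileHeapEstimate {
    public static void main(String[] args) {
        // Rule of thumb (approximate): ~150 bytes of NameNode heap per namespace object.
        final long BYTES_PER_OBJECT = 150L;

        long files = 100_000_000L;   // 100 million small files (hypothetical workload)
        long blocksPerFile = 1L;     // each small file fits in a single block
        long objects = files + files * blocksPerFile; // one file object + one block object each

        double heapGiB = objects * BYTES_PER_OBJECT / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("~%.1f GiB of NameNode heap for %d small files%n", heapGiB, files);
        // Prints roughly 28 GiB of heap spent on metadata alone.
    }
}
```

Packing the same data into a smaller number of large, multi-block files keeps the object count, and therefore the NameNode heap, dramatically lower.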
Conclusion
With its distributed, scalable, and fault-tolerant architecture, HDFS has become the cornerstone of big data storage. It continues to power offline analytics, batch computing, and large-scale data platforms worldwide.
However, as data processing requirements evolve, HDFS also faces new challenges.
Therefore, the Hadoop ecosystem continues to adapt, integrating HDFS with modern architectures and cloud-native storage solutions.
In the next article, we will move up the stack and explore big data computing, starting with batch processing, to understand how large-scale data is transformed into valuable insights.