
Big Data Storage: HDFS

Discover HDFS - Hadoop's distributed file system for big data storage. Learn its master-slave architecture, block storage design (128MB chunks), and 3-way replication for fault tolerance. Understand HDFS strengths (scalability, batch processing) and limitations (small file inefficiency). Essential reading for data engineers working with large datasets in Hadoop ecosystems.

2025-09-09

In the previous article, [Deconstructing Big Data: Storage, Computing, and Querying](https://xx/Deconstructing Big Data:Storage, Computing, and Querying), big data technology was broken down into three components: storage, computing, and querying.
Of the three, storage is the foundation of big data: it must be distributed, scalable, and fault tolerant in order to provide reliable large-scale data storage services.

HDFS (Hadoop Distributed File System) is a core component of Hadoop and the distributed storage backbone of its ecosystem, making it a representative example of big data distributed storage.
This article focuses on HDFS, offering an in-depth look at its design principles, architecture, data storage, performance optimizations, and limitations & challenges to reveal its working mechanisms and technical features.

Design Principles

HDFS was designed primarily to meet the needs of large-scale data storage in Hadoop: it solves the challenges of reliable persistence and efficient access, and serves as the data layer for MapReduce.
Its design follows several core principles:

- Hardware failure is the norm: on clusters of commodity machines, the system must detect faults and recover automatically.
- Streaming, batch-oriented access: HDFS is optimized for high sustained throughput on large sequential reads rather than low-latency random access.
- Large files: it targets files that range from hundreds of megabytes to terabytes.
- Write-once, read-many: files are written sequentially, closed, and then read repeatedly; in-place updates are not supported.
- Move computation to the data: scheduling computation near the blocks it needs is cheaper than moving the data across the network.
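To make the write-once, read-many model concrete, here is a minimal Java sketch using Hadoop's `FileSystem` API; the NameNode URI (`hdfs://namenode:9000`) and the file path are placeholders chosen for illustration:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster address; replace with your NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/events/2025-09-09.log");

        // Write once: the file is written sequentially and then closed.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("event-1\nevent-2\n");
        }

        // Read many: clients stream the file sequentially, which matches
        // HDFS's batch-oriented access pattern.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
    }
}
```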

Architecture

HDFS adopts a master–slave architecture:

- NameNode (master): maintains the file system namespace and metadata, including the directory tree, the file-to-block mapping, and block locations, and coordinates client access.
- DataNodes (slaves): store the actual data blocks on local disks, serve read and write requests, and report their state to the NameNode through periodic heartbeats and block reports.

As the number of stored files grows, so do the namespace metadata and the edit log that records every namespace change, putting increasing pressure on the NameNode.
To relieve part of this burden, a Secondary NameNode was introduced: it periodically merges the NameNode's namespace snapshot (fsimage) with the accumulated edit log to produce a fresh checkpoint, preventing unbounded edit-log growth and shortening NameNode restart times.

However, the Secondary NameNode is not a hot standby. If the NameNode fails, the entire cluster becomes unavailable because the Secondary NameNode cannot take over its services.
To improve availability, HDFS now supports an active–standby NameNode configuration: a standby NameNode keeps its namespace state synchronized with the active one, so that if the active NameNode fails, the standby can take over immediately and continue serving requests.
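As an illustration of how clients address an active–standby pair, the sketch below sets the standard HA client properties programmatically; the nameservice ID `mycluster` and the hostnames are placeholders, and in practice these values normally live in `hdfs-site.xml` rather than in code:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsHaClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice ID; clients address the cluster by this name,
        // not by a specific NameNode host ("mycluster" is a placeholder).
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Proxy provider that retries against the other NameNode on failover.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The client resolves the currently active NameNode transparently.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("Connected, home dir: " + fs.getHomeDirectory());
    }
}
```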

Data Storage

In HDFS, data is stored on DataNodes in the form of blocks. The storage process includes:

- Splitting: a file is divided into fixed-size blocks (128 MB by default); the final block may be smaller.
- Replication: each block is copied to multiple DataNodes (3 replicas by default), with replicas spread across racks to tolerate node and rack failures.
- Pipelined writes: the client streams each block to the first DataNode, which forwards it to the next one down the replication pipeline.
- Bookkeeping: the NameNode records which DataNodes hold each block and triggers re-replication when a DataNode is lost.
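The block and replica layout described above can be inspected directly from a client. The following sketch (with a hypothetical file path and NameNode URI) prints a file's block size, replication factor, and the DataNodes holding each block's replicas:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Hypothetical path to an existing large file.
        Path file = new Path("/data/warehouse/clicks.parquet");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize());   // 128 MB by default
        System.out.println("Replication: " + status.getReplication()); // 3 by default

        // Each BlockLocation lists the DataNodes holding one block's replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```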

Performance Optimizations

To support massive data storage and retrieval, HDFS implements several optimizations:

- Large blocks: the 128 MB default block size keeps per-file metadata small and favors long sequential reads over seeks.
- Data locality: compute frameworks such as MapReduce schedule tasks on the nodes (or racks) that already hold the required blocks, avoiding network transfer.
- Rack-aware replica placement: replicas are distributed to balance fault tolerance against cross-rack write bandwidth.
- Streaming I/O: reads and writes are sequential and pipelined, sustaining high aggregate throughput across the cluster.
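As an example of tuning for throughput, block size and replication can be overridden per file at creation time; the sketch below uses a hypothetical path and a 256 MB block size purely for illustration:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TuneBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/data/archive/huge-scan.dat");

        long blockSize = 256L * 1024 * 1024; // 256 MB blocks: fewer blocks, less NameNode metadata
        short replication = 3;               // default replication factor
        int bufferSize = 4096;

        // FileSystem.create lets a client override block size and replication per file.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("large sequential payload goes here\n");
        }

        // Replication can also be adjusted later, e.g. for cold data.
        fs.setReplication(file, (short) 2);
    }
}
```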

Limitations & Challenges

While HDFS is the distributed storage foundation of the Hadoop ecosystem, it has some limitations:

- Small file inefficiency: every file, directory, and block is tracked in NameNode memory, so huge numbers of small files exhaust the namespace long before disk capacity runs out.
- High latency: HDFS is built for throughput, not for low-latency or random-access workloads.
- Restricted write model: files are write-once with append support only; in-place updates are not possible.
- Namespace scalability: a single NameNode bounds the size of the namespace, which features such as HDFS Federation only partially mitigate.
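To see why small files hurt, consider a rough back-of-the-envelope estimate: each namespace object (file or block) is commonly approximated at around 150 bytes of NameNode heap, so 100 million single-block files cost on the order of tens of gigabytes of memory regardless of how little data they contain. A small sketch of that arithmetic:

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Rough rule of thumb (an approximation, not an exact figure):
        // each file and each block costs on the order of ~150 bytes of NameNode heap.
        long files = 100_000_000L;  // 100 million small files
        long blocksPerFile = 1;     // a small file occupies a single block
        long bytesPerObject = 150;

        long objects = files + files * blocksPerFile;
        double heapGiB = objects * bytesPerObject / (1024.0 * 1024 * 1024);

        // ~200 million objects * 150 B ≈ 28 GiB of heap just for metadata.
        System.out.printf("Estimated NameNode heap: %.1f GiB%n", heapGiB);
    }
}
```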

Conclusion

With its distributed, scalable, and fault-tolerant features, HDFS has become the cornerstone of big data storage.
However, it faces certain limitations. To address these, HDFS continues to evolve, adapting to new data processing needs.