What is HDFS?

About 3117 wordsAbout 10 min

Distributed Storage Big Data Distributed

2025-03-05

If you are also a big data worker, you must have heard of the distributed file system: . This article will introduce , and elaborate on its important concepts, architecture, storage principles, and reading and writing processes. If you want to understand this system from scratch, you can read this article.

Introduction

In order to solve the problem of efficient storage under large amounts of data, Google developed a distributed file system: , which realizes the distributed storage of files in multiple states and on the network. The Hadoop distributed file system: is an open source implementation of , and together with MapReduce, it becomes the two core components of Haddop. In general, achieves the following goals:

Compatible with cheap hardware devices. DFS regards the failure of hardware nodes as "normal", designs a mechanism for rapid detection of emergency failures and automatic recovery, and also opens up continuous monitoring, fault-tolerant processing, etc. It can achieve data integrity even in the case of hardware errors.
Streaming data reading and writing.
Large data sets.
Simple file model. uses a simple file model of "write once, read multiple times". Once a file is written and closed, it can only be read.
Strong cross-platform compatibility. is implemented in Java language and has good cross-platform compatibility. Any machine that supports JVM can run .

However, also has certain limitations:

Not suitable for low-latency access. mainly faces large-scale batch processing of data and uses streaming data reading. It has a high throughput rate, but it also means high latency.
Unable to efficiently store a large number of small files. has a large block, and if the file size is less than one block, it cannot be stored efficiently. The specific reasons are discussed in detail later in the article.
Does not support multi-user writing and arbitrary modification of files. only allows one file writer and does not allow multiple users to write to the same file. And it cannot perform random write operations, only append.

There are several very key concepts in , including: blocks, name nodes, data nodes, and second name nodes.

Blocks

Like traditional file systems, divides files into different blocks, and the default size of a block is 64 MB. It can be seen that compared to the several thousand bytes of traditional file systems, blocks are much larger. The advantage of doing so is that it can minimize addressing overhead. Benefits of using the abstract concept of blocks:

Support large-scale file storage. Large-scale files are divided into different blocks and can be stored on different machines separately, without being limited by the local storage capacity of the node.
Simplify system design. First, because the size of the block is fixed, storage management is simplified. Second, metadata and file libraries do not need to be stored together, which facilitates metadata management.
Suitable for data backup.

Name Node and Data Node

There are two types of nodes in : Name Node and Data Node.

The name node is responsible for managing the namespace of the distributed file system and saves two core data structures: FsImage and EditLog. Among them, FsImage is used to maintain the metadata of all files and folders in the file tree and the file tree. EditLog records all operations such as creation, deletion, and renaming of files.

Data nodes are working nodes, responsible for data storage and reading. They store and retrieve data according to the scheduling of the client or name node, and regularly send a list of their stored blocks to the name node. The data of each data node will be saved in the local Linux file system of each node.

How do FsImage and EditLog work?

When the name node starts, it will load the contents of FsImage into memory, and then execute all operations in EditLog to keep all metadata in memory up to date. After completing all this, the name node will create a new FsImage file and an empty EditLog file. At this time, the name node starts successfully and enters normal working state.

After that, all update operations will be written to EditLog. The reason for not writing to FsImage is that the size of FsImage is usually large ( usually more than GB ) . If all update operations are written to FsImage, the update operation will be very slow. The EditLog is usually very small and can be written efficiently.

We call the process of loading FsImage and EditLog during the startup of the name node "safe mode", which can only provide read operations to the outside world, but not write operations. After the startup is completed, exit the "safe mode" and enter the normal operation state to provide read and write operations to the outside world.

Second Name Node

The role of FsImage and EditLog has been explained above. Obviously, as the name node continues to run, the EditLog will become larger and larger, making the writing process extremely slow. Not only that, when the node restarts, all operations in the EditLog need to be re-executed. If the EditLog is too large, the name node will be in "safe mode" for too long, and it will not be able to provide write operations to the outside world normally, affecting the normal use of users.

In order to effectively solve a series of problems caused by the large EditLog, uses the Secondary NameNode in its design. It has two functions:

Complete the merge of EditLog and FsImage. Reduce the size of EditLog and shorten the restart time of the name node.
As the "checkpoint" of the name node, save the metadata information of the name node.

Now we explain these two functions one by one:

Merge of EditLog and FsImage

Every once in a while, the second name node will communicate with the name node and request it to stop the EditLog file ( assuming that it is $t_1$ at this moment ) . At this time, the name node will temporarily add the newly arrived write operation to the new file EditLog.new.
The second name node pulls the FsImage file and EditLog file in the name node back to the local, loads them into the memory and merges the two files in the memory, that is, executes EditLog one by one in the memory to make FsImage in the latest state.
After the merge is completed, the second name node will send the merged FsImage file back to the name node.
The name node will replace the original old FsImage file with the new FsImage, and replace EditLog with EditLog.new( assuming that it is $t_2$ at this moment ) .

Through the above operations, we have effectively reduced the size of EditLog.

Checkpoint as Name Node

From the above merge operation, we can know that the second name node will communicate with the name node regularly and pull EditLog and FsImage files. So we can regard the second name node as the "checkpoint" of the name node, and periodically back up the metadata information in the name node. When the name node fails, we can use the metadata in the second name node to recover the system.

But obviously, the new FsImage file obtained by the merge operation on the second name node does not contain the update operations that occurred between $t_1$ and $t_2$ , so if the name node fails between $t_1$ and $t_2$ , the system will lose some metadata information. In , the system does not support direct switching to the second name node, so the second name node is just a checkpoint and will not achieve the effect of hot backup. The metadata of the name node is still at risk of data loss.

Architecture

In this section, we will briefly introduce The architecture of is introduced first, followed by namespace management, communication protocols, and clients. Finally, the limitations of the system are discussed.

Overview

adopts a master-slave structure model. An cluster contains one name node ( one and only one ) and several data nodes.

The name node is the central server responsible for managing the namespace of the file system and client access to files. A data node is generally a node running a data node process, responsible for the read and write requests of the file system client, and performs operations such as creation, deletion, and replication of data blocks under the unified scheduling of the name node. The data of each data node is actually stored in its local Linux file system. The data node will send a heartbeat to the name node to report its status. The data node that does not send a heartbeat on time will be marked as "downtime" and will no longer be assigned any I/O requests.

Inside the system, files will be divided into several data blocks, which will be distributed to several data nodes. When a client needs to access a file, it first sends the file name to the name node. The name node will find all the corresponding data blocks based on the file name, and then find the data nodes that actually store these data blocks based on the information of each data block, and return the data node location to the client. Finally, the client directly accesses these data nodes to obtain data. As you can see, the name node does not directly participate in the data transmission in this process.

Namespace management

The namespace of includes: directories, files, and blocks. Namespace management means that the namespace supports basic operations such as creating, modifying, and deleting directories, files, and blocks in similar to file systems.

In the current architecture, == there is only one namespace in the entire cluster, and there is only one name node ==, which is responsible for managing this namespace.

Limitations

Currently, has not implemented functions such as disk quotas and file access permissions, nor does it support hard links and soft links for files.

Communication Protocol

The communication protocols of are all built on the TCP/IP protocol. The client initiates a TCP link to the name node through a configurable port and communicates with the name node using the client protocol. The name node and data node communicate using the data node protocol. The client and data node use to implement.

Client

The client is the most common way for users to operate , but strictly speaking, the client is not a part of . It can support common operations such as opening, reading, and writing, and can also use shell commands to access data.

Limitations

has only one name node, which simplifies the system design while bringing some problems:

Namespace limitation. The name node is stored in memory and is limited by hardware memory.
Performance bottleneck. The throughput of the entire distributed system is limited by the throughput of a single name node.
Isolation problem. Since there is only one name node and one namespace in the cluster, different programs cannot be isolated.
Cluster availability. Once the name node fails, the entire cluster will become unavailable.

Storage principle

Redundant storage of data

In order to ensure the fault tolerance and availability of the system, uses multiple copies to store data redundantly. Usually, multiple copies of a data block will be distributed to different data nodes. This multi-copy method has the following advantages:

Speed up data transmission. When multiple clients need to access the same file, clients can read data from different copies.
Easy to check data errors.
Ensure data reliability.

Data access strategy

The data access strategy is the core content of the distributed file system, which will greatly affect the read and write performance of the entire distributed file system.

Data storage

uses a rack-based data storage strategy.

The difference between nodes in the same rack and different racks

Data communication between different racks needs to go through switches or routers, while communication between different machines in the same rack does not need to go through switches and routers.

This means that the communication bandwidth between different machines in the same rack is larger than that between different racks.

The default policy of is that each data node is on a different rack. This has both advantages and disadvantages:

Advantages

High reliability. If a rack fails, the replicas on other racks can be used.
High reading speed. Multiple replicas can be read in parallel.
Easier to implement load balancing and error handling.

Disadvantages

The bandwidth within the same rack cannot be fully utilized when writing data.
The default replication factor of is 3. Each file block will be copied to three places at the same time, two of which are placed on different machines in the same rack, and the third copy is placed on a machine in a different rack.
Data reading.
provides an API to determine the rack ID to which a data node belongs. The client can call the API to obtain the rack ID to which it belongs. When the client reads data, it obtains the storage location list of different copies of the data block from the name node, and can call the API to determine the rack ID to which the client and these data nodes belong. If they are in the same rack, the copy can be read first.
Data replication.
data replication adopts a pipeline replication strategy. When the client writes a file:
1. First, the file is written locally and divided into different data blocks.
2. Each block initiates a write request to the name node.
3. The name node selects a data node list based on the usage of the data node and returns it to the client.
4. The client first writes the data to the first data node in the list and passes the list to the first data node.
5. When the first data node receives 4 KB of data, it writes it locally and initiates a connection request to the second data node in the list, passing the 4 KB of data it has received to the second data node.
6. The second data node also follows this and requests the third data node when it receives 4 KB, forming a pipeline.
When the file is written, the data replication is also completed at the same time.

Data Error and Recovery

has a very high fault tolerance, making it compatible with cheap hardware devices. regards hardware device errors as a normal state rather than an exception, and has designed corresponding mechanisms to detect data errors and perform automatic recovery. It is divided into the following three situations:

Name node error.
The name node stores all metadata information, including the most core EditLog and FsImage. If it is damaged, the entire instance will fail.
Hadoop uses two mechanisms to ensure the security of the name node:
1. Synchronize the metadata information of the name node to other file systems. For example, remotely mount NFS.
2. Run a second name node.
Data node error.
Each data node will periodically send "heartbeat" information to the name node to report its status to the name node. When a data node fails or the network is disconnected, the name node will mark the data node that cannot receive the "heartbeat" as "downtime", and all data on the node will be marked as "unreadable". The name node will no longer send any I/O requests.
After marking "downtime", the number of copies of some data blocks will be less than the redundancy factor. At this time, the name node will regularly detect this situation, and once it is found, it will start data redundancy replication.
Data error.
Network transmission and disk can cause data errors. After reading the data, the client will use md5 and shal to verify the data to ensure that the correct data is read. The information verified here is written by the client to the hidden folder of the same path when the file is created.
When the client verification fails, the client will request a copy of the data block from other data nodes and report the data block error to the name node. The name node will periodically check and re-copy the data block.

References for this article

《Principles and Applications of Big Data Technology ( 2nd Edition ) 》

Contributors

dingyuqi

Changelog

9/16/25, 12:04 PM

View All Changelog

d7626-fix(docs): mermaid grammaron 9/16/25
eb6eb-improve(docs): use pangu formatteron 8/5/25
f0b05-feat(docs): add a new english articleon 3/5/25

Copyright

License under:Attribution-NonCommercial-NoDerivatives 4.0 International (CC-BY-NC-ND-4.0)