HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

KevinWikant · 2024-11-21T18:23:58Z

Problem

Problem background:

A datanode should only enter decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes.
This expectation is violated for Under Construction blocks which are not considered by the DatanodeAdminMonitor at all.
DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor StorageInfos. This is because:
- For a new HDFS block that was just created, it is not be added to the StorageInfos until the HDFS client closes the DFSOutputStream & the block becomes finalized
- For an existing HDFS block that was opened for append:
  - First, the block version with the previous generation stamp is marked stale & removed from the StorageInfos
  - Next, the block version with the new generation stamp is not be added to the StorageInfos until the HDFS client closes the DFSOutputStream & the block becomes finalized

There is logic in the DatanodeAdminManager/DatanodeAdminMonitor to avoid transitioning datanodes to decommissioned state when they have open (i.e. Under Construction) blocks:

hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java

Line 357 in cd2cffe

+ ", Is Open File: " + bc.isUnderConstruction()
hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java

Line 305 in cd2cffe

if (bc.isUnderConstruction() && block.equals(bc.getLastBlock())) {

This logic does not work correctly because, as mentioned above, DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor StorageInfos which does not include Under Construction blocks for which the DFSOutputStream has not been closed yet.

There is also logic in the HDFS DataStreamer client which will replace bad/dead datanodes in the block write pipeline. Note that:

this logic does not work if the replication factor is 1
if the replication factor is greater than 1, then this logic does not work if all the datanodes in the block write pipeline are decommissioned/terminated at around the same time

Overall, the Namenode should not be putting datanodes with open blocks into decommissioned state & hope that the DataStreamer client is able to replace them when the decommissioned datanodes are terminated. This will not work depending on the timing & therefore is not a solution which guarantees correctness.

The Namenode needs to honor the rule that "a datanode should only enter decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes", even for blocks which are currently Under Construction.

Potential Solutions

One possible opinion is that if the DFSOutputStream has not been successfuly closed yet, then the client should be able to replay all the data if there is a failure. The client should not have any expectation the data is committed to HDFS until the DFSOutputStream is closed. There are a few reasons I do not think this makes sense:

The methods hflush/hsync do not result in the data already appended to the DFSOutputStream being persisted/finalized. This is confusing when compared the standard experience of stream flush/sync methods.
This does not handle the case where a block is re-opened by a new DFSOutputStream after having been previously closed (by another different client). In this case, the problem will lead to data loss for data that was previously committed by another client & cannot be replayed.
- To solve this problem, we could try not removing old block version from StorageInfos when a new block version is created; however, this change is likely to have wider implications on block management.

Another possible option that comes to mind is to add blocks to StorageInfos before they are finalized. However, this change also is likely to have wider implications on block management.

Without modifying any existing block management logic, we can add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.

Solution

Add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.

Pros:

works for newly created HDFS block
works for re-opened HDFS block (i.e. opened for append)
works for block with any replication factor
does not change logic in BlockManager, the new UnderConstructionBlocks data structure & associated logic is purely additive

Implementation Details

Feature is behind a configuration "dfs.namenode.decommission.track.underconstructionblocks" which is disabled by default.
- When enabled, feature prevents HDFS data loss & write failures at the cost of potentially slowing down the decommissioning process.
- Customer who do not use HDFS decommissioning feature can choose to leave the feature disabled as they will not benefit from the additional CPU/memory overhead consumed by UnderConstructionBlocks.
- Customers who do use HDFS decommissioning feature but who do not care about HDFS data loss & write failures can choose to leave the feature disabled.
- Main use-case for enabling the feature is for HDFS clusters with many concurrent write operations & datanode decommissioning operations. These clusters will see benefit of reduced HDFS data loss & write failures caused by decommissioning.
In the regular case, when a DFSOutputStream is closed it takes 1-2 seconds for the block replicas to be removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, datanode decommissioning is only blocked until the DFSOutputStream is closed & the write operation is finished, after this time there is minimal delay in unblocking decommissioning.
In the unhappy case, when an HDFS client fails & the DFSOutputStream is never closed, then it takes dfs.namenode.lease-hard-limit-sec = 20 minutes before the lease expires & the Namenode recovers the block. As part of block recovery, the block replicas are removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, if an HDFS client fails it will (by default) take 20 minutes before decommissioning becomes unblocked.
The UnderConstructionBlocks data structure is in-memory only & therefore if the Namenode is restarted then it will lose track of any previously reported Under Construction blocks. This means that datanodes can be decommissioned with Under Construction blocks if the Namenode is restarted (which makes HDFS data loss & write failures possible again).
Testing shows that UnderConstructionBlocks should not leak any Under Construction blocks (i.e. which are never removed). However, as a safeguard to monitor for this issue, any block replica open for over 2 hours will have a WARN log printed by the Namenode every 30 minutes which mentions how long the block has been open for.
The implementation of UnderConstructionBlocks was borrowed from existing code in PendingDataNodeMessages. PendingDataNodeMessages is already used by the BlockManager to track in-memory the block replicas which have been reported to the standby namenode out-of-order.

How was this patch tested?

TODO - will add detailed test results

For code changes:

Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
[n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
[n/a] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

…ion blocks are sufficiently replicated which causes HDFS Data Loss

hadoop-yetus · 2024-11-22T01:35:54Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	17m 35s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 2 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	49m 41s		trunk passed
+1 💚	compile	1m 28s		trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚	compile	1m 16s		trunk passed with JDK Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
+1 💚	checkstyle	1m 17s		trunk passed
+1 💚	mvnsite	1m 27s		trunk passed
+1 💚	javadoc	1m 17s		trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚	javadoc	1m 47s		trunk passed with JDK Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
+1 💚	spotbugs	3m 24s		trunk passed
+1 💚	shadedclient	41m 59s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 14s		the patch passed
+1 💚	compile	1m 18s		the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚	javac	1m 18s		the patch passed
+1 💚	compile	1m 9s		the patch passed with JDK Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
+1 💚	javac	1m 9s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	1m 5s		the patch passed
+1 💚	mvnsite	1m 17s		the patch passed
-1 ❌	javadoc	1m 4s	/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt	hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
+1 💚	javadoc	1m 41s		the patch passed with JDK Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
-1 ❌	spotbugs	3m 22s	/new-spotbugs-hadoop-hdfs-project_hadoop-hdfs.html	hadoop-hdfs-project/hadoop-hdfs generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)
+1 💚	shadedclient	42m 26s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	255m 49s	/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt	hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 51s		The patch does not generate ASF License warnings.
		430m 34s

Reason	Tests
SpotBugs	module:hadoop-hdfs-project/hadoop-hdfs
	Load of known null value in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeAllUcBlocksForDatanode(DatanodeDescriptor) At UnderConstructionBlocks.java:in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeAllUcBlocksForDatanode(DatanodeDescriptor) At UnderConstructionBlocks.java:[line 198]
	Possible null pointer dereference of reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.addUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.addUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:[line 242]
	Possible null pointer dereference of reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:[line 137]
Failed junit tests	hadoop.tools.TestHdfsConfigFields

Subsystem	Report/Notes
Docker	ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/artifact/out/Dockerfile
GITHUB PR	#7179
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux b4b19a546e9f 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `e0193a2`
Default Java	Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-ga~~us1-0ubuntu2~~20.04-ga
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/testReport/
Max. process+thread count	3051 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

HDFS-17658. HDFS decommissioning does not consider if Under Construct…

e0193a2

…ion blocks are sufficiently replicated which causes HDFS Data Loss

github-actions bot added HDFS trunk labels Nov 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

KevinWikant commented Nov 21, 2024 •

edited

Loading

hadoop-yetus commented Nov 22, 2024

HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

Are you sure you want to change the base?

HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

Conversation

KevinWikant commented Nov 21, 2024 • edited Loading

Problem

Potential Solutions

Solution

Implementation Details

How was this patch tested?

For code changes:

hadoop-yetus commented Nov 22, 2024

KevinWikant commented Nov 21, 2024 •

edited

Loading