
HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179

Open
Wants to merge 1 commit into base: trunk

Conversation

Contributor

@KevinWikant KevinWikant commented Nov 21, 2024

Problem

Problem background:

  • A datanode should only enter the decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes.
  • This expectation is violated for Under Construction blocks, which are not considered by the DatanodeAdminMonitor at all.
  • The DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor StorageInfos. This is a problem because:
    • For a new HDFS block that was just created, the block is not added to the StorageInfos until the HDFS client closes the DFSOutputStream & the block becomes finalized.
    • For an existing HDFS block that was opened for append:
      • First, the block version with the previous generation stamp is marked stale & removed from the StorageInfos.
      • Next, the block version with the new generation stamp is not added to the StorageInfos until the HDFS client closes the DFSOutputStream & the block becomes finalized.

There is logic in the DatanodeAdminManager/DatanodeAdminMonitor that is intended to avoid transitioning datanodes to decommissioned state when they have open (i.e. Under Construction) blocks.

This logic does not work correctly because, as mentioned above, the DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor StorageInfos, which do not include Under Construction blocks whose DFSOutputStream has not yet been closed.
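
To make the gap concrete, here is a deliberately simplified, self-contained sketch. This is plain Java, not the actual DatanodeAdminMonitor/DatanodeDescriptor code; all class and method names here are illustrative only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Simplified model of the gap: the decommission scan only visits blocks that
 * have been attached to a datanode's storages, and a block is only attached
 * once it is finalized, so an under-construction replica is invisible to it.
 */
public class DecommissionScanSketch {

  /** Stand-in for a DatanodeDescriptor and its StorageInfos. */
  static class Datanode {
    // storageId -> block ids attached to that storage
    final Map<String, Set<Long>> finalizedBlocksByStorage = new HashMap<>();

    /** Only invoked once a replica is finalized (DFSOutputStream closed). */
    void addFinalizedBlock(String storageId, long blockId) {
      finalizedBlocksByStorage
          .computeIfAbsent(storageId, s -> new HashSet<>())
          .add(blockId);
    }
  }

  /** Stub for the real per-block replication check; not the point here. */
  static boolean isSufficientlyReplicated(long blockId) {
    return true;
  }

  /** Mirrors the scan used to decide whether decommissioning can complete. */
  static boolean canCompleteDecommission(Datanode dn) {
    for (Set<Long> blocks : dn.finalizedBlocksByStorage.values()) {
      for (long blockId : blocks) {
        if (!isSufficientlyReplicated(blockId)) {
          return false;
        }
      }
    }
    // An under-construction replica was never attached to any storage, so it
    // is never visited above and cannot hold back the decommission.
    return true;
  }

  public static void main(String[] args) {
    Datanode dn = new Datanode();
    // A client is still writing a block to this datanode, but nothing has
    // been added to the storages yet, so the scan sees an "empty" datanode.
    System.out.println(canCompleteDecommission(dn)); // prints true
  }
}
```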

There is also logic in the HDFS DataStreamer client which will replace bad/dead datanodes in the block write pipeline; the client-side settings that govern this behaviour are sketched after the list below. Note that:

  • this logic does not work if the replication factor is 1
  • even if the replication factor is greater than 1, this logic does not work if all the datanodes in the block write pipeline are decommissioned/terminated at around the same time
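
For reference, the pipeline-replacement behaviour mentioned above is controlled by the client-side dfs.client.block.write.replace-datanode-on-failure.* settings. A minimal sketch of tightening them follows; the values shown are illustrative, and (as noted above) no client-side setting can help when the replication factor is 1.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class PipelineReplacementConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask the DataStreamer to add a replacement datanode when one in the
    // write pipeline fails, rather than continuing with fewer replicas.
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.enable", true);

    // ALWAYS is stricter than the DEFAULT policy, which (roughly) only
    // replaces datanodes for blocks with replication factor >= 3.
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");

    // Fail the write instead of quietly continuing when no replacement
    // datanode can be found.
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.best-effort", false);

    FileSystem fs = FileSystem.get(conf);
    // ... writes performed through fs now use the stricter pipeline policy ...
    fs.close();
  }
}
```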

Overall, the Namenode should not put datanodes with open blocks into decommissioned state & simply hope that the DataStreamer client is able to replace them when the decommissioned datanodes are terminated. Whether this works depends entirely on timing, so it is not a solution which guarantees correctness.

The Namenode needs to honor the rule that "a datanode should only enter decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes", even for blocks which are currently Under Construction.

Potential Solutions

One possible opinion is that if the DFSOutputStream has not been successfully closed yet, then the client should be able to replay all the data if there is a failure; the client should not have any expectation that the data is committed to HDFS until the DFSOutputStream is closed. There are a few reasons I do not think this makes sense:

  • The methods hflush/hsync do not result in the data already appended to the DFSOutputStream being persisted/finalized. This is confusing when compared to the standard behaviour of stream flush/sync methods (see the sketch after this list).
  • This does not handle the case where a block is re-opened by a new DFSOutputStream after having been previously closed (by another, different client). In this case, the problem leads to data loss for data that was previously committed by another client & cannot be replayed.
    • To solve this problem, we could try not removing the old block version from the StorageInfos when a new block version is created; however, this change is likely to have wider implications for block management.
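
To illustrate the first point, a minimal sketch of the client-side behaviour being described (standard HDFS client API; the path and write size are made up): hflush/hsync make the data visible/durable on the current pipeline, but the block stays Under Construction and is only finalized, reported by the datanodes, and added to the StorageInfos when the stream is closed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/uc-block-example");

    FSDataOutputStream out = fs.create(file);
    out.write(new byte[64 * 1024]);

    // hflush: new readers can see the data; hsync additionally forces it to
    // disk on the datanodes. Neither finalizes the block, so on the Namenode
    // the last block is still Under Construction and absent from the
    // DatanodeDescriptor StorageInfos.
    out.hflush();
    out.hsync();

    // Only close() completes the block: it is then finalized and added to the
    // StorageInfos that the decommission monitor inspects.
    out.close();
    fs.close();
  }
}
```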

Another possible option is to add blocks to the StorageInfos before they are finalized. However, this change is also likely to have wider implications for block management.

Without modifying any existing block management logic, we can add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.

Solution

Add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.

Pros:

  • works for newly created HDFS blocks
  • works for re-opened HDFS blocks (i.e. opened for append)
  • works for blocks with any replication factor
  • does not change logic in BlockManager; the new UnderConstructionBlocks data structure & associated logic is purely additive (a simplified sketch follows this list)
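
A minimal, self-contained sketch of the idea follows. It mirrors the method names visible in the CI report further down (addUcBlock, removeUcBlock, removeAllUcBlocksForDatanode), but uses plain strings and longs instead of DatanodeDescriptor/Block and omits generation stamps, logging, and the real locking, so it is an illustration of the approach rather than the patch itself.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Simplified model of an UnderConstructionBlocks-style tracker: block
 * replicas are added when they are reported as under construction and
 * removed once they are committed/finalized (or the datanode is removed).
 */
public class UnderConstructionBlocksSketch {
  // datanode -> block ids of replicas that are still under construction
  private final Map<String, Set<Long>> ucBlocksByDatanode = new HashMap<>();

  /** Called when a replica starts being written on the given datanode. */
  public synchronized void addUcBlock(String datanode, long blockId) {
    ucBlocksByDatanode
        .computeIfAbsent(datanode, dn -> new HashSet<>())
        .add(blockId);
  }

  /** Called when the replica is committed/finalized (stream closed or lease recovered). */
  public synchronized void removeUcBlock(String datanode, long blockId) {
    Set<Long> blocks = ucBlocksByDatanode.get(datanode);
    if (blocks != null) {
      blocks.remove(blockId);
      if (blocks.isEmpty()) {
        ucBlocksByDatanode.remove(datanode);
      }
    }
  }

  /** Called when a datanode is removed from the cluster entirely. */
  public synchronized void removeAllUcBlocksForDatanode(String datanode) {
    ucBlocksByDatanode.remove(datanode);
  }

  /**
   * The decommission monitor would consult this in addition to the
   * StorageInfos: a datanode that still has open replicas is not allowed to
   * finish decommissioning yet.
   */
  public synchronized Set<Long> getUcBlocks(String datanode) {
    return Collections.unmodifiableSet(
        ucBlocksByDatanode.getOrDefault(datanode, Collections.emptySet()));
  }
}
```

Because the structure is only populated and consulted alongside the existing decommissioning path, leaving the feature disabled simply means it is never filled in, which is what keeps the change purely additive.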

Implementation Details

  • The feature is behind a configuration "dfs.namenode.decommission.track.underconstructionblocks" which is disabled by default (see the configuration sketch after this list).

    • When enabled, the feature prevents HDFS data loss & write failures at the cost of potentially slowing down the decommissioning process.
    • Customers who do not use the HDFS decommissioning feature can choose to leave this feature disabled, as they would only pay the additional CPU/memory overhead of UnderConstructionBlocks without gaining any benefit.
    • Customers who do use the HDFS decommissioning feature but who do not care about HDFS data loss & write failures can also choose to leave this feature disabled.
    • The main use-case for enabling the feature is HDFS clusters with many concurrent write operations & datanode decommissioning operations. These clusters will see the benefit of reduced HDFS data loss & write failures caused by decommissioning.
  • In the regular case, when a DFSOutputStream is closed it takes 1-2 seconds for the block replicas to be removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, datanode decommissioning is only blocked until the DFSOutputStream is closed & the write operation is finished; after that there is minimal delay in unblocking decommissioning.

  • In the unhappy case, when an HDFS client fails & the DFSOutputStream is never closed, it takes dfs.namenode.lease-hard-limit-sec (20 minutes by default) before the lease expires & the Namenode recovers the block. As part of block recovery, the block replicas are removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, if an HDFS client fails, it will (by default) take 20 minutes before decommissioning becomes unblocked.

  • The UnderConstructionBlocks data structure is in-memory only; therefore, if the Namenode is restarted, it will lose track of any previously reported Under Construction blocks. This means that datanodes can be decommissioned while holding Under Construction blocks if the Namenode is restarted (which makes HDFS data loss & write failures possible again).

  • Testing shows that UnderConstructionBlocks should not leak any Under Construction blocks (i.e. blocks which are never removed). However, as a safeguard to monitor for this issue, any block replica open for over 2 hours will have a WARN log printed by the Namenode every 30 minutes which mentions how long the block has been open.

  • The implementation of UnderConstructionBlocks was borrowed from existing code in PendingDataNodeMessages. PendingDataNodeMessages is already used by the BlockManager to track in-memory the block replicas which have been reported to the standby namenode out-of-order.
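
As referenced in the first bullet above, a minimal sketch of switching the feature on is shown below. The key "dfs.namenode.decommission.track.underconstructionblocks" is the one introduced by this patch and is not present in released Hadoop versions; in practice it would be set in the Namenode's hdfs-site.xml rather than programmatically.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class EnableUcTrackingSketch {
  public static void main(String[] args) {
    // Loads hdfs-default.xml / hdfs-site.xml from the classpath.
    Configuration conf = new HdfsConfiguration();

    // Feature flag added by this patch; disabled by default.
    conf.setBoolean("dfs.namenode.decommission.track.underconstructionblocks", true);

    // The existing lease hard limit (default 20 minutes) bounds how long an
    // abandoned DFSOutputStream can keep decommissioning blocked once the
    // feature is enabled.
    System.out.println(conf.get("dfs.namenode.lease-hard-limit-sec"));
  }
}
```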

How was this patch tested?

TODO - will add detailed test results

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • [n/a] If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 17m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 49m 41s trunk passed
+1 💚 compile 1m 28s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 compile 1m 16s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 checkstyle 1m 17s trunk passed
+1 💚 mvnsite 1m 27s trunk passed
+1 💚 javadoc 1m 17s trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javadoc 1m 47s trunk passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 spotbugs 3m 24s trunk passed
+1 💚 shadedclient 41m 59s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 14s the patch passed
+1 💚 compile 1m 18s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04
+1 💚 javac 1m 18s the patch passed
+1 💚 compile 1m 9s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
+1 💚 javac 1m 9s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 5s the patch passed
+1 💚 mvnsite 1m 17s the patch passed
-1 ❌ javadoc 1m 4s /patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.
+1 💚 javadoc 1m 41s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
-1 ❌ spotbugs 3m 22s /new-spotbugs-hadoop-hdfs-project_hadoop-hdfs.html hadoop-hdfs-project/hadoop-hdfs generated 3 new + 0 unchanged - 0 fixed = 3 total (was 0)
+1 💚 shadedclient 42m 26s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 255m 49s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 51s The patch does not generate ASF License warnings.
430m 34s
Reason Tests
SpotBugs module:hadoop-hdfs-project/hadoop-hdfs
Load of known null value in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeAllUcBlocksForDatanode(DatanodeDescriptor) At UnderConstructionBlocks.java:in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeAllUcBlocksForDatanode(DatanodeDescriptor) At UnderConstructionBlocks.java:[line 198]
Possible null pointer dereference of reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.addUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.addUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:[line 242]
Possible null pointer dereference of reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:reportedBlock in org.apache.hadoop.hdfs.server.blockmanagement.UnderConstructionBlocks.removeUcBlock(DatanodeDescriptor, Block) Dereferenced at UnderConstructionBlocks.java:[line 137]
Failed junit tests hadoop.tools.TestHdfsConfigFields
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/artifact/out/Dockerfile
GITHUB PR #7179
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux b4b19a546e9f 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / e0193a2
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/testReport/
Max. process+thread count 3051 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7179/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.
