Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEZ-4580: Slow preemption of new containers when re-use is enabled #374

Merged
merged 5 commits into from
Dec 23, 2024

Conversation

himanshu-mishra
Copy link
Contributor

When container reuse is enabled, preemption of lower priority containers that are not yet assigned to task, takes long time as they are released one at a time, and not the number of containers based when tez.am.preemption.percentage is high added in https://issues.apache.org/jira/browse/TEZ-1742.

Further investigation lead to following conclusion:

  1. Warn log / Assertion error thrown because in preemptIfNeeded(), when releasing new containers, the loop counter is being decremented with each releaseUnassignedContainers, leading to looping only half number of times. By using another counter, assertion passes because of condition method returns with check if (numPendingRequestsToService < 1) {.

  2. In releaseContainer(), the container is not getting removed from delayedContainers queue and only from heldContainers map, hence same container is being picked up for release in every iteration till next cycle of DelayedContainerManager finds out that the container is not in heldContainers and skips it with log Skipping delayed container as container is no longer running, containerId=...

This change adds a method in DelayedContainerManager to allow removal of delayed container and invokes it in releaseContainer method, which so far only removed it from heldContainers map.

@tez-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 9s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 15m 50s master passed
+1 💚 compile 0m 26s master passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 compile 0m 22s master passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 checkstyle 1m 21s master passed
+1 💚 javadoc 0m 31s master passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javadoc 0m 16s master passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+0 🆗 spotbugs 1m 20s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 19s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 15s the patch passed
+1 💚 compile 0m 17s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javac 0m 17s the patch passed
+1 💚 compile 0m 14s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 javac 0m 14s the patch passed
-0 ⚠️ checkstyle 0m 13s tez-dag: The patch generated 1 new + 146 unchanged - 0 fixed = 147 total (was 146)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 7s the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04
+1 💚 javadoc 0m 7s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~22.04-b05
+1 💚 findbugs 0m 46s the patch passed
_ Other Tests _
+1 💚 unit 4m 25s tez-dag in the patch passed.
+1 💚 asflicense 0m 15s The patch does not generate ASF License warnings.
27m 34s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/artifact/out/Dockerfile
GITHUB PR #374
JIRA Issue TEZ-4580
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 7ae9360fbab1 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 607b2bc
Default Java Private Build-1.8.0_422-8u422-b05-1~22.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu322.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~22.04-b05
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/testReport/
Max. process+thread count 201 (vs. ulimit of 5500)
modules C: tez-dag U: tez-dag
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/1/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

if (delayedContainers.remove(container)) {
LOG.info("Removed {} from delayed containers", container.getContainer().getId());
} else {
LOG.warn("Unknown container {} sent for removal. Ignoring.", container.getContainer().getId());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valid case here is when a new container is allocated - added to delayedContainers, it is polled (removed) from queue but if there is no pending task request releaseContainer() is invoked. Please suggest if the log level should be changed in info or for both the newly added logs should be at debug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we should change the log level to debug and print the only if LOG.isDebugEnabled() is true,

Similar to :
https://github.com/apache/tez/pull/374/files#diff-aaf0735de6615f8eeb8469ceadca2c3b0cdd578544a2564c37b299c55347188eR1548-R1550

Rest of the code LGTM .

Copy link
Contributor

@abstractdog abstractdog Dec 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about the log levels, let's keep the simple remove path on debug, as looks like the happy preemption path:

            LOG.debug("Removed {} from delayed containers", container.getContainer().getId());

isDebugEnabled is not necessary as long as we use the {} formatting (to prevent unnecessary string creation)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.
Thank you @abstractdog for the review ! :)

prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 14, 2024
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
prabhjyotsingh added a commit to acceldata-io/tez that referenced this pull request Nov 20, 2024
shubhluck pushed a commit to acceldata-io/tez that referenced this pull request Nov 21, 2024
@simhadri-g
Copy link
Member

simhadri-g commented Dec 10, 2024

@abstractdog can you please take a look when you are free. Thanks in advance! :)

@@ -2163,6 +2166,17 @@ void addDelayedContainer(Container container,
}
}

void removeDelayedContainer(HeldContainer container) {
if (container != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this null check is not needed, as it's called from a block

    if (delayedContainer != null) {

@abstractdog
Copy link
Contributor

thanks @himanshu-mishra for taking care of this

according to the description of TEZ-1742:

Tez YARN Task Scheduler currently preempts 1 running task at a time when a higher priority task is waiting and there are no available resources. When a large number of higher priority tasks are pending then it can take a long time to preempt the required number of lower priority tasks.

after checking and trying out the unit test and the fix here, it seems like this patch what's exactly fixes the abovementioned problem

apart from the assertion error fix, the unit test indeed proves that releaseAssignedContainer is called as many times as needed according to the percentage config in a single round, so this looks good to me

only left a minor comment regarding a null check

@himanshu-mishra
Copy link
Contributor Author

Thanks @simhadri-g , @abstractdog for review. I have incorporated feedback as following:

  1. Keep only happy path log
  2. Remove null check
  3. Change log from info to debug

Also adding a note that as the the changes fixes slow preemption, it leads to performance gain mentioned in TEZ-1742 when container re-use is enabled, which is enabled by default in Tez.

@tez-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 9s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 5m 34s master passed
+1 💚 compile 0m 22s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 compile 0m 21s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 checkstyle 0m 53s master passed
+1 💚 javadoc 0m 25s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 12s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+0 🆗 spotbugs 1m 13s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 12s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 14s the patch passed
+1 💚 compile 0m 16s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javac 0m 16s the patch passed
+1 💚 compile 0m 15s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 javac 0m 15s the patch passed
-0 ⚠️ checkstyle 0m 10s tez-dag: The patch generated 1 new + 146 unchanged - 0 fixed = 147 total (was 146)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 6s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 6s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 findbugs 0m 41s the patch passed
_ Other Tests _
+1 💚 unit 4m 14s tez-dag in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
16m 3s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/2/artifact/out/Dockerfile
GITHUB PR #374
JIRA Issue TEZ-4580
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux dd3da0311850 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 1084699
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/2/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/2/testReport/
Max. process+thread count 227 (vs. ulimit of 5500)
modules C: tez-dag U: tez-dag
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/2/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@tez-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 14s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 7m 29s master passed
+1 💚 compile 0m 42s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 compile 0m 40s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 checkstyle 0m 52s master passed
+1 💚 javadoc 0m 29s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 24s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+0 🆗 spotbugs 1m 27s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 1m 25s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 26s the patch passed
+1 💚 compile 0m 28s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javac 0m 28s the patch passed
+1 💚 compile 0m 25s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 javac 0m 25s the patch passed
-0 ⚠️ checkstyle 0m 21s tez-dag: The patch generated 1 new + 146 unchanged - 0 fixed = 147 total (was 146)
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 9s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 9s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 findbugs 1m 10s the patch passed
_ Other Tests _
+1 💚 unit 5m 6s tez-dag in the patch passed.
+1 💚 asflicense 0m 11s The patch does not generate ASF License warnings.
21m 40s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/3/artifact/out/Dockerfile
GITHUB PR #374
JIRA Issue TEZ-4580
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux ea7ca84329e4 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / 1084699
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
checkstyle https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/3/artifact/out/diff-checkstyle-tez-dag.txt
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/3/testReport/
Max. process+thread count 217 (vs. ulimit of 5500)
modules C: tez-dag U: tez-dag
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/3/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog
Copy link
Contributor

@himanshu-mishra : minor checkstyle left, can you please address:
./tez-dag/src/test/java/org/apache/tez/dag/app/rm/TestTaskScheduler.java:1042: // We don't want containers to be assigned to a task by delayedContainerManager as it invokes another preemption flow: Line is longer than 120 characters (found 121). [LineLength]

@tez-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 9s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ master Compile Tests _
+1 💚 mvninstall 6m 30s master passed
+1 💚 compile 0m 25s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 compile 0m 21s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 checkstyle 0m 33s master passed
+1 💚 javadoc 0m 21s master passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 16s master passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+0 🆗 spotbugs 1m 1s Used deprecated FindBugs config; considering switching to SpotBugs.
+1 💚 findbugs 0m 59s master passed
_ Patch Compile Tests _
+1 💚 mvninstall 0m 19s the patch passed
+1 💚 compile 0m 17s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javac 0m 17s the patch passed
+1 💚 compile 0m 13s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 javac 0m 13s the patch passed
+1 💚 checkstyle 0m 13s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 javadoc 0m 7s the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04
+1 💚 javadoc 0m 8s the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
+1 💚 findbugs 0m 47s the patch passed
_ Other Tests _
+1 💚 unit 4m 15s tez-dag in the patch passed.
+1 💚 asflicense 0m 14s The patch does not generate ASF License warnings.
16m 50s
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/4/artifact/out/Dockerfile
GITHUB PR #374
JIRA Issue TEZ-4580
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile
uname Linux 0ae4067883a1 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality personality/tez.sh
git revision master / b95defc
Default Java Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu122.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu222.04-ga
Test Results https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/4/testReport/
Max. process+thread count 202 (vs. ulimit of 5500)
modules C: tez-dag U: tez-dag
Console output https://ci-hadoop.apache.org/job/tez-multibranch/job/PR-374/4/console
versions git=2.34.1 maven=3.6.3 findbugs=3.0.1
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@abstractdog abstractdog self-requested a review December 23, 2024 09:29
Copy link
Contributor

@abstractdog abstractdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1

@abstractdog abstractdog merged commit 9efa6f1 into apache:master Dec 23, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants