[#2725] fix(spark)(partition-split): Add fallback under load-balance mode and fix stale assignment missing callback that caused timeout by zuston · Pull Request #2729 · apache/uniffle

zuston · 2026-02-12T03:39:19Z

What changes were proposed in this pull request?

Fall back to a random server when no servers are available in load-balance mode.
Fix the missing callback for stale assignments in the data pusher, which caused the writer to hang until timeout and prevented reassignment from being triggered.

Why are the changes needed?

Fixes #2725. Finally tracked down and fixed this tricky bug after a thorough investigation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests

…e available under load-balance mode

roryqi · 2026-02-12T04:01:19Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/MutableShuffleHandleInfo.java

            // 0, 1, 2
            int idx = (int) (taskAttemptId % (serverSize - 1)) + 1;
            candidate = servers.get(idx);
+          } else {


Actually, this is a strategy. Maybe it would be better to separate mechanism and strategy.

This change targets the second-split scenario where all load-balanced servers have already been split.

In this case, we should fall back to selecting a server using the hash-based algorithm.

At the current stage, this approach is sufficient for large-scale partition workloads. Introducing additional mechanisms would add unnecessary complexity without clear benefit.
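To make the fallback concrete, here is a minimal sketch of the selection described above. It is not the actual `MutableShuffleHandleInfo` code: the class name, the `pickServer` method, the `alreadySplit` set, and the way the load-balance pass skips split servers are all assumptions for illustration. Only the `taskAttemptId % (serverSize - 1) + 1` index comes from the diff.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Uniffle implementation.
class SplitServerSelectionSketch {

  // servers.get(0) stands for the original server; indexes 1..n-1 are replacement candidates.
  static String pickServer(List<String> servers, Set<String> alreadySplit, long taskAttemptId) {
    int replacements = servers.size() - 1;
    if (replacements < 1) {
      throw new IllegalArgumentException("need at least one replacement server");
    }
    // Same index as in the diff; floorMod only guards against a negative id.
    int hashed = (int) Math.floorMod(taskAttemptId, (long) replacements);

    // Assumed load-balance pass: starting at the hashed slot, take the first
    // candidate that has not been split yet.
    for (int i = 0; i < replacements; i++) {
      String candidate = servers.get(1 + (hashed + i) % replacements);
      if (!alreadySplit.contains(candidate)) {
        return candidate;
      }
    }

    // Fallback for the second-split case: every candidate has already been
    // split, so use the plain hash-based pick instead of returning nothing.
    return servers.get(hashed + 1);
  }

  public static void main(String[] args) {
    List<String> servers = Arrays.asList("s0", "s1", "s2", "s3");
    Set<String> none = Collections.emptySet();
    Set<String> all = new HashSet<>(Arrays.asList("s1", "s2", "s3"));
    System.out.println(pickServer(servers, none, 4L)); // s2
    System.out.println(pickServer(servers, all, 4L));  // s2, reached through the fallback
  }
}
```

The point of the sketch is only that the method always returns a candidate, even when the load-balance pass finds nothing, which is what the added `else` branch is for.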

roryqi · 2026-02-12T04:08:35Z

client-spark/common/src/main/java/org/apache/spark/shuffle/writer/DataPusher.java

+                          validBlocks,
+                          () -> !isValidTask(taskId));
+                  // completionCallback should be executed before updating taskToSuccessBlockIds
+                  // structure to avoid side effect


Could you elaborate more on the side effect?

original logic
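For readers following the thread, a rough sketch of the ordering the code comment refers to, and of why a missing callback makes the writer hang. Everything here is hypothetical (`WriterState`, `onBlockCompleted`, `push`, and the latch-based waiting are stand-ins, not the real `DataPusher` API); it only assumes that the writer waits until every pushed block has reported completion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the Uniffle implementation.
class StaleAssignmentCallbackSketch {

  // Writer-side bookkeeping: the latch holds one count per pushed block.
  static final class WriterState {
    final CountDownLatch pending;
    final List<Long> failedBlockIds = new ArrayList<>();
    final Set<Long> successBlockIds = ConcurrentHashMap.newKeySet();

    WriterState(int blockCount) {
      this.pending = new CountDownLatch(blockCount);
    }

    // Stand-in for the completion callback: records failures and releases one count.
    synchronized void onBlockCompleted(long blockId, boolean success) {
      if (!success) {
        failedBlockIds.add(blockId); // the writer can reassign these
      }
      pending.countDown();
    }
  }

  static void push(List<Long> blockIds, Set<Long> staleBlockIds, WriterState state) {
    for (long blockId : blockIds) {
      // Blocks with a stale assignment are not sent and count as failed.
      boolean success = !staleBlockIds.contains(blockId);
      // The callback runs for every block, stale ones included, and before the
      // success set is updated. Skipping it for stale blocks would leave the
      // latch above zero, so the writer would wait until its timeout.
      state.onBlockCompleted(blockId, success);
      if (success) {
        state.successBlockIds.add(blockId);
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    WriterState state = new WriterState(3);
    push(Arrays.asList(1L, 2L, 3L), Collections.singleton(2L), state);
    boolean finished = state.pending.await(1, TimeUnit.SECONDS);
    System.out.println("finished=" + finished + " failed=" + state.failedBlockIds);
  }
}
```

In this sketch block 2 never reaches a server, yet the writer still learns about it through `failedBlockIds` and can trigger a reassignment instead of waiting out the timeout.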

github-actions · 2026-02-12T04:38:17Z

Test Results

3 185 files ±0, 3 185 suites ±0, 6h 52m 23s ⏱️ +37s
1 245 tests +2: 1 244 ✅ +2, 1 💤 ±0, 0 ❌ ±0
15 774 runs +20: 15 759 ✅ +20, 15 💤 ±0, 0 ❌ ±0

Results for commit 0544fac. ± Comparison against base commit 2731cf2.

♻️ This comment has been updated with latest results.

zuston added 7 commits February 12, 2026 10:20

[apache#2725] fix(spark): Fallback on second split when no servers ar…

b7d5e42

…e available under load-balance mode

fix

ade6a73

correct logs

99f7884

fix

d4bfeb0

detailed message

3d7a038

fix race condition

6fcde96

add tests

5040fba

zuston mentioned this pull request Feb 12, 2026

[Bug] Block send failed in partition-split mode due to retry time exceeded #2725

Closed

ci fix

e54434c

roryqi reviewed Feb 12, 2026

ci fix

0544fac

zuston requested a review from roryqi February 12, 2026 06:03

roryqi approved these changes Feb 12, 2026

zuston linked an issue Feb 12, 2026 that may be closed by this pull request

[Bug] Block send failed in partition-split mode due to retry time exceeded #2725

Closed

zuston merged commit 9c0c27d into apache:master Feb 12, 2026
80 of 81 checks passed

zuston deleted the secondsplit branch February 12, 2026 07:33

zuston merged 9 commits into apache:master from zuston:secondsplit