
[#2725] fix(spark)(partition-split): Add fallback under load-balance mode and fix stale assignment missing callback that caused timeout#2729

Merged
zuston merged 9 commits into apache:master from zuston:secondsplit
Feb 12, 2026

Conversation

@zuston
Member

@zuston zuston commented Feb 12, 2026

What changes were proposed in this pull request?

  1. Fallback to random server when no servers are available in load-balance mode
  2. Fix stale assignment missing callback in data pusher that caused the writer to hang until timeout, preventing reassign from being triggered
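The second item can be pictured with a minimal sketch. This is not the actual data pusher code: `Block`, `stale`, `onComplete`, and `push` are hypothetical names, and the sketch assumes the writer waits on a per-block completion callback. The point it illustrates is that a block skipped because its assignment is stale must still fire its callback, otherwise the waiting writer only learns about it through the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StaleCallbackSketch {
  // Hypothetical block: an id, a staleness flag, and a completion callback
  // that receives true on success and false on failure.
  static final class Block {
    final int id;
    final boolean stale;
    final Consumer<Boolean> onComplete;

    Block(int id, boolean stale, Consumer<Boolean> onComplete) {
      this.id = id;
      this.stale = stale;
      this.onComplete = onComplete;
    }
  }

  // Returns the ids of the blocks that were actually sent.
  static List<Integer> push(List<Block> blocks) {
    List<Integer> sent = new ArrayList<>();
    for (Block b : blocks) {
      if (b.stale) {
        // Assumption: notify with a failure instead of silently skipping,
        // so the waiting side can react (for example by reassigning).
        b.onComplete.accept(false);
        continue;
      }
      sent.add(b.id);
      b.onComplete.accept(true);
    }
    return sent;
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    List<Block> blocks = new ArrayList<>();
    blocks.add(new Block(1, false, ok -> events.add("block 1 ok=" + ok)));
    blocks.add(new Block(2, true, ok -> events.add("block 2 ok=" + ok)));
    List<Integer> sent = push(blocks);
    System.out.println("sent=" + sent + " events=" + events);
  }
}
```

In this sketch every block produces exactly one callback invocation, whether it was sent or skipped.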

Why are the changes needed?

Fixes #2725. This tricky bug was finally tracked down and fixed after a thorough investigation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests

@zuston zuston changed the title from "[#2725] fix(spark)(partition-split): Fallback under load-balance mode and fix stale assignment missing callback to cause timeout" to "[#2725] fix(spark)(partition-split): Add fallback under load-balance mode and fix stale assignment missing callback that caused timeout" Feb 12, 2026
// 0, 1, 2
int idx = (int) (taskAttemptId % (serverSize - 1)) + 1;
candidate = servers.get(idx);
} else {
Contributor

Actually, this is a strategy. Maybe it would be better to separate mechanism and strategy.

Member Author

This change targets the second-split scenario where all load-balanced servers have already been split.

In this case, we should fall back to selecting a server using the hash-based algorithm.

At the current stage, this approach is sufficient for large-scale partition workloads. Introducing additional mechanisms would add unnecessary complexity without clear benefit.
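A minimal sketch of the hash-based fallback, built around the index expression from the diff above. `pickByHash` and the server names are hypothetical, and the sketch assumes index 0 holds the original server, which is why the result always lands in `1..serverSize-1`. It uses `Math.floorMod` rather than `%` so that a negative id cannot produce an out-of-range index; the diff itself uses `%`.

```java
import java.util.Arrays;
import java.util.List;

public class HashFallbackSketch {
  // Hypothetical helper: maps a task attempt id onto indices 1..serverSize-1,
  // skipping index 0 (assumed here to be the original, already-split server).
  static String pickByHash(List<String> servers, long taskAttemptId) {
    int serverSize = servers.size();
    if (serverSize == 0) {
      throw new IllegalArgumentException("servers must not be empty");
    }
    if (serverSize == 1) {
      // No alternative exists, so return the only server.
      return servers.get(0);
    }
    int idx = (int) Math.floorMod(taskAttemptId, (long) (serverSize - 1)) + 1;
    return servers.get(idx);
  }

  public static void main(String[] args) {
    List<String> servers = Arrays.asList("s0", "s1", "s2", "s3");
    for (long id = 0; id < 4; id++) {
      System.out.println(id + " -> " + pickByHash(servers, id));
    }
  }
}
```

With four servers, ids 0, 1, 2, 3 map to `s1`, `s2`, `s3`, `s1`, so the same task attempt always picks the same server and `s0` is never chosen.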

validBlocks,
() -> !isValidTask(taskId));
// completionCallback should be executed before updating taskToSuccessBlockIds
// structure to avoid side effect
Contributor

Could you elaborate on the side effect?

Member Author

original logic

@github-actions

github-actions bot commented Feb 12, 2026

Test Results

 3 185 files  ± 0   3 185 suites  ±0   6h 52m 23s ⏱️ +37s
 1 245 tests + 2   1 244 ✅ + 2   1 💤 ±0  0 ❌ ±0 
15 774 runs  +20  15 759 ✅ +20  15 💤 ±0  0 ❌ ±0 

Results for commit 0544fac. ± Comparison against base commit 2731cf2.

♻️ This comment has been updated with latest results.

@zuston zuston requested a review from roryqi February 12, 2026 06:03
@zuston zuston merged commit 9c0c27d into apache:master Feb 12, 2026
80 of 81 checks passed
@zuston zuston deleted the secondsplit branch February 12, 2026 07:33

Development

Successfully merging this pull request may close these issues.

[Bug] Block send failed in partition-split mode due to retry time exceeded

2 participants