JacobZheng0927
Contributor

What changes were proposed in this pull request?

This PR optimizes BlockManager remove operations by introducing cached mappings to eliminate O(n) linear scans. The main changes are:

  1. Introduced three concurrent hash maps to track block ID associations:

    • rddToBlockIds: Maps RDD ID to its block IDs
    • broadcastToBlockIds: Maps broadcast ID to its block IDs
    • sessionToBlockIds: Maps session UUID to its cache block IDs
  2. Added cache maintenance methods:

    • addToCache(blockId): Updates caches when blocks are stored
    • removeFromCache(blockId): Updates caches when blocks are deleted
  3. Reworked remove operations to use cached lookups:

    • removeRdd(), removeBroadcast(), and removeCache() now find the affected block IDs with an O(1) map lookup instead of scanning all entries
  4. Integrated with block lifecycle:

    • doPutIterator() calls addToCache() after successful block storage
    • removeBlock() calls removeFromCache() when blocks are removed
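
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `BlockIndex` class, the simplified `BlockId` case classes, and the `blockIdsFor*` accessors are hypothetical stand-ins, and the sketch assumes Scala 2.13 or later for `scala.jdk.CollectionConverters`.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Hypothetical, simplified stand-ins for Spark's block ID types.
sealed trait BlockId
final case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
final case class BroadcastBlockId(broadcastId: Long, field: String = "") extends BlockId
final case class CacheId(sessionUUID: String, hash: String) extends BlockId

// Illustrative index only; the real BlockManager integration is more involved.
class BlockIndex {
  private type Index[K] = ConcurrentHashMap[K, java.util.Set[BlockId]]

  private val rddToBlockIds = new Index[Int]()
  private val broadcastToBlockIds = new Index[Long]()
  private val sessionToBlockIds = new Index[String]()

  // compute() runs atomically per key, so add and remove cannot interleave on one key.
  private def add[K](index: Index[K], key: K, id: BlockId): Unit =
    index.compute(key, (_, ids) => {
      val set: java.util.Set[BlockId] =
        if (ids == null) ConcurrentHashMap.newKeySet[BlockId]() else ids
      set.add(id)
      set
    })

  // Returning null from computeIfPresent drops the key once its set is empty.
  private def remove[K](index: Index[K], key: K, id: BlockId): Unit =
    index.computeIfPresent(key, (_, ids) => {
      ids.remove(id)
      if (ids.isEmpty) null else ids
    })

  private def lookup[K](index: Index[K], key: K): Set[BlockId] =
    Option(index.get(key)).map(_.asScala.toSet).getOrElse(Set.empty[BlockId])

  // Called after a block has been stored.
  def addToCache(blockId: BlockId): Unit = blockId match {
    case RDDBlockId(rddId, _)      => add(rddToBlockIds, rddId, blockId)
    case BroadcastBlockId(bid, _)  => add(broadcastToBlockIds, bid, blockId)
    case CacheId(sessionUUID, _)   => add(sessionToBlockIds, sessionUUID, blockId)
  }

  // Called when a block is removed.
  def removeFromCache(blockId: BlockId): Unit = blockId match {
    case RDDBlockId(rddId, _)      => remove(rddToBlockIds, rddId, blockId)
    case BroadcastBlockId(bid, _)  => remove(broadcastToBlockIds, bid, blockId)
    case CacheId(sessionUUID, _)   => remove(sessionToBlockIds, sessionUUID, blockId)
  }

  // Lookups cost O(1) plus the size of the result, independent of total block count.
  def blockIdsForRdd(rddId: Int): Set[BlockId] = lookup(rddToBlockIds, rddId)
  def blockIdsForBroadcast(broadcastId: Long): Set[BlockId] = lookup(broadcastToBlockIds, broadcastId)
  def blockIdsForSession(sessionUUID: String): Set[BlockId] = lookup(sessionToBlockIds, sessionUUID)
}
```

Mutating each per-key set inside `compute` is one way to keep an add from racing with the removal of an emptied set; the PR itself may handle this differently.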

Why are the changes needed?

Previously, removeRdd(), removeBroadcast(), and removeCache() required scanning all blocks in blockInfoManager.entries to find matches. This approach becomes a serious bottleneck when:

  1. Large block counts: In production deployments with millions or even tens of millions of cached blocks, linear scans can be prohibitively slow
  2. High cleanup frequency: Workloads that repeatedly create and discard RDDs or broadcast variables accumulate overhead quickly

The original removeRdd() method already contained a TODO noting that an additional mapping would be needed to avoid linear scans. This PR implements that improvement.
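
For context, the scan-based approach looks roughly like the sketch below. It is an illustration under simplifying assumptions, not the original Spark code: the `BlockId` case classes and `LinearScanSketch` are hypothetical, and a plain collection stands in for blockInfoManager.entries.

```scala
// Hypothetical, simplified stand-ins for Spark's block ID types.
sealed trait BlockId
final case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
final case class BroadcastBlockId(broadcastId: Long) extends BlockId

object LinearScanSketch {
  // Visits every tracked block to find the ones belonging to a single RDD,
  // so the cost grows with the total number of blocks, not with the result size.
  def blocksOfRddByScan(entries: Iterable[BlockId], rddId: Int): Seq[BlockId] =
    entries.collect { case id @ RDDBlockId(`rddId`, _) => id }.toSeq
}
```

With millions of entries, each call pays for the full traversal even when the RDD owns only a handful of blocks.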

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Unit tests: Verified the correctness of removeRdd(), removeBroadcast(), and removeCache(), including edge cases.
  • Stress tests: Ran multiple simple tasks using broadcast joins under sustained high concurrency to validate performance and stability of the optimized remove operations.

Before optimization: [image]

After optimization: [image]

The optimization delivers significant performance improvements for block cleanup under large data volumes, reducing the overhead caused by frequent GC when blocks accumulate.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Sep 3, 2025