
insertDeallocate inspects inner scopes #6007

Open
Priya2698 wants to merge 10 commits into main from pm/deallocate

Conversation

@Priya2698 (Collaborator)

No description provided.

github-actions bot commented Feb 24, 2026

Review updated until commit 80a701c

Description

  • Replace simple last-use deallocation with post-dominator tree analysis

  • Add PostDominatorTree class for precise deallocation point determination

  • Implement needsDeallocation function with comprehensive filtering logic

  • Add integration tests for host IR deallocation behavior

Changes walkthrough

Relevant files:

Enhancement — csrc/host_ir/allocate_and_deallocate.cpp (+144/-75)
Post-dominator based deallocation analysis

  • Add PostDominatorTree class for post-dominator analysis
  • Replace insertDeallocations with post-dominator based approach
  • Add needsDeallocation function with filtering for inputs/outputs/sharding
  • Refactor Node class and DominatorTree implementation
  • Update allocation insertion to use insert_before instead of iterator

Tests — tests/cpp/test_host_ir_passes.cpp (+145/-0)
Host IR deallocation integration tests

  • Add HostIrPassesTest fixture with host IR lowering enabled
  • Test TwoMatmulsInlinable scenario with single deallocate in loop body
  • Test TwoMatmulsNotInlinable scenario with top-level deallocate
  • Verify deallocation count and placement in both scenarios

Configuration changes — CMakeLists.txt (+1/-0)
Add new test file to build configuration

  • Add test_host_ir_passes.cpp to HOSTIR_TEST_SRCS list

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

C++20 Features Usage

The code uses C++20 features such as <ranges> and std::views::reverse. These may cause compilation issues on toolchains with older C++ standard support. Consider adding appropriate compile-time checks or providing fallback implementations.

    #include <ranges>
Memory Management in PostDominatorTree

The PostDominatorTree class stores Node objects directly in an unordered_map, but the findLCA method returns pointers to these nodes. Ensure that the lifetime and validity of these pointers are properly managed. (Note that rehashing an unordered_map invalidates iterators but not pointers or references to its elements, so erasure is the main hazard here.)

    const Node* findLCA(const Node* a, const Node* b) const {
      if (a == nullptr) {
        return b;
      }
      if (b == nullptr) {
        return a;
      }
      while (a->depth() > b->depth()) {
        a = a->parent();
      }
      while (b->depth() > a->depth()) {
        b = b->parent();
      }
      while (a != b) {
        a = a->parent();
        b = b->parent();
      }
      return a;
    }
    Error Handling Robustness

    The code assumes that LCA nodes will always be found for tensors that need deallocation. Consider adding more defensive error handling for edge cases where post-dominator analysis might fail to find appropriate insertion points.

      NVF_ERROR(
          lca_node != nullptr, "Could not find post-dominator for tensor ", tv);
      auto* deallocate = IrBuilder::create<hir::Deallocate>(tv);
      lca_node->scope()->insert_after(lca_node->getExpr(), deallocate);
    }

@Priya2698 (Collaborator Author)

!test

Priya2698 (Collaborator Author) commented on:

    /*pre_fn=*/
    [&](const DominatorTree::Node* node) {
      Expr* e = node->getExpr();
      if (auto* alloc = dynamic_cast<kir::Allocate*>(e)) {
        // ... (elided in the review context)
      }
      return true;
    }

Temporary if. With hir::allocate, this can be captured with inputs().

@Priya2698 mentioned this pull request Feb 24, 2026

Priya2698 (Collaborator Author) commented Feb 24, 2026:

This is not complete. Ops like view do not always allocate new tensors. A few things make that analysis tricky:

1. Aliasing information is not available in hic as it is copied over from completeFusion. Question: should we run an aliasing pass in host IR as well for expr-evaluated segments?
2. HostIrEvaluator for LoadStoreOp checks if the out_tv is known and either copies over the data or binds it to a view of the input. HostIrJit always creates a new tensor for ops like permute (see `void* permute_func_ptr = reinterpret_cast<void*>(`…). Question: is this just for simplicity for the first integration?

    One solution can be to explicitly allocate expr eval outputs where needed like we do for matmul/linear. Then, we only deallocate tvs that are allocated.

    The previous version did not make any distinction for view-like ops, so the functionality does not regress.

What do you think, @wujingyue?

Collaborator replied:

    Sorry, I'm missing some context. Can you remind me why this PR needs to change how we decide what needs deallocation? I understood the motivation of looking into loops but I'm missing some connections otherwise.

Priya2698 (Collaborator Author) replied:

> Can you remind me why this PR needs to change how we decide what needs deallocation?

This PR does not necessarily need to change this. But we do need to decide what needs deallocation, since not all ops allocate new tensors.

    I initially started with deallocating only explicitly "allocated" tensorviews. However, that breaks the HostIrJit tests where outputs of view/permute are also new allocated tensors.
If I deallocate everything, that includes outputs of ShardByStream, which are not new tensors but rather slices (we could amend aliasing such that we mark these as aliases of their inputs). Hence, I am placing some minimum conditions on what needs deallocation for current use cases.

Collaborator replied:

> If I deallocate everything, that includes outputs of ShardByStream which are not new tensors

    Got it. This is actually the old behavior. It didn't trigger this problem because ShardByStream is never top-level.

Collaborator commented:

    HostIrEvaluator handles deallocation by removing the tensor from the underlying hash table. It doesn't always free the memory. What problems did you run into with ShardByStream exactly?

    I can try it myself tomorrow. Not on a computer right now

Priya2698 (Collaborator Author) commented Feb 26, 2026:

> HostIrEvaluator handles deallocation by removing the tensor from the underlying hash table. It doesn't always free the memory.

Correct. I did not run into any errors with existing tests since handle(Deallocate*) only invalidates. But looking at the HostIrJit behavior, it actually deletes the tensor, so I avoided adding deallocation statements for those.

    For simplicity, if you prefer, I can remove the additional conditions from this PR, and we can discuss that in a separate PR.

@Priya2698 (Collaborator Author)

    !test

@Priya2698 (Collaborator Author)

    !test

    @Priya2698 Priya2698 marked this pull request as ready for review February 25, 2026 00:05
    @Priya2698 Priya2698 requested a review from wujingyue February 25, 2026 00:06
    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    Refactored insertDeallocations to use post-dominator tree analysis with LCA (Lowest Common Ancestor) to determine optimal deallocation placement in inner scopes, enabling more efficient memory management for inlined operations.

    Key Changes:

    • Extracted Node class from DominatorTree and added depth tracking
    • Implemented PostDominatorTree class that builds tree in reverse order and computes LCA for tensor last-use analysis
    • Rewrote insertDeallocations to insert deallocations at the LCA node in the post-dominator tree instead of simple reverse iteration
    • Added needsDeallocation helper to filter fusion inputs/outputs and aliased buffers
    • Added comprehensive tests verifying correct deallocation placement for both inlinable and non-inlinable matmul scenarios

    Issues Found:

    • Potential null pointer dereference in needsDeallocation when calling tv->definition()->isA<ShardByStream>() without null check

    Confidence Score: 3/5

    • This PR introduces a critical null pointer dereference that needs to be fixed before merging
    • The refactoring is well-structured with comprehensive tests, but contains a null pointer dereference bug that could cause runtime crashes when processing tensors without definitions
    • Pay close attention to csrc/host_ir/allocate_and_deallocate.cpp line 269 for the null pointer dereference fix

    Important Files Changed

• CMakeLists.txt — Added test_host_ir_passes.cpp to build configuration
• csrc/host_ir/allocate_and_deallocate.cpp — Refactored deallocation logic to use post-dominator tree and LCA analysis; potential null pointer dereference at line 269
• tests/cpp/test_host_ir_passes.cpp — Added comprehensive tests for deallocation placement in inlinable and non-inlinable scenarios

    Last reviewed commit: 80a701c

greptile-apps bot left a comment:

    3 files reviewed, 2 comments


    if (!needsDeallocation(allocated_tv)) {
      continue;
    }
    const DominatorTree::Node* last_use_node = last_use.at(allocated_tv);

Contributor (greptile-apps bot) commented:

.at() will throw if allocated_tv is not in the last_use map. This can happen if a tensor is allocated (via kir::Allocate) but never used as input to any expression. In post_fn (line 240-248), last_use is only populated for allocated tensors and tensors used as inputs; if an allocated tensor is never used as an input, it won't be in last_use.

    if (tv->isFusionOutput()) {
      return false;
    }
    if (tv->definition()->isA<ShardByStream>()) {

Contributor (greptile-apps bot) commented:

Add a null check before dereferencing tv->definition(). While fusion inputs are filtered at line 198, there may be other cases where definition() is nullptr (e.g., intermediate values without definitions).

Suggested change:

    - if (tv->definition()->isA<ShardByStream>()) {
    + if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {


    if (in->isFusionOutput()) {
      continue;
    }
    DominatorTree dom_tree(hic);

Collaborator commented:

    (I haven't yet read the code carefully so I could well be missing some important details.)

In compilers, there is a concept called a post-dominator tree, which could lead to a more general solution. A post-dominator tree is similar to a dominator tree but with the control flow reversed. IIUC, the position to deallocate a tv (if it needs to be deallocated) is the nearest post-dominator of the tv and all its uses. This can be computed as their lowest common ancestor (LCA) in the post-dominator tree.

    I WFH today but I'm happy to explain this in person tomorrow or Friday.

Priya2698 (Collaborator Author) replied:

    Can you take a look at the latest commit for the post-dominator tree version?

    namespace {

    class DominatorTree {
      class Node {

Priya2698 (Collaborator Author) commented:

TODO: create different nodes in each tree to avoid overloading too much.

    greptile-apps bot commented Feb 25, 2026

    Additional Comments (1)

    csrc/host_ir/allocate_and_deallocate.cpp, line 269
    Add null check before dereferencing definition(). While fusion inputs/outputs are filtered above, a TensorView may not have a definition in other cases (e.g., allocated but unused tensors).

      if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {
    
