Conversation

Review updated until commit 2078dbc

Description

Relevant files: Enhancement, Tests, Configuration changes
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests

⚡ Recommended focus areas for review
Post-dominator tree build iteration issue
In `PostDominatorTree::build` (lines 154-176), the reverse iteration uses `auto it = exprs.end(); it != exprs.begin(); --it`. This pattern is problematic because `exprs.end()` is a valid one-past-the-end iterator, and dereferencing it would be undefined behavior. While the loop body doesn't dereference it directly (it decrements first), the pattern is fragile and could cause issues if the container is empty. Consider adding an explicit empty check or using `std::ranges::reverse_view` for safer reverse iteration.
…located when not inside for loop
!test

!test

!test
Greptile Summary

Refactors …

The refactoring addresses previous concerns about …

Confidence Score: 4/5

Important Files Changed

Last reviewed commit: 2078dbc
Additional Comments (2)
The diff shows this was introduced by this PR, changing the constructor to

```cpp
Node(Scope* scope, Scope::Iterator iterator, const Node* parent)
```

where `Scope::Iterator` is a forward iterator, while the build loop iterates in reverse. Beyond the type error, even if the iterator were stored, the later insertion relies on it being a forward iterator:

```cpp
lca_node->scope()->insert(std::next(lca_node->iterator()), deallocate);
```

The fix is to convert to a forward iterator before storing. The element pointed to by reverse iterator `it` corresponds to the forward iterator `std::prev(it.base())`:

```cpp
for (auto it = exprs.rbegin(); it != exprs.rend(); ++it) {
  Expr* e = *it;
  Scope::Iterator fwd_it = std::prev(it.base());
  auto [node_it, inserted] = nodes_.try_emplace(e, &scope, fwd_it, parent);
}
```
!test

!test
wujingyue
left a comment
Almost there. I'm reviewing the LCA part...
```cpp
// For each TensorView that is allocated or used as an input, find its
// least common ancestor in the Post-dominator Tree — the latest point at which
// it can be deallocated.
std::unordered_map<TensorView*, const Node*> computeLeastCommonAncestor(
```

Suggested change:

```diff
-std::unordered_map<TensorView*, const Node*> computeLeastCommonAncestor(
+std::unordered_map<TensorView*, const Node*> computeLowestCommonAncestor(
```
```cpp
}
PostDominatorTree post_dominator_tree(hic);
const std::unordered_map<TensorView*, const Node*>& lca_map =
    computeLeastCommonAncestor(post_dominator_tree);
```
Consider wrapping this in a class so we don't have to expose `std::unordered_map` to the user.

```cpp
lcas = LowestCommonAncestors(post_dominator_tree);
lcas.getLca(tv);
```
What is the downside of exposing the lca_map?
```cpp
std::unordered_map<const Node*, int64_t> depth;

auto findLCA = [&](const Node* a, const Node* b) -> const Node* {
  if (a == nullptr) {
    return b;
  }
  if (b == nullptr) {
    return a;
  }
  int64_t depth_a = depth.at(a);
  int64_t depth_b = depth.at(b);
  while (depth_a > depth_b) {
    a = a->parent();
    depth_a--;
  }
  while (depth_b > depth_a) {
    b = b->parent();
    depth_b--;
  }
  while (a != b) {
    a = a->parent();
    b = b->parent();
  }
  return a;
};
```
Consider making it a private method of class `LowestCommonAncestors`, and making `depth` a member `LowestCommonAncestors::depth_`.
```cpp
// Traverse the IR and collect all allocated TensorViews, removing them when
// a Deallocate is encountered.
void collectPersistentTensorViews(
    const Scope& scope,
    std::unordered_set<TensorView*>& allocated) {
  for (Expr* e : scope.exprs()) {
    if (auto* dealloc = dynamic_cast<hir::Deallocate*>(e)) {
      allocated.erase(dealloc->buffer());
      continue;
    }
    if (auto* alloc = dynamic_cast<kir::Allocate*>(e)) {
      allocated.insert(alloc->buffer()->as<TensorView>());
      continue;
    }
    for (auto* tv : ir_utils::filterByType<TensorView>(e->inputs())) {
      allocated.insert(tv);
    }
    for (auto* tv : ir_utils::filterByType<TensorView>(e->outputs())) {
      allocated.insert(tv);
    }
    if (auto* loop = dynamic_cast<hir::ForLoop*>(e)) {
      collectPersistentTensorViews(loop->body(), allocated);
    }
  }
}
```
```cpp
void checkMemoryLeak(const hir::HostIrContainer& hic) {
  std::unordered_set<TensorView*> allocated;
  collectPersistentTensorViews(hic.topLevel(), allocated);
  EXPECT_TRUE(std::all_of(
      allocated.begin(),
      allocated.end(),
      [](TensorView* tv) {
        return tv->isFusionInput() || tv->isFusionOutput();
      }))
      << "Some TensorViews allocated in IR are not deallocated and not fusion "
         "inputs/outputs.";
}
```
I understood your intention for better coverage, but these helpers make the test less DAMP: https://testing.googleblog.com/2019/12/testing-on-toilet-tests-too-dry-make.html. Someone debugging a test also has to understand the logic of these two functions.
It may work better if we simply test the number of deallocations in the host IR container.
I see your point.

What do you think about testing the deallocation for the intermediate output -- that is, the number of deallocations in top_level?

When testing directly, I don't find it obvious upfront how many intermediates will get a Deallocate -- hence the traversal-based approach, so I don't have to figure that out when reading the test.
Made-with: Cursor
No description provided.