
insertDeallocate inspects inner scopes #6007

Open
Priya2698 wants to merge 10 commits into main from pm/deallocate

Conversation

@Priya2698 (Collaborator)

No description provided.

github-actions bot commented Feb 24, 2026

Review updated until commit 80a701c

Description

  • Replace simple last-use deallocation with post-dominator tree analysis

  • Add PostDominatorTree class for precise deallocation point determination

  • Implement needsDeallocation function with comprehensive filtering logic

  • Add integration tests for host IR deallocation behavior

Changes walkthrough

Relevant files:

Enhancement — csrc/host_ir/allocate_and_deallocate.cpp (+144/-75)
Post-dominator based deallocation analysis

  • Add PostDominatorTree class for post-dominator analysis
  • Replace insertDeallocations with post-dominator based approach
  • Add needsDeallocation function with filtering for inputs/outputs/sharding
  • Refactor Node class and DominatorTree implementation
  • Update allocation insertion to use insert_before instead of iterator

Tests — tests/cpp/test_host_ir_passes.cpp (+145/-0)
Host IR deallocation integration tests

  • Add HostIrPassesTest fixture with host IR lowering enabled
  • Test TwoMatmulsInlinable scenario with single deallocate in loop body
  • Test TwoMatmulsNotInlinable scenario with top-level deallocate
  • Verify deallocation count and placement in both scenarios

Configuration changes — CMakeLists.txt (+1/-0)
Add new test file to build configuration

  • Add test_host_ir_passes.cpp to HOSTIR_TEST_SRCS list

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

C++20 Features Usage

The code uses C++20 features such as <ranges> and std::views::reverse. These may cause compilation issues on toolchains with older C++ standard support. Consider adding appropriate compile-time checks or providing fallback implementations.

    #include <ranges>
Memory Management in PostDominatorTree

The PostDominatorTree class stores Node objects directly in an unordered_map, but the findLCA method returns pointers to these nodes. Ensure that the lifetime and validity of these pointers are properly managed. (Note that rehashing an unordered_map invalidates iterators but not pointers or references to its elements, so erasure is the main hazard here.)

    const Node* findLCA(const Node* a, const Node* b) const {
      if (a == nullptr) {
        return b;
      }
      if (b == nullptr) {
        return a;
      }
      while (a->depth() > b->depth()) {
        a = a->parent();
      }
      while (b->depth() > a->depth()) {
        b = b->parent();
      }
      while (a != b) {
        a = a->parent();
        b = b->parent();
      }
      return a;
    }
    Error Handling Robustness

    The code assumes that LCA nodes will always be found for tensors that need deallocation. Consider adding more defensive error handling for edge cases where post-dominator analysis might fail to find appropriate insertion points.

      NVF_ERROR(
          lca_node != nullptr, "Could not find post-dominator for tensor ", tv);
      auto* deallocate = IrBuilder::create<hir::Deallocate>(tv);
      lca_node->scope()->insert_after(lca_node->getExpr(), deallocate);
    }

@Priya2698 (Collaborator Author)

!test

Priya2698 (Collaborator Author) commented on:

    /*pre_fn=*/
    [&](const DominatorTree::Node* node) {
      Expr* e = node->getExpr();
      if (auto* alloc = dynamic_cast<kir::Allocate*>(e)) {
        // ... (elided in the review context)
      }
      return true;
    }

Temporary if. With hir::allocate, this can be captured with inputs().

@Priya2698 mentioned this pull request Feb 24, 2026

Priya2698 (Collaborator Author) commented Feb 24, 2026:

This is not complete. Ops like view do not always allocate new tensors. A few things make that analysis tricky:

1. Aliasing information is not available in hic as it is copied over from completeFusion. Question: should we run an aliasing pass in host IR as well for expr-evaluated segments?
2. HostIrEvaluator for LoadStoreOp checks if the out_tv is known and either copies over the data or binds it to a view of the input. HostIrJit always creates a new tensor for ops like permute (see `void* permute_func_ptr = reinterpret_cast<void*>(`…). Question: is this just for simplicity for the first integration?

    One solution can be to explicitly allocate expr eval outputs where needed like we do for matmul/linear. Then, we only deallocate tvs that are allocated.

    The previous version did not make any distinction for view-like ops, so the functionality does not regress.

What do you think, @wujingyue?

Collaborator replied:

    Sorry, I'm missing some context. Can you remind me why this PR needs to change how we decide what needs deallocation? I understood the motivation of looking into loops but I'm missing some connections otherwise.

Priya2698 (Collaborator Author) replied:

> Can you remind me why this PR needs to change how we decide what needs deallocation?

This PR does not necessarily need to change this. But we do need to decide what needs deallocation, since not all ops allocate new tensors.

    I initially started with deallocating only explicitly "allocated" tensorviews. However, that breaks the HostIrJit tests where outputs of view/permute are also new allocated tensors.
If I deallocate everything, that includes outputs of ShardByStream, which are not new tensors but rather slices (we could amend aliasing such that we mark these as aliases of their inputs). Hence, I am placing some minimum conditions on what needs deallocation for current use cases.

Collaborator replied:

> If I deallocate everything, that includes outputs of ShardByStream which are not new tensors

    Got it. This is actually the old behavior. It didn't trigger this problem because ShardByStream is never top-level.

Collaborator commented:

    HostIrEvaluator handles deallocation by removing the tensor from the underlying hash table. It doesn't always free the memory. What problems did you run into with ShardByStream exactly?

    I can try it myself tomorrow. Not on a computer right now

Priya2698 (Collaborator Author) commented Feb 26, 2026:

> HostIrEvaluator handles deallocation by removing the tensor from the underlying hash table. It doesn't always free the memory.

Correct. I did not run into any errors with existing tests since handle(Deallocate*) only invalidates. But looking at the HostIrJit behavior, it actually deletes the tensor, so I avoided adding deallocation statements for those.

    For simplicity, if you prefer, I can remove the additional conditions from this PR, and we can discuss that in a separate PR.

@Priya2698 (Collaborator Author)

    !test

@Priya2698 (Collaborator Author)

    !test

    @Priya2698 Priya2698 marked this pull request as ready for review February 25, 2026 00:05
    @Priya2698 Priya2698 requested a review from wujingyue February 25, 2026 00:06
    greptile-apps bot commented Feb 25, 2026

    Greptile Summary

    Refactored insertDeallocations to use post-dominator tree analysis with LCA (Lowest Common Ancestor) to determine optimal deallocation placement in inner scopes, enabling more efficient memory management for inlined operations.

    Key Changes:

    • Extracted Node class from DominatorTree and added depth tracking
    • Implemented PostDominatorTree class that builds tree in reverse order and computes LCA for tensor last-use analysis
    • Rewrote insertDeallocations to insert deallocations at the LCA node in the post-dominator tree instead of simple reverse iteration
    • Added needsDeallocation helper to filter fusion inputs/outputs and aliased buffers
    • Added comprehensive tests verifying correct deallocation placement for both inlinable and non-inlinable matmul scenarios

    Issues Found:

    • Potential null pointer dereference in needsDeallocation when calling tv->definition()->isA<ShardByStream>() without null check

    Confidence Score: 3/5

    • This PR introduces a critical null pointer dereference that needs to be fixed before merging
    • The refactoring is well-structured with comprehensive tests, but contains a null pointer dereference bug that could cause runtime crashes when processing tensors without definitions
    • Pay close attention to csrc/host_ir/allocate_and_deallocate.cpp line 269 for the null pointer dereference fix

    Important Files Changed

• CMakeLists.txt — Added test_host_ir_passes.cpp to build configuration
• csrc/host_ir/allocate_and_deallocate.cpp — Refactored deallocation logic to use post-dominator tree and LCA analysis; potential null pointer dereference at line 269
• tests/cpp/test_host_ir_passes.cpp — Added comprehensive tests for deallocation placement in inlinable and non-inlinable scenarios

    Last reviewed commit: 80a701c

greptile-apps bot left a comment:

    3 files reviewed, 2 comments


    if (!needsDeallocation(allocated_tv)) {
      continue;
    }
    const DominatorTree::Node* last_use_node = last_use.at(allocated_tv);

Contributor (greptile-apps bot) commented:

.at() will throw if allocated_tv is not in the last_use map. This can happen if a tensor is allocated (via kir::Allocate) but never used as input to any expression. In post_fn (line 240-248), last_use is only populated for allocated tensors and tensors used as inputs; if an allocated tensor is never used as an input, it won't be in last_use.

    if (tv->isFusionOutput()) {
      return false;
    }
    if (tv->definition()->isA<ShardByStream>()) {

Contributor (greptile-apps bot) commented:

Add a null check before dereferencing tv->definition(). While fusion inputs are filtered at line 198, there may be other cases where definition() is nullptr (e.g., intermediate values without definitions).

Suggested change:

    - if (tv->definition()->isA<ShardByStream>()) {
    + if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {


    if (in->isFusionOutput()) {
      continue;
    }
    DominatorTree dom_tree(hic);

Collaborator commented:

    (I haven't yet read the code carefully so I could well be missing some important details.)

In compilers, there is a concept called a post-dominator tree, which could lead to a more general solution. A post-dominator tree is similar to a dominator tree but with the control flow reversed. IIUC, the position to deallocate a tv (if it needs to be deallocated) is the nearest post-dominator of the tv and all its uses. This can be computed as their lowest common ancestor (LCA) in the post-dominator tree.

    I WFH today but I'm happy to explain this in person tomorrow or Friday.

Priya2698 (Collaborator Author) replied:

    Can you take a look at the latest commit for the post-dominator tree version?

    namespace {

    class DominatorTree {
      class Node {

Priya2698 (Collaborator Author) commented:

TODO: create different nodes in each tree to avoid overloading too much.

    greptile-apps bot commented Feb 25, 2026

    Additional Comments (1)

    csrc/host_ir/allocate_and_deallocate.cpp, line 269
    Add null check before dereferencing definition(). While fusion inputs/outputs are filtered above, a TensorView may not have a definition in other cases (e.g., allocated but unused tensors).

      if (tv->definition() != nullptr && tv->definition()->isA<ShardByStream>()) {
    
