perf: Reduce copy overhead in inline passes #689
Conversation
Greptile Summary
Confidence score: 3/5
Additional Comments (1)
- numba_cuda/numba/cuda/core/inline_closurecall.py, lines 357-365 (link) — style: Commented-out code should be removed before merging, or a clear decision made about its necessity. Are you planning to remove this commented code block, or is it intentionally left for further testing?
1 file reviewed, 1 comment
Pull request overview
This WIP PR aims to reduce copy overhead in CUDA inline passes by optimizing how the callee_ir object is copied during inlining operations. The changes focus on eliminating redundant copy operations and making copying conditional based on whether the original IR needs to be preserved.
Key Changes
- Reordered logic to save a reference to the original `callee_ir` before copying, eliminating one redundant copy operation (see the sketch after this list)
- Replaced `copy.deepcopy()` with `Block.copy()` for a lighter-weight copying approach
- Added a `preserve_ir` parameter to make copying conditional, with `preserve_ir=False` when calling from `inline_function`, where the `callee_ir` is freshly generated
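As a rough sketch of that reordering (assumed shapes and names for illustration only, not the actual `InlineWorker` code):

```python
import copy

def inline_ir_old(callee_ir):
    # Old scheme: copy the incoming IR to get a mutable version, then
    # copy that copy again to preserve an "original" -- two deep copies.
    callee_ir = copy.deepcopy(callee_ir)
    callee_ir_original = copy.deepcopy(callee_ir)
    return callee_ir, callee_ir_original

def inline_ir_new(callee_ir):
    # New scheme: keep a reference to the unmutated incoming IR first,
    # then make a single copy that is safe to mutate.
    callee_ir_original = callee_ir
    callee_ir = copy.deepcopy(callee_ir)
    return callee_ir, callee_ir_original
```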
Greptile found no issues!
After finding some history on this issue in the core Numba repo, I have removed a replacement of the
PR description
This PR contains changes to the `InlineWorker` called during the `InlineInlinables` and `InlineOverloads` passes. The `inline_ir` method accepts a `callee_ir` `FunctionIR` object, which is deepcopied for safety before being mutated. The deepcopy operation is taxing in kernels with many large nested inlined device functions. Further information, including compile time results, is available in #688.

This PR contains two changes to the safety-copy portion of the `inline_ir` method and its call site in `inline_function`:
1. The `callee_ir` object was copied once to create a mutable version, then that mutable copy was copied again to preserve the original. 4279abd saves a reference to the incoming `callee_ir` on entry, then copies this for the mutable copy, saving one copy operation. This affords a ~40% performance improvement for the same outcome (a toy timing comparison follows this list).
2. In the `InlineInlinables` pass, `inline_ir` is called from the `inline_function` entry point. `inline_function` runs all untyped passes on the function to be inlined, generating a new `callee_ir` object per inline. The preserved unmutated `callee_ir_original` is returned, but never consumed. There is therefore no need to preserve the unmutated `callee_ir` object when entering from `inline_function`. 893d8a8 adds a `preserve_ir=True` argument to `inline_ir` and calls it with `preserve_ir=False`. `inline_ir` is also called from the `InlineOverloads` pass, which does keep and reuse `callee_ir`. This (and all future) entry points fall back to the default `preserve_ir=True` argument, so the provided IR is not mutated.

```diff
  # Copy the IR if it should be preserved.
+ if preserve_ir:
      def copy_ir(the_ir):
          kernel_copy = the_ir.copy()
          kernel_copy.blocks = {}
          for block_label, block in the_ir.blocks.items():
              new_block = copy.deepcopy(the_ir.blocks[block_label])
              kernel_copy.blocks[block_label] = new_block
          return kernel_copy

      callee_ir = copy_ir(callee_ir)
```
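To make the effect of the first change concrete, here is a self-contained toy comparison (plain nested Python structures standing in for a `FunctionIR`; the exact numbers will vary, this only sketches why dropping one deepcopy helps):

```python
import copy
import timeit

# A nested structure standing in for an IR with many blocks.
fake_ir = {label: [{"op": "call", "args": list(range(20))} for _ in range(50)]
           for label in range(100)}

def two_copies(ir):
    # Old scheme: one copy to mutate, a second copy to preserve.
    mutable = copy.deepcopy(ir)
    preserved = copy.deepcopy(mutable)
    return mutable, preserved

def one_copy(ir):
    # New scheme: preserve by reference, copy once for mutation.
    preserved = ir
    mutable = copy.deepcopy(ir)
    return mutable, preserved

print("two copies:", timeit.timeit(lambda: two_copies(fake_ir), number=10))
print("one copy:  ", timeit.timeit(lambda: one_copy(fake_ir), number=10))
```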
Tests

No tests have been added, as I'm not sure what would change other than performance if this behaviour regressed. All tests in the numba-cuda suite pass, as do the regression tests for IR mutation in Numba's test_ir_inlining suite.
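As a loose illustration of the property such a regression test guards (the argument list here mirrors the description above rather than the exact `inline_ir` signature, and the `Block.body` layout is assumed):

```python
def check_callee_ir_preserved(worker, caller_ir, block, i, callee_ir, freevars):
    # Fingerprint the incoming IR cheaply: block labels and the number of
    # statements per block (assumes numba's Block.body list layout).
    before = {label: len(blk.body) for label, blk in callee_ir.blocks.items()}
    # Inline with the default preserve_ir=True.
    worker.inline_ir(caller_ir, block, i, callee_ir, freevars)
    after = {label: len(blk.body) for label, blk in callee_ir.blocks.items()}
    assert before == after, "inline_ir mutated the caller-supplied callee_ir"
```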