perf: Reduce copy overhead in inline passes #689
Conversation
Greptile Summary
Confidence score: 3/5
Additional Comments (1)
- numba_cuda/numba/cuda/core/inline_closurecall.py, lines 357-365 (link) — style: Commented-out code should be removed before merging, or a clear decision made about its necessity. Are you planning to remove this commented code block, or is it intentionally left for further testing?
1 file reviewed, 1 comment
Pull request overview
This WIP PR aims to reduce copy overhead in CUDA inline passes by optimizing how the callee_ir object is copied during inlining operations. The changes focus on eliminating redundant copy operations and making copying conditional based on whether the original IR needs to be preserved.
Key Changes
- Reordered logic to save a reference to the original `callee_ir` before copying, eliminating one redundant copy operation (see the sketch after this list)
- Replaced `copy.deepcopy()` with `Block.copy()` for a lighter-weight copying approach
- Added a `preserve_ir` parameter to make copying conditional, with `preserve_ir=False` when calling from `inline_function`, where the `callee_ir` is freshly generated
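As a rough sketch of that reordering (assumed shapes and names for illustration only, not the actual `InlineWorker` code):

```python
import copy

def inline_ir_old(callee_ir):
    # Old scheme: copy the incoming IR to get a mutable version, then
    # copy that copy again to preserve an "original" -- two deep copies.
    callee_ir = copy.deepcopy(callee_ir)
    callee_ir_original = copy.deepcopy(callee_ir)
    return callee_ir, callee_ir_original

def inline_ir_new(callee_ir):
    # New scheme: keep a reference to the unmutated incoming IR first,
    # then make a single copy that is safe to mutate.
    callee_ir_original = callee_ir
    callee_ir = copy.deepcopy(callee_ir)
    return callee_ir, callee_ir_original
```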
Greptile found no issues!
After finding some history on this issue in the core Numba repo, I have removed a replacement of the
PR description
This PR contains changes to the `InlineWorker` called during the `InlineInlinables` and `InlineOverloads` passes. The `inline_ir` method accepts a `callee_ir` `FunctionIR` object, which is deepcopied for safety before being mutated. The deepcopy operation is taxing in kernels with many large nested inlined device functions. Further information, including compile time results, is available in #688.

This PR contains two changes to the safety-copy portion of the `inline_ir` method and its call site in `inline_function`:
1. The `callee_ir` object was copied once to create a mutable version, then that mutable copy was copied again to preserve the original. 4279abd saves a reference to the incoming `callee_ir` on entry, then copies this for the mutable copy, saving one copy operation. This affords a ~40% performance improvement for the same outcome (a toy timing comparison follows this list).
2. In the `InlineInlinables` pass, `inline_ir` is called from the `inline_function` entry point. `inline_function` runs all untyped passes on the function to be inlined, generating a new `callee_ir` object per inline. The preserved unmutated `callee_ir_original` is returned, but never consumed. There is therefore no need to preserve the unmutated `callee_ir` object when entering from `inline_function`. 893d8a8 adds a `preserve_ir=True` argument to `inline_ir` and calls it with `preserve_ir=False`. `inline_ir` is also called from the `InlineOverloads` pass, which does keep and reuse `callee_ir`. This (and all future) entry points fall back to the default `preserve_ir=True` argument, so the provided IR is not mutated.

```diff
  # Copy the IR if it should be preserved.
+ if preserve_ir:
      def copy_ir(the_ir):
          kernel_copy = the_ir.copy()
          kernel_copy.blocks = {}
          for block_label, block in the_ir.blocks.items():
              new_block = copy.deepcopy(the_ir.blocks[block_label])
              kernel_copy.blocks[block_label] = new_block
          return kernel_copy

      callee_ir = copy_ir(callee_ir)
```
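To make the effect of the first change concrete, here is a self-contained toy comparison (plain nested Python structures standing in for a `FunctionIR`; the exact numbers will vary, this only sketches why dropping one deepcopy helps):

```python
import copy
import timeit

# A nested structure standing in for an IR with many blocks.
fake_ir = {label: [{"op": "call", "args": list(range(20))} for _ in range(50)]
           for label in range(100)}

def two_copies(ir):
    # Old scheme: one copy to mutate, a second copy to preserve.
    mutable = copy.deepcopy(ir)
    preserved = copy.deepcopy(mutable)
    return mutable, preserved

def one_copy(ir):
    # New scheme: preserve by reference, copy once for mutation.
    preserved = ir
    mutable = copy.deepcopy(ir)
    return mutable, preserved

print("two copies:", timeit.timeit(lambda: two_copies(fake_ir), number=10))
print("one copy:  ", timeit.timeit(lambda: one_copy(fake_ir), number=10))
```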
Tests

No tests have been added, as I'm not sure what would change other than performance if this behaviour regressed. All tests in the numba-cuda suite pass, as do the regression tests for IR mutation in Numba's test_ir_inlining suite.
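As a loose illustration of the property such a regression test guards (the argument list here mirrors the description above rather than the exact `inline_ir` signature, and the `Block.body` layout is assumed):

```python
def check_callee_ir_preserved(worker, caller_ir, block, i, callee_ir, freevars):
    # Fingerprint the incoming IR cheaply: block labels and the number of
    # statements per block (assumes numba's Block.body list layout).
    before = {label: len(blk.body) for label, blk in callee_ir.blocks.items()}
    # Inline with the default preserve_ir=True.
    worker.inline_ir(caller_ir, block, i, callee_ir, freevars)
    after = {label: len(blk.body) for label, blk in callee_ir.blocks.items()}
    assert before == after, "inline_ir mutated the caller-supplied callee_ir"
```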