
Kernel optimisations #19

Open · wants to merge 15 commits into main
Conversation

dimitrivlachos (Collaborator)

Optimise CUDA kernels with shared and constant memory

Up until now, we have been using a naive memory access pattern in our
CUDA kernels, relying on global memory and the hardware's automatic
caching to store and retrieve data. While this was functional and fairly
performant, we can improve kernel efficiency by using CUDA's explicitly
managed memory spaces.

This refactors all CUDA kernels to:

  • Use shared memory to stage data in smaller, block-local tiles.
  • Use constant memory to hold small, frequently read, read-only data
    in an aggressively cached address space.
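The pattern above can be sketched roughly as follows. This is a hypothetical 1D smoothing kernel, not code from this PR: the filter weights live in constant memory, and each block stages its input slice (plus a halo) in shared memory before computing.

```cuda
#include <cuda_runtime.h>

#define RADIUS 3

// Constant memory: read-only, cached, and broadcast efficiently when
// all threads in a warp read the same address.
__constant__ float c_weights[2 * RADIUS + 1];

__global__ void smooth(const float *in, float *out, int n) {
    // Shared-memory tile: blockDim.x elements plus a halo of RADIUS
    // on each side, sized at launch via the dynamic shared-memory arg.
    extern __shared__ float tile[];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Each thread loads its own element; the first RADIUS threads also
    // load the left and right halos.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left = gid - RADIUS;
        int right = gid + blockDim.x;
        tile[lid - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[lid + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the tile must be fully populated before use

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += c_weights[k + RADIUS] * tile[lid + k];
        out[gid] = acc;
    }
}
```

The host would fill `c_weights` once with `cudaMemcpyToSymbol` and launch with `smooth<<<blocks, threads, (threads + 2 * RADIUS) * sizeof(float)>>>(...)`; each input element is then read from global memory once per block instead of once per use.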

These changes reduce kernel runtime and memory usage, improving
performance and freeing us to focus on other areas of the application.

Create a new common header (device_common) for device-specific
functions, structs, and constants. This improves readability and eases
maintenance for repetitive tasks.
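As an illustration only (the actual contents of device_common in this PR may differ), such a header might collect shared structs, a constant-memory declaration, and small inline device helpers:

```cuda
#pragma once
#include <cuda_runtime.h>

// Hypothetical sketch of a device_common header.

// Image geometry shared by all kernels.
struct ImageShape {
    int width;
    int height;
    int pitch;  // row stride in elements
};

// Declared here, defined once in a single .cu file, so every
// translation unit refers to the same constant-memory symbol.
extern __constant__ ImageShape c_shape;

// Small helpers reused by several kernels.
__device__ __forceinline__ int flat_index(int x, int y) {
    return y * c_shape.pitch + x;
}

__device__ __forceinline__ bool in_bounds(int x, int y) {
    return x >= 0 && x < c_shape.width && y >= 0 && y < c_shape.height;
}
```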

Enable multi-file CUDA compilation to allow global constants to be
shared across files.
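Concretely (a sketch with hypothetical names, not the PR's exact symbols), sharing one `__constant__` symbol across files requires separate compilation of relocatable device code (`nvcc -rdc=true`, or `CUDA_SEPARABLE_COMPILATION ON` in CMake); otherwise each translation unit gets its own private copy of the symbol:

```cuda
// device_common.h — declaration visible to every kernel file:
//   extern __constant__ float c_params[8];

// device_common.cu — the single definition:
__constant__ float c_params[8];

// Host side — populate the symbol once before any kernel launch:
//   cudaMemcpyToSymbol(c_params, host_params, sizeof(host_params));
```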

Closes #12

@dimitrivlachos dimitrivlachos added the enhancement New feature or request label Dec 2, 2024
@dimitrivlachos dimitrivlachos self-assigned this Dec 2, 2024
@dimitrivlachos dimitrivlachos marked this pull request as draft December 3, 2024 11:53
@dimitrivlachos dimitrivlachos force-pushed the kernel_optimisations branch 3 times, most recently from 82573cb to 703db3b Compare December 6, 2024 15:21
@dimitrivlachos dimitrivlachos marked this pull request as ready for review December 6, 2024 15:39
@dimitrivlachos (Collaborator, Author)

Profiling provides the following insights:

Dispersion

Dispersion runs ~15% faster. (profiler screenshot)

However, occupancy has been reduced due to additional register usage. (profiler screenshot)

Extended Dispersion

extended_dispersion runs ~13% faster. (baseline vs. optimised profiler screenshots)

Interestingly, erosion is actually slower with these optimisations, so it may be worth reverting its changes before merging; its runtime is still very short by comparison.

Both the first and second passes benefit substantially from these changes. The first pass suffers the same occupancy drop as one-pass dispersion, but the second pass does not, likely because regular dispersion and the first pass of extended dispersion share a lot of code. Addressing the register pressure there should therefore fix both kernels.
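One common way to recover occupancy lost to register pressure (a general technique, not necessarily what this PR will do) is to cap per-thread register usage with `__launch_bounds__`, accepting some spilling in exchange for more resident blocks:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// asks the compiler to keep register use low enough that at least
// minBlocksPerMultiprocessor blocks of up to maxThreadsPerBlock
// threads can be resident per SM; excess values spill to local memory.
__global__ void __launch_bounds__(256, 4)
dispersion_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // placeholder body; the real kernel is register-heavy
}
```

The coarser alternative is the `nvcc --maxrregcount` flag, which applies one limit to every kernel in the file; `__launch_bounds__` scopes the trade-off to the kernel that needs it.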

Summary

These changes provide a roughly 15% decrease in kernel execution time, now that we are using more optimal memory accesses.

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Inefficient memory access patterns in CUDA kernels