
Kernel optimisations #19

Open · wants to merge 15 commits into main
Conversation

dimitrivlachos (Collaborator)

Optimise CUDA kernels with shared and constant memory

Up until now, we have been using a naive memory access pattern in our
CUDA kernels, relying on global memory and the hardware's automatic
caching to store and retrieve data. While this was functional and fairly
performant, we can improve kernel efficiency by using CUDA's explicitly
managed memory spaces.

This refactors all CUDA kernels to:

  • Use shared memory to stage data in smaller, block-local tiles.
  • Use constant memory to hold small, frequently read, read-only data
    in an aggressively cached address space.
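The pattern above can be sketched roughly as follows. This is a hypothetical 1D smoothing kernel, not code from this PR: the filter weights live in constant memory, and each block stages its input slice (plus a halo) in shared memory before computing.

```cuda
#include <cuda_runtime.h>

#define RADIUS 3

// Constant memory: read-only, cached, and broadcast efficiently when
// all threads in a warp read the same address.
__constant__ float c_weights[2 * RADIUS + 1];

__global__ void smooth(const float *in, float *out, int n) {
    // Shared-memory tile: blockDim.x elements plus a halo of RADIUS
    // on each side, sized at launch via the dynamic shared-memory arg.
    extern __shared__ float tile[];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + RADIUS;

    // Each thread loads its own element; the first RADIUS threads also
    // load the left and right halos.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left = gid - RADIUS;
        int right = gid + blockDim.x;
        tile[lid - RADIUS] = (left >= 0) ? in[left] : 0.0f;
        tile[lid + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the tile must be fully populated before use

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += c_weights[k + RADIUS] * tile[lid + k];
        out[gid] = acc;
    }
}
```

The host would fill `c_weights` once with `cudaMemcpyToSymbol` and launch with `smooth<<<blocks, threads, (threads + 2 * RADIUS) * sizeof(float)>>>(...)`; each input element is then read from global memory once per block instead of once per use.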

These changes reduce kernel runtime and memory usage, improving
performance and freeing us to focus on other areas of the application.

Create a new common header (device_common) for device-specific
functions, structs, and constants. This improves readability and eases
maintenance for repetitive tasks.
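As an illustration only (the actual contents of device_common in this PR may differ), such a header might collect shared structs, a constant-memory declaration, and small inline device helpers:

```cuda
#pragma once
#include <cuda_runtime.h>

// Hypothetical sketch of a device_common header.

// Image geometry shared by all kernels.
struct ImageShape {
    int width;
    int height;
    int pitch;  // row stride in elements
};

// Declared here, defined once in a single .cu file, so every
// translation unit refers to the same constant-memory symbol.
extern __constant__ ImageShape c_shape;

// Small helpers reused by several kernels.
__device__ __forceinline__ int flat_index(int x, int y) {
    return y * c_shape.pitch + x;
}

__device__ __forceinline__ bool in_bounds(int x, int y) {
    return x >= 0 && x < c_shape.width && y >= 0 && y < c_shape.height;
}
```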

Enable multi-file CUDA compilation to allow global constants to be
shared across files.
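Concretely (a sketch with hypothetical names, not the PR's exact symbols), sharing one `__constant__` symbol across files requires separate compilation of relocatable device code (`nvcc -rdc=true`, or `CUDA_SEPARABLE_COMPILATION ON` in CMake); otherwise each translation unit gets its own private copy of the symbol:

```cuda
// device_common.h — declaration visible to every kernel file:
//   extern __constant__ float c_params[8];

// device_common.cu — the single definition:
__constant__ float c_params[8];

// Host side — populate the symbol once before any kernel launch:
//   cudaMemcpyToSymbol(c_params, host_params, sizeof(host_params));
```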

Closes #12

@dimitrivlachos dimitrivlachos added the enhancement New feature or request label Dec 2, 2024
@dimitrivlachos dimitrivlachos self-assigned this Dec 2, 2024
@dimitrivlachos dimitrivlachos marked this pull request as draft December 3, 2024 11:53
@dimitrivlachos dimitrivlachos force-pushed the kernel_optimisations branch 3 times, most recently from 82573cb to 703db3b Compare December 6, 2024 15:21
@dimitrivlachos dimitrivlachos marked this pull request as ready for review December 6, 2024 15:39
@dimitrivlachos (Collaborator, Author)

Profiling provides the following insights:

Dispersion

Dispersion runs ~15% faster. (profiler screenshot)

However, occupancy has been reduced due to additional register usage. (profiler screenshot)

Extended Dispersion

extended_dispersion runs ~13% faster. (baseline vs. optimised profiler screenshots)

Interestingly, erosion is actually slower with these optimisations, so it may be worth reverting its changes before merging; its runtime is still very short by comparison.

Both the first and second passes benefit substantially from these changes. The first pass suffers the same occupancy drop as one-pass dispersion, but the second pass does not, likely because regular dispersion and the first pass of extended dispersion share a lot of code. Addressing the register pressure there should therefore fix both kernels.
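One common way to recover occupancy lost to register pressure (a general technique, not necessarily what this PR will do) is to cap per-thread register usage with `__launch_bounds__`, accepting some spilling in exchange for more resident blocks:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// asks the compiler to keep register use low enough that at least
// minBlocksPerMultiprocessor blocks of up to maxThreadsPerBlock
// threads can be resident per SM; excess values spill to local memory.
__global__ void __launch_bounds__(256, 4)
dispersion_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // placeholder body; the real kernel is register-heavy
}
```

The coarser alternative is the `nvcc --maxrregcount` flag, which applies one limit to every kernel in the file; `__launch_bounds__` scopes the trade-off to the kernel that needs it.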

Summary

These changes provide a roughly 15% decrease in kernel execution time, now that we are using more optimal memory accesses.

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Inefficient memory access patterns in CUDA kernels