This exercise uses GPU specific features in order to gain good GPU performance. While code will run on other hardware, the local memory tiled solution may not give as good performance as it does on a GPU.
In this exercise you will learn how to cache global memory into local memory in tiles according to work-groups in order to compare the performance difference.
In order to get good performance, make sure that global memory accesses are coalesced. Local memory accesses do not have as large a penalty for uncoalesced memeory accesses.
Compile with
For DPC++
icpx -fsycl -fsycl-targets=<triple> -I../../Utilities/include source.cpp
Allocate local memory for the kernel function by creating a local_accessor
.
The local_accessor
is created by specifying a range
which is the number of
elements to allocate.
Note that local memory is allocated per work-group and each work-group can only access its own local memory.
Cache the global memory required for each work-group in local memory by reading
from global accessor
and then writing to the local accessor
.
Then write from the local accessor to the global memory output accessor in a coalesced fashion.
Compare the performance with local memory and without local memory.