Skip to content

Latest commit

 

History

History
144 lines (100 loc) · 5.43 KB

File metadata and controls

144 lines (100 loc) · 5.43 KB

SYCL Academy

Exercise 7: Unified Shared Memory (extension)


In this exercise you will learn:

  • How to allocate memory a region of unified shared memory on a device.
  • How to perform copy and fill operations on unified shared memory regions.
  • How to enqueue SYCL kernels using unified shared memory.

Note the solution file for ComputeCpp is called solution.cpp, and for DPC++ is called solution_dpcpp.cpp currently.


The Intel extension for unified shared memory (USM) adds support for allocating and manipulating regions of unified shared memory in SYCL for devices which support it. This adds a new distinct memory model to SYCL alongside the existing one which uses buffers and accessors.

There are various levels of USM which can be supported by a device, which provide varying levels of functionality. In this exercise we are going to focus on Explicit USM, which allows you to allocate a persistent pointer to a region of USM on the device and copy to and from it from the host application.

However, USM does not provide any implicit data movement or data dependency analysis which you would ordinarily get with accessors, so any copies must be done manually by calling memcpy and dependencies must be manually defined using events.

1.)

As USM is an extension and not core to SYCL 1.2.1 you will have to include additional header file(s) to use it in you application.

If you are using ComputeCpp you must include the following header in files in this order:

#include <SYCL/experimental/usm_wrapper.h>
#include <CL/sycl.hpp>
#include <SYCL/experimental.hpp>

Note: if using ComputeCpp the USM API is available in the cl::sycl::experimental namespace, whereas if you are using DPC++ the USM API is available directly in the cl::sycl namespace.

2.) Check for the USM extension

The first thing you'll need to do when wanting to use USM is to check whether the device you're running on supports it. You can do this can querying the device for USM support using the get_info parameter info::device::usm_device_allocations, which will return true or false.

You could also try using this in a custom device selector to have your queue automatically find a device which supports USM.

3.) Allocate USM memory on the device

Once you have identified that the device supports USM you can go ahead and allocate memory on the device. To do this you call malloc_device, which takes a template parameter specifying the type you are allocating, and then as function parameters the size in elements and the queue you wish to allocate with. Note the queue you specify must be associated with the device you wish to allocate on.

Do this for both inputs and the output.

4.) Copy the inputs

Next you'll need to copy your input data to the USM memory regions you allocated for the inputs.

You can do this by calling the queue member function memcpy, which takes a destination pointer, source pointer, and a size in bytes.

Each of these will return an event, you should store this event to wait on it later.

Do this for both inputs.

5.) Initialize the output

Next you'll need to zero the values at the USM memory region you allocated for the output.

You can do this by calling the queue member function fill, which takes a pointer, a value, and the number of elements.

Again you should store the event returned so you can wait on it later.

5a.) Construct usm_wrapper objects (additional step required for ComputeCpp)

If you are using ComputeCpp you must follow this additional step, otherwise you can ignore it and move on.

You must wrap each of the pointers you have allocated in a usm_wrapper object, you do this by simply constructing a usm_wrapper and passing the pointer to the constructor.

Once you have done this move onto step 5, using these usm_wrapper objects in place of the raw pointers.

6.) Wait on the copy/fill to complete

Before you enqueue the kernel itself you must wait for the memcpy and fill operations to complete.

You can do this by calling depends_on for each of the event objects returned by memcpy and fill inside the command group of the kernel, passing in the handler object. This will create dependencies for the command group to wait on those operations to complete before enqueueing the kernel.

Note: if using ComputeCpp this member function is spelt experimental_depends_on.

7.) Enqueue a kernel

Now that all the memory is allocated, copied and initialized, you can enqueue a kernel to perform the computation.

When using USM there are shortcut member functions for the queue class that enqueue kernel functions directly without having to create a command group, since you don't need to create accessors.

To launch a kernel call the parallel_for member function of the queue and capture the pointers you have allocated (the usm_wrapper objects in the case of ComputeCpp).

In the kernel itself write a vector add operation that takes the value at the index of each input pointers, add them together, and assign the result to the value at the index of the output pointer.

The parallel_for function will also return an event you can use to wait for the kernel to complete before continuing.

8.) Copy the result back

Once the kernel has completed you can copy the result back, similarly to the inputs you can do this by calling the queue member function memcpy, and waiting on the event that is returned.