In the previous labs, you concentrated on extracting parallelism within a kernel using techniques such as pipelining and dataflow. One of the most powerful features of FPGAs is that you can create multiple compute units (CUs), which are identical copies of your kernel, allowing more processing to happen in parallel. These CUs can be used to process multiple images at the same time, or to divide one image into smaller regions so that you can process each frame faster. In this tutorial, you are going to take the latter approach to speed up the computation of each individual frame.
To take advantage of the acceleration potential offered by multiple CUs, the host application needs to be able to issue and manage multiple concurrent requests to the CUs. For maximum performance, it is important to ensure that the application keeps all the CUs busy. Any delay in transferring data or starting a CU will reduce the overall performance.
In this lab, you will first implement changes in the host code to handle multiple CUs, then make updates to the kernel to handle subregions of a frame.
The host application uses OpenCL™ APIs to communicate with kernels on an FPGA. Those commands are executed through a command queue object. By default, the command queue is handled in order; however, you can change this behavior to execute your operations in any order by passing a special flag to the command queue. This type of queue will execute whatever operation is ready to execute as soon as the resources are available.
Out-of-order queues allow you to launch multiple operations at the same time, including memory transfers and kernel calls. You can add dependencies between tasks using OpenCL API events and wait lists. An event is an object associated with a particular task, and it is usually passed into a call as the last argument. If an operation depends on another task, you can pass that task's event into the operation's wait list. The operation then waits for all events in the wait list to finish before executing.
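The wait-list rule can be modeled in a few lines of plain C++. The sketch below is not OpenCL code: the names Op and run_out_of_order are hypothetical, and the scheduler is a deliberately simplified stand-in for the runtime. It only illustrates that an out-of-order queue is free to run any operation whose wait list has completed, regardless of the order in which operations were enqueued.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an enqueued operation; not part of the OpenCL API.
struct Op {
    std::string name;
    std::vector<std::size_t> wait_list;  // indices of ops that must finish first
};

// Returns the op names in the order a simple out-of-order scheduler would run
// them: on each pass it runs every op whose wait list has fully completed.
std::vector<std::string> run_out_of_order(const std::vector<Op>& ops) {
    std::vector<bool> done(ops.size(), false);
    std::vector<std::string> order;
    bool progressed = true;
    while (order.size() < ops.size() && progressed) {
        progressed = false;
        for (std::size_t i = 0; i < ops.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (std::size_t dep : ops[i].wait_list) {
                if (dep >= ops.size() || !done[dep]) { ready = false; break; }
            }
            if (ready) {
                done[i] = true;
                order.push_back(ops[i].name);
                progressed = true;
            }
        }
    }
    return order;  // shorter than ops if some dependency can never be met
}
```

If you enqueue a read that waits on a task, then the task that waits on a write, and finally the write itself, the model still runs them as write, task, read, because readiness is decided by the wait lists rather than by enqueue order.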
TIP: The completed host code source file is provided under the
folder. You can use it as a reference if needed.
To take advantage of the out-of-order queues and events, modify the host program.
Open the host code file from ~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/src/multicu, and change the following line from:
    cl::CommandQueue q(context, device, cl::QueueProperties::Profiling);
to:
    cl::CommandQueue q(context, device, cl::QueueProperties::Profiling | cl::QueueProperties::OutOfOrder);
Passing the cl::QueueProperties::OutOfOrder enum to the CommandQueue constructor tells the runtime that the operations on this queue can be executed out of order.
With an out-of-order queue, you must now enforce ordering between the write, enqueueTask, and read calls to make sure that the buffer is not read before the copy operation has completed. Create a cl::Event object, and pass it as the last argument of the enqueueWriteBuffer function. Change line 95 from:
    q.enqueueWriteBuffer(buffer_input, CL_FALSE, 0, frame_bytes, ...);
to:
    cl::Event write_event;
    q.enqueueWriteBuffer(buffer_input, CL_FALSE, 0, frame_bytes, ..., nullptr, &write_event);
The write_event object will be used to make the next task wait for this operation.
You need to pass the write event to the enqueueTask call. You must also create an event object for the task to pass to the read operation. The write_event object from the previous call must be passed into this call through a pointer to a vector. Modify the enqueueTask call in line 96 as follows.
    vector<cl::Event> iteration_events{write_event};
    cl::Event task_event;
    q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
The read call needs to be executed after the kernel task has finished executing. Just like in the previous operations, you can also pass an event as the last argument of this function. Modify the enqueueReadBuffer call in line 97 as follows.
    iteration_events.push_back(task_event);
    cl::Event read_event;
    q.enqueueReadBuffer(buffer_output, CL_FALSE, 0, frame_bytes, ..., &iteration_events, &read_event);
    iteration_events.push_back(read_event);
Here you added the task_event object to the end of the iteration_events vector. Then, you passed iteration_events in as the second-to-last argument to the enqueueReadBuffer call. You could also have created a new vector that holds only task_event, because the enqueueTask call already depends on the previous call.
You need to make sure that the host does not write to the output stream before the data has been transferred back to the host. You can block the thread from continuing by calling the wait call on the read_event object. Add this line after the push_back function call on the iteration_events vector.
    read_event.wait();
In previous labs, only one CU was used for the kernel. In this section, you will modify the design to use multiple CUs, and each CU will process a smaller region of the image. To achieve that, you are going to make further modifications based on the output from the previous step.
TIP: The completed kernel source file is provided under the
folder. You can use it as a reference if needed.
Here you are going to modify the kernel code. Open the convolve_fpga.cpp file from ~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/src/multicu, and make the following modifications:
Modify the signature of the convolve_fpga kernel to accept the offset and number of lines each kernel will process (line 106).
    void convolve_fpga(const RGBPixel* inFrame, RGBPixel* outFrame,
                       const float* coefficient, int coefficient_size,
                       int img_width, int img_height,
                       int line_offset, int num_lines) {
    ...
Depending on the image size and the number of CUs, you will divide the work evenly, and the offset will be used to determine the starting location of each CU. The line_offset parameter is the first line that the CU will process. The num_lines argument will hold the number of lines processed by each CU.
TIP: Ensure that the declaration of the convolve_fpga function in kernels.h matches the definition in the convolve_fpga.cpp file.
Modify the main kernel, so that you can calculate the padding and offsets for each of the CUs to process (line 123).
    int half = COEFFICIENT_SIZE / 2;

    hls::stream<RGBPixel> read_stream("read");
    hls::stream<RGBPixel> write_stream("write");
    int elements = img_width * num_lines;
    int offset = std::max(0, line_offset - half) * img_width;
    int top_padding = 0;
    int bottom_padding = 0;
    int padding = 0;
    if(line_offset == 0) {
        top_padding = half * img_width;
    } else {
        padding = img_width * half;
    }
    if(line_offset + num_lines < img_height) {
        padding += img_width * half + COEFFICIENT_SIZE;
    } else {
        bottom_padding = img_width * (half) + COEFFICIENT_SIZE;
    }
    #pragma HLS dataflow
    read_dataflow(read_stream, inFrame + offset, img_width, elements + padding, half, top_padding, bottom_padding);
    compute_dataflow(write_stream, read_stream, coefficient, img_width, elements, half);
    write_dataflow(outFrame + line_offset * img_width, write_stream, elements);
- The offset variable is the distance, in pixels, from the beginning of the image to the first pixel that the CU will read.
- The top_padding and bottom_padding variables determine the number of zeros to add at the top and the bottom of the image.
- The padding variable, on the other hand, is the number of additional pixels to read to cover the regions around the convolution window.
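To see what these variables evaluate to for a given region, you can run the same arithmetic on the host. The following is a minimal sketch, not part of the lab sources: ReadPlan and plan_region are hypothetical names, and COEFFICIENT_SIZE is assumed to be 3 purely for illustration.

```cpp
#include <algorithm>

// Assumption: the lab's real COEFFICIENT_SIZE is not shown here; 3 is illustrative.
constexpr int COEFFICIENT_SIZE = 3;

// Hypothetical container for the values the kernel passes to read_dataflow.
struct ReadPlan {
    int offset;          // index of the first pixel read from inFrame
    int top_padding;     // zero pixels streamed before the real data
    int bottom_padding;  // zero pixels streamed after the real data
    int pixels_to_read;  // elements + padding
};

// Mirrors the padding and offset arithmetic of the kernel body for one CU.
ReadPlan plan_region(int img_width, int img_height, int line_offset, int num_lines) {
    int half = COEFFICIENT_SIZE / 2;
    int elements = img_width * num_lines;
    int padding = 0;
    ReadPlan plan{};  // all fields start at zero
    plan.offset = std::max(0, line_offset - half) * img_width;
    if (line_offset == 0) {
        plan.top_padding = half * img_width;   // no lines above: stream zeros
    } else {
        padding = img_width * half;            // read the lines above instead
    }
    if (line_offset + num_lines < img_height) {
        padding += img_width * half + COEFFICIENT_SIZE;  // read the lines below
    } else {
        plan.bottom_padding = img_width * half + COEFFICIENT_SIZE;
    }
    plan.pixels_to_read = elements + padding;
    return plan;
}
```

Under these assumptions, a 512x40 frame split into four 10-line regions gives 5635 pixels to read for the first region, 6147 for each middle region, and 5632 for the last. If pixels are 4 bytes each, which the 20 KB written for 5120 pixels suggests, 6147 pixels is about 24.012 KB, in line with the larger read figures in the emulation report later in this lab.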
Modify the read_dataflow kernel to send zeros for the padding areas for the top and the bottom of the image (line 20).
    void read_dataflow(hls::stream<RGBPixel>& read_stream, const RGBPixel* in,
                       int img_width, int elements, int half,
                       int top_padding, int bottom_padding) {
        while(top_padding--) {
            read_stream << zero;
        }
        int pixel = 0;
        while(elements--) {
            read_stream << in[pixel++];
        }
        while(bottom_padding--) {
            read_stream << zero;
        }
    }
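If you want to check the zero-padding order on the host before running HLS, the same logic can be modeled in plain C++. This is only a sketch under stated assumptions: read_model is a hypothetical name, std::vector stands in for hls::stream, and int stands in for RGBPixel.

```cpp
#include <vector>

// Software model only: returns the sequence of values that would be pushed
// into the stream, in order. Not synthesizable and not part of the lab code.
std::vector<int> read_model(const std::vector<int>& in, int elements,
                            int top_padding, int bottom_padding) {
    const int zero = 0;
    std::vector<int> stream;
    // Zeros for the missing lines above the first region.
    while (top_padding-- > 0) {
        stream.push_back(zero);
    }
    // Real pixels; the bounds check keeps the model safe on short inputs.
    for (int pixel = 0; pixel < elements && pixel < static_cast<int>(in.size()); ++pixel) {
        stream.push_back(in[pixel]);
    }
    // Zeros for the missing lines below the last region.
    while (bottom_padding-- > 0) {
        stream.push_back(zero);
    }
    return stream;
}
```

The model uses `> 0` in the loop conditions so that a negative count terminates immediately; the kernel version relies on the counts never being negative.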
Because you are handling the padding logic in the read_dataflow module, you can remove the initialization logic for zeroing out the padded area. Remove the following lines from compute_dataflow (line 45).
    while(line_idx < center) {
        for(int i = 0; i < img_width; i++) {
            window_mem[line_idx][i] = zero;
        }
        line_idx++;
    }
You still need to modify a few things on the host code side to launch multiple CUs in parallel.
The following steps are needed to support multiple CUs.
Open the host code file, and add the following lines before the frame_count for loop:
    int compute_units = 4;
    int lines_per_compute_unit = height / compute_units;
These variables define the number of CUs you will have in your binary. You then divide the lines of the image evenly between all of the CUs. This code assumes that you can evenly divide the image among the CUs.
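If the image height is not a multiple of the number of CUs, the simple division above drops the remainder lines. One possible way to handle that case is sketched below. Region and split_lines are hypothetical names that do not appear in the lab sources, and the lab kernel itself has not been checked here against uneven region sizes.

```cpp
#include <vector>

// Hypothetical helper type, not part of the lab sources.
struct Region {
    int line_offset;  // first image line handled by this CU
    int num_lines;    // number of lines handled by this CU
};

// Splits `height` lines across `compute_units` CUs. When the division is not
// exact, the first (height % compute_units) CUs each take one extra line.
std::vector<Region> split_lines(int height, int compute_units) {
    std::vector<Region> regions;
    if (height <= 0 || compute_units <= 0) {
        return regions;  // nothing sensible to split
    }
    int base = height / compute_units;
    int extra = height % compute_units;
    int offset = 0;
    for (int cu = 0; cu < compute_units; ++cu) {
        int lines = base + (cu < extra ? 1 : 0);
        regions.push_back({offset, lines});
        offset += lines;
    }
    return regions;
}
```

For a 40-line image and four CUs, this yields the same offsets as the lab code (0, 10, 20, 30). For a 10-line image, it yields regions of 3, 3, 2, and 2 lines, so no line is left unprocessed.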
Instead of launching one task, launch a task on each of the CUs you created. Modify the following code from:
    cl::Event task_event;
    q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
to:
    vector<cl::Event> task_events;
    for(int cu = 0; cu < compute_units; cu++) {
        cl::Event task_event;
        convolve_kernel.setArg(6, cu * lines_per_compute_unit);
        convolve_kernel.setArg(7, lines_per_compute_unit);
        q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
        task_events.push_back(task_event);
    }
    copy(begin(task_events), end(task_events), std::back_inserter(iteration_events));
The for loop launches one task per CU. You pass an event object to each of the tasks, and then add it to the task_events vector. Notice that you do not add these events to the iteration_events vector until after the end of the loop. This is because you want the tasks to depend only on the enqueueWriteBuffer call and not on each other.
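The effect of appending the task events after the loop, rather than inside it, can be shown with a small model. This is a sketch, not OpenCL code: events are plain integers (0 for the write, 1 to N for the CU tasks), and WaitLists and build_wait_lists are hypothetical names.

```cpp
#include <vector>

// Hypothetical result type: the wait list each task was enqueued with, plus
// the wait list that the following read operation would see.
struct WaitLists {
    std::vector<std::vector<int>> task_waits;
    std::vector<int> read_waits;
};

// Models the host loop. With append_inside_loop == false it follows the lab
// code; with true it shows what happens if each task event is appended to
// iteration_events immediately.
WaitLists build_wait_lists(int compute_units, bool append_inside_loop) {
    std::vector<int> iteration_events{0};  // starts with the write event only
    std::vector<int> task_events;
    WaitLists result;
    for (int cu = 0; cu < compute_units; ++cu) {
        result.task_waits.push_back(iteration_events);  // snapshot at enqueue time
        int task_event = cu + 1;
        if (append_inside_loop) {
            iteration_events.push_back(task_event);
        } else {
            task_events.push_back(task_event);
        }
    }
    iteration_events.insert(iteration_events.end(),
                            task_events.begin(), task_events.end());
    result.read_waits = iteration_events;
    return result;
}
```

With the events appended after the loop, every task waits only on event 0 and all four can start together. With the events appended inside the loop, the fourth task would wait on events 0, 1, 2, and 3, which serializes the CUs. In both cases the read waits on the write and on all four tasks.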
Now you can compile and run the design, and you should see results similar to the results below.
Before running emulation, look at the Makefile again, and pay attention to line 150.
XOCCFLAGS += --nk convolve_fpga:$(CU_NUM)
Here the xocc --nk option is used to specify the number of kernel instances, or CUs, generated during the linking step of the build process. For this lab, CU_NUM is defined as 4.
Go to the makefile directory:
    ~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/makefile
Use the following command to run hardware emulation.
make run TARGET=hw_emu STEP=multicu SOLUTION=1 NUM_FRAMES=1
Here are the results of this kernel, running on four CUs.
Processed 0.08 MB in 42.810s (0.00 MBps)
INFO: [SDx-EM 22] [Wall clock time: 01:34, Emulation time: 0.102462 ms] Data transfer between kernel(s) and global memory(s)
convolve_fpga_1:m_axi_gmem1-DDR[0] RD = 24.012 KB WR = 0.000 KB
convolve_fpga_1:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_1:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_2:m_axi_gmem1-DDR[0] RD = 22.012 KB WR = 0.000 KB
convolve_fpga_2:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_2:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_3:m_axi_gmem1-DDR[0] RD = 24.012 KB WR = 0.000 KB
convolve_fpga_3:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_3:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
convolve_fpga_4:m_axi_gmem1-DDR[0] RD = 22.000 KB WR = 0.000 KB
convolve_fpga_4:m_axi_gmem2-DDR[0] RD = 0.000 KB WR = 20.000 KB
convolve_fpga_4:m_axi_gmem3-DDR[0] RD = 0.035 KB WR = 0.000 KB
You can now perform four times more work in about the same amount of time. More data is transferred from global memory because each CU also needs to read the padding lines that surround its region.
- Use the following command to generate the Profile Summary report and Timeline Trace report.
make gen_report TARGET=hw_emu STEP=multicu
- Use the following command to view the Profile Summary report.
make view_prof_report TARGET=hw_emu STEP=multicu
The following figure shows the Profile Summary report. The kernel execution time for four CUs is around 0.067 ms each.
Here is the updated table.
Step | Image Size | Time (HW-EM)(ms) | Reads (KB) | Writes (KB) | Avg. Read (KB) | Avg. Write (KB) | BW (MBps) |
baseline | 512x10 | 10.807 | 344 | 20.0 | 0.004 | 0.004 | 1.9 |
localbuf | 512x10 | 1.969 (5.48x) | 21 (0.12x) | 20.0 | 0.064 | 0.064 | 10 |
fixed-type data | 512x10 | 0.46 (4.2x) | 21 | 20.0 | 0.064 | 0.064 | 44 |
dataflow | 512x10 | 0.057 (8x) | 21 | 20.0 | 0.064 | 0.064 | 360 |
multi-CU | 512x40* | 0.067 (0.85x) | 92 (4.3x) | 80.0 (4x) | 0.064 | 0.064 | 1222** |
* The multi-CU version processed four times as much data as the previous versions. Even if each CU's execution time does not change, four parallel CUs increase the system performance by almost four times.
** This is calculated as (4 x data) / time. The data transfer time is not accounted for, and the four CUs are assumed to execute in parallel. This is not as accurate as a hardware run, but it serves as a reference for the effectiveness of the optimizations.
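The bandwidth column can be reproduced with a short calculation. The sketch below assumes that 1 KB is 1024 bytes and 1 MB is 10^6 bytes, a convention inferred from the table values rather than stated in the lab; bandwidth_mbps is a hypothetical helper name.

```cpp
// Bandwidth in MBps for `kilobytes` of data processed in `milliseconds`.
// Assumed units: 1 KB = 1024 bytes, 1 MB = 1,000,000 bytes.
double bandwidth_mbps(double kilobytes, double milliseconds) {
    if (milliseconds <= 0.0) {
        return 0.0;  // avoid dividing by zero on a bad measurement
    }
    double bytes = kilobytes * 1024.0;
    double seconds = milliseconds / 1000.0;
    return bytes / seconds / 1.0e6;
}
```

With these assumptions, 80 KB (four times 20 KB) in 0.067 ms comes out at roughly 1222 MBps, and 20 KB in 0.057 ms comes out at roughly 359 MBps, which matches the multi-CU and dataflow rows after rounding.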
In this step, you performed host code optimizations by using an out-of-order command queue and executing multiple CUs. In the next step, you will have the application run the accelerator in hardware.
Copyright© 2019 Xilinx