In the previous labs, you concentrated on extracting parallelism within a kernel using techniques such as pipelining and dataflow. One of the most powerful features of FPGAs is that you can create multiple compute units (CUs), which are identical copies of your kernel, allowing more processing to happen in parallel. These CUs can be used to process multiple images at the same time, or to divide one image into smaller regions so that each frame is processed faster. In this tutorial, you are going to take the latter approach to speed up the computation of each individual frame.
To take advantage of the acceleration potential offered by multiple CUs, the host application needs to issue and manage multiple concurrent requests to the CUs. For maximum performance, it is important that the application keeps all the CUs busy: any delay in transferring data or starting a CU reduces the overall performance.
In this lab, you will first implement changes in the host code to handle multiple CUs, then make updates to the kernel to handle subregions of a frame.
The host application uses OpenCL™ APIs to communicate with kernels on an FPGA. These API calls are executed through a command queue object. By default, a command queue executes its operations in order; you can change this behavior by passing a special flag to the command queue constructor. An out-of-order queue executes whatever operation is ready as soon as the required resources are available.

Out-of-order queues let you launch multiple operations at the same time, including memory transfers and kernel calls. You can express dependencies between tasks using OpenCL API events and wait lists. An event is an object associated with a particular task; it is usually passed into a call as the last argument. If an operation depends on other tasks, you pass their events in a wait list, and the operation waits until every event in the wait list has finished before executing.
TIP: The completed host code source file is provided under the `~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/reference-files/multicu` folder. You can use it as a reference if needed.
To take advantage of the out-of-order queues and events, modify the host program.
- Open the `convolve.cpp` file from `~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/src/multicu`, and modify the following line:

  ```cpp
  cl::CommandQueue q(context, device, cl::QueueProperties::Profiling);
  ```

  to:

  ```cpp
  cl::CommandQueue q(context, device, cl::QueueProperties::Profiling | cl::QueueProperties::OutOfOrder);
  ```

  Passing the `cl::QueueProperties::OutOfOrder` enum to the `CommandQueue` constructor tells the runtime that the operations on this queue can be executed out of order.
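  For reference, an equivalent queue can be created with the plain OpenCL C API by OR-ing the corresponding property flags (a sketch; this lab itself uses the C++ wrapper shown above):

  ```cpp
  // OpenCL C API equivalent of the profiling-enabled, out-of-order queue.
  cl_int err;
  cl_command_queue queue = clCreateCommandQueue(
      context(), device(),  // unwrap the cl::Context and cl::Device handles
      CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
      &err);
  ```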
- With an out-of-order queue, you must now enforce ordering between the write, `enqueueTask`, and read calls to make sure that, for example, the buffer is not read before the kernel has finished writing it. You will create a `cl::Event` object and pass it as the last argument of the `enqueueWriteBuffer` function. Change line 95 from:

  ```cpp
  q.enqueueWriteBuffer(buffer_input, CL_FALSE, 0, frame_bytes, inFrame.data());
  ```

  to:

  ```cpp
  cl::Event write_event;
  q.enqueueWriteBuffer(buffer_input, CL_FALSE, 0, frame_bytes, inFrame.data(), nullptr, &write_event);
  ```

  The `write_event` object will be used to enforce this operation's dependency on the next task.
- You need to pass `write_event` to the `enqueueTask` call. You must also create an event object for the task to pass to the read operation. The `write_event` object from the previous call is passed into this call through a pointer to a vector of events. Modify the `enqueueTask` call in line 96 as follows:

  ```cpp
  vector<cl::Event> iteration_events{write_event};
  cl::Event task_event;
  q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
  ```
- The read call needs to execute after `convolve_kernel` has finished. Just like in the previous operations, you can pass an event as the last argument of this function. Modify the `enqueueReadBuffer` call in line 97 as follows:

  ```cpp
  iteration_events.push_back(task_event);
  cl::Event read_event;
  q.enqueueReadBuffer(buffer_output, CL_FALSE, 0, frame_bytes, outFrame.data(), &iteration_events, &read_event);
  iteration_events.push_back(read_event);
  ```

  Here you add the `task_event` object to the end of the `iteration_events` vector, then pass `iteration_events` as the second-to-last argument of the `enqueueReadBuffer` call. You could also have created a new vector containing only `task_event`, because the `enqueueTask` call already depends on the write operation.
- You need to make sure that `ffmpeg` does not write to the output stream before the data has been transferred back to the host. You can block the thread from continuing by calling `wait` on the `read_event` object. Add this line after the `push_back` call on the `iteration_events` object:

  ```cpp
  read_event.wait();
  ```
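Putting the pieces together, the body of the per-frame loop now looks roughly like this (a sketch assembled from the edits above; the surrounding `frame_count` loop and the frame I/O are assumed):

```cpp
cl::Event write_event;
q.enqueueWriteBuffer(buffer_input, CL_FALSE, 0, frame_bytes,
                     inFrame.data(), nullptr, &write_event);

// The task waits on the write; the read waits on the task.
vector<cl::Event> iteration_events{write_event};
cl::Event task_event;
q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
iteration_events.push_back(task_event);

cl::Event read_event;
q.enqueueReadBuffer(buffer_output, CL_FALSE, 0, frame_bytes,
                    outFrame.data(), &iteration_events, &read_event);
iteration_events.push_back(read_event);

// Block until this frame's result is back on the host.
read_event.wait();
```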
In the previous labs, only one CU was used for the kernel. In this section, you will modify the design to use multiple CUs, each processing a smaller region of the image. To achieve that, you will make further modifications on top of the output from the previous step.
TIP: The completed kernel source file is provided under the `~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/reference-files/multicu` folder. You can use it as a reference if needed.
Here you are going to modify the kernel code. Open the `convolve_fpga.cpp` file from `~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/src/multicu`, and make the following modifications:
- Modify the signature of the `convolve_fpga` kernel to accept the offset and the number of lines each CU will process (line 106):

  ```cpp
  void convolve_fpga(const RGBPixel* inFrame, RGBPixel* outFrame,
                     const float* coefficient, int coefficient_size,
                     int img_width, int img_height,
                     int line_offset, int num_lines) {
  ...
  ```

  Depending on the image size and the number of CUs, the work is divided evenly, and the offset determines the starting location for each kernel. The `line_offset` parameter is the first line that the CU will process, and the `num_lines` argument holds the number of lines processed by each CU.

  TIP: Ensure the declaration of the `convolve_fpga` function in `kernels.h` matches the `convolve_fpga.cpp` file, as in the sketch below.
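  A matching declaration in `kernels.h` would look like this (a sketch; the header's other contents are unchanged):

  ```cpp
  // kernels.h: prototype updated to match convolve_fpga.cpp
  void convolve_fpga(const RGBPixel* inFrame, RGBPixel* outFrame,
                     const float* coefficient, int coefficient_size,
                     int img_width, int img_height,
                     int line_offset, int num_lines);
  ```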
- Modify the main kernel so that it calculates the padding and offsets for each of the CUs (line 123):

  ```cpp
  int half = COEFFICIENT_SIZE / 2;
  hls::stream<RGBPixel> read_stream("read");
  hls::stream<RGBPixel> write_stream("write");
  int elements = img_width * num_lines;
  int offset = std::max(0, line_offset - half) * img_width;
  int top_padding = 0;
  int bottom_padding = 0;
  int padding = 0;
  if(line_offset == 0) {
      top_padding = half * img_width;
  } else {
      padding = img_width * half;
  }
  if(line_offset + num_lines < img_height) {
      padding += img_width * half + COEFFICIENT_SIZE;
  } else {
      bottom_padding = img_width * half + COEFFICIENT_SIZE;
  }
  #pragma HLS dataflow
  read_dataflow(read_stream, inFrame + offset, img_width,
                elements + padding, half, top_padding, bottom_padding);
  compute_dataflow(write_stream, read_stream, coefficient,
                   img_width, elements, half);
  write_dataflow(outFrame + line_offset * img_width, write_stream, elements);
  ```
  - The `offset` variable is the offset from the beginning of the image to the first pixel that the CU will read.
  - The `top_padding` and `bottom_padding` variables determine the padding of zeros to add at the top and the bottom of the image.
  - The `padding` variable, on the other hand, is the number of extra pixels to read to cover the regions around the convolution window.
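  To make the arithmetic concrete, the following standalone sketch reproduces the padding calculation on the host for a 512x40 frame split across four CUs. `COEFFICIENT_SIZE = 3` is an assumption for illustration only; substitute the value used in the design:

  ```cpp
  #include <algorithm>
  #include <cstdio>

  // Standalone sketch of the kernel's padding math above, printed per CU.
  // Assumptions: 512x40 frame, 4 CUs, COEFFICIENT_SIZE = 3 (illustrative).
  int main() {
      const int COEFFICIENT_SIZE = 3;
      const int img_width = 512, img_height = 40;
      const int num_lines = 10;                  // img_height / 4 CUs
      const int half = COEFFICIENT_SIZE / 2;
      for (int cu = 0; cu < 4; cu++) {
          int line_offset = cu * num_lines;
          int offset = std::max(0, line_offset - half) * img_width;
          int top_padding = 0, bottom_padding = 0, padding = 0;
          if (line_offset == 0)
              top_padding = half * img_width;    // first CU: zeros above the image
          else
              padding = img_width * half;        // middle CUs: read the lines above
          if (line_offset + num_lines < img_height)
              padding += img_width * half + COEFFICIENT_SIZE;       // read lines below
          else
              bottom_padding = img_width * half + COEFFICIENT_SIZE; // last CU: zeros below
          printf("CU%d: offset=%6d top=%4d bottom=%4d extra reads=%5d\n",
                 cu, offset, top_padding, bottom_padding, padding);
      }
      return 0;
  }
  ```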
- Modify the `read_dataflow` function to send zeros for the padding areas at the top and the bottom of the image (line 20):

  ```cpp
  void read_dataflow(hls::stream<RGBPixel>& read_stream, const RGBPixel * in,
                     int img_width, int elements, int half,
                     int top_padding, int bottom_padding) {
      while(top_padding--) {      // zeros above the first image line
          read_stream << zero;
      }
      int pixel = 0;
      while(elements--) {         // the CU's lines plus the surrounding padding reads
          read_stream << in[pixel++];
      }
      while(bottom_padding--) {   // zeros below the last image line
          read_stream << zero;
      }
  }
  ```
- Because the padding logic is now handled in the `read_dataflow` module, you can remove the initialization logic that zeroed out the padded area. Remove the following lines from `compute_dataflow` (line 45):

  ```cpp
  while(line_idx < center) {
      for(int i = 0; i < img_width; i++) {
          window_mem[line_idx][i] = zero;
      }
      line_idx++;
  }
  ```
You still need to modify a few things on the host code side to launch multiple CUs in parallel. Perform the following steps:
- Open `convolve.cpp`, and add the following lines before the `frame_count` for loop:

  ```cpp
  int compute_units = 4;
  int lines_per_compute_unit = height / compute_units;
  ```
  These variables define the number of CUs in your binary; the lines of the image are then divided evenly between all of the CUs. This code assumes that the image height divides evenly among the CUs; a sketch of one way to handle a remainder follows.
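  If the height is not a multiple of the number of CUs, a simple fix (a sketch, not part of the lab code) is to give the last CU the leftover lines:

  ```cpp
  // Sketch: uneven split; the last CU picks up the remainder lines.
  int compute_units = 4;
  int lines_per_compute_unit = height / compute_units;
  for (int cu = 0; cu < compute_units; cu++) {
      int line_offset = cu * lines_per_compute_unit;
      int num_lines = (cu == compute_units - 1)
                          ? height - line_offset      // last CU takes the rest
                          : lines_per_compute_unit;
      // convolve_kernel.setArg(6, line_offset);
      // convolve_kernel.setArg(7, num_lines);
  }
  ```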
- Instead of launching one task, launch a task on each of the CUs. Modify the following code from:

  ```cpp
  cl::Event task_event;
  q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
  ```

  to:

  ```cpp
  vector<cl::Event> task_events;
  for(int cu = 0; cu < compute_units; cu++) {
      cl::Event task_event;
      convolve_kernel.setArg(6, cu * lines_per_compute_unit);
      convolve_kernel.setArg(7, lines_per_compute_unit);
      q.enqueueTask(convolve_kernel, &iteration_events, &task_event);
      task_events.push_back(task_event);
  }
  copy(begin(task_events), end(task_events), std::back_inserter(iteration_events));
  ```
  This `for` loop launches one task per CU. Argument indices 6 and 7 correspond to the `line_offset` and `num_lines` parameters of the modified kernel signature. Each task gets its own event object, which is added to the `task_events` vector. Notice that the task events are not added to `iteration_events` until after the loop; this is because you only want the tasks to depend on the `enqueueWriteBuffer` call, not on each other.
Now you can compile and run the design; you should see results similar to those below.
- Before running emulation, look at the Makefile again, and pay attention to line 150:

  ```makefile
  XOCCFLAGS += --nk convolve_fpga:$(CU_NUM)
  ```

  Here the `xocc --nk` option is used to specify the number of kernel instances, or CUs, generated during the linking step of the build process. For this lab, `CU_NUM` is defined as 4.
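  The `--nk` option also accepts an optional period-separated list of explicit instance names (a sketch with made-up names; the lab keeps the default `convolve_fpga_1` through `convolve_fpga_4` names visible in the emulation output below):

  ```makefile
  # Optional: name the four CU instances yourself instead of using the defaults.
  XOCCFLAGS += --nk convolve_fpga:4:conv_a.conv_b.conv_c.conv_d
  ```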
- Go to the `makefile` directory:

  ```sh
  cd ~/SDAccel-AWS-F1-Developer-Labs/modules/module_03/design/makefile
  ```
- Use the following command to run hardware emulation:

  ```sh
  make run TARGET=hw_emu STEP=multicu SOLUTION=1 NUM_FRAMES=1
  ```
Here are the results of this kernel, running on four CUs:

```
Processed 0.08 MB in 42.810s (0.00 MBps)

INFO: [SDx-EM 22] [Wall clock time: 01:34, Emulation time: 0.102462 ms] Data transfer between kernel(s) and global memory(s)
convolve_fpga_1:m_axi_gmem1-DDR[0]   RD = 24.012 KB   WR = 0.000 KB
convolve_fpga_1:m_axi_gmem2-DDR[0]   RD = 0.000 KB    WR = 20.000 KB
convolve_fpga_1:m_axi_gmem3-DDR[0]   RD = 0.035 KB    WR = 0.000 KB
convolve_fpga_2:m_axi_gmem1-DDR[0]   RD = 22.012 KB   WR = 0.000 KB
convolve_fpga_2:m_axi_gmem2-DDR[0]   RD = 0.000 KB    WR = 20.000 KB
convolve_fpga_2:m_axi_gmem3-DDR[0]   RD = 0.035 KB    WR = 0.000 KB
convolve_fpga_3:m_axi_gmem1-DDR[0]   RD = 24.012 KB   WR = 0.000 KB
convolve_fpga_3:m_axi_gmem2-DDR[0]   RD = 0.000 KB    WR = 20.000 KB
convolve_fpga_3:m_axi_gmem3-DDR[0]   RD = 0.035 KB    WR = 0.000 KB
convolve_fpga_4:m_axi_gmem1-DDR[0]   RD = 22.000 KB   WR = 0.000 KB
convolve_fpga_4:m_axi_gmem2-DDR[0]   RD = 0.000 KB    WR = 20.000 KB
convolve_fpga_4:m_axi_gmem3-DDR[0]   RD = 0.035 KB    WR = 0.000 KB
```
You can now perform four times more work in about the same amount of time. More data is transferred from global memory, but that is because each CU also reads the padding lines surrounding its region, as the quick check below shows.
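As a quick sanity check of these figures (assuming a 4-byte `RGBPixel`, which is consistent with the numbers above but not stated in the lab):

```
per-CU write: 512 pixels/line x 10 lines x 4 B/pixel = 20,480 B = 20.000 KB
per-CU read : the same 10 lines plus the half-window padding rows around
              the CU's region, which accounts for the extra ~2-4 KB of
              reads reported for each CU
```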
- Use the following command to generate the Profile Summary report and Timeline Trace report:

  ```sh
  make gen_report TARGET=hw_emu STEP=multicu
  ```

- Use the following command to view the Profile Summary report:

  ```sh
  make view_prof_report TARGET=hw_emu STEP=multicu
  ```
The following figure shows the Profile Summary report. The kernel execution time for four CUs is around 0.067 ms each.
Here is the updated table.
| Step | Image Size | Time (HW-EM) (ms) | Reads (KB) | Writes (KB) | Avg. Read (KB) | Avg. Write (KB) | BW (MBps) |
|------|------------|-------------------|------------|-------------|----------------|-----------------|-----------|
| baseline | 512x10 | 10.807 | 344 | 20.0 | 0.004 | 0.004 | 1.9 |
| localbuf | 512x10 | 1.969 (5.48x) | 21 (0.12x) | 20.0 | 0.064 | 0.064 | 10 |
| fixed-type data | 512x10 | 0.46 (4.2x) | 21 | 20.0 | 0.064 | 0.064 | 44 |
| dataflow | 512x10 | 0.057 (8x) | 21 | 20.0 | 0.064 | 0.064 | 360 |
| multi-CU | 512x40* | 0.067 (0.85x) | 92 (4.3x) | 80.0 (4x) | 0.064 | 0.064 | 1222* |
NOTE:

- The multi-CU version processes four times the data of the previous versions (hence the 512x40 image size). Even though each CU's execution time does not change, four parallel CUs increase the system performance by almost four times.
- The starred bandwidth is calculated as 4x the data divided by the kernel time. The data transfer time is not accounted for, and the four CUs are assumed to execute in parallel. This is not as accurate as a hardware run, but it serves as a reference for gauging the effectiveness of the optimizations.
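Applying that formula (again assuming 4-byte pixels, as above):

```
data per frame = 4 CUs x 512 x 10 pixels x 4 B = 81,920 B ≈ 0.082 MB
bandwidth      = 0.082 MB / 0.067 ms ≈ 1222 MBps
```

The same formula applied to the dataflow row (20,480 B / 0.057 ms ≈ 360 MBps) also matches the table.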
In this step, you performed host code optimizations by using an out-of-order command queue and executing multiple CUs. In the next step, you will run the accelerated application in hardware.
Copyright © 2019 Xilinx