
Permissibility of Internal Queues in Command Buffers #800

Open
FreddieWitherden opened this issue May 16, 2022 · 4 comments
Labels
cl_khr_command_buffer Relating to the command-buffer family of extensions

Comments

@FreddieWitherden

One nice aspect of command buffers is that they can help facilitate the simultaneous execution of kernels. Although this is theoretically possible with out-of-order queues, it tends to work poorly in practice, with vendor support for such queues being sub-par.

Given this, a natural means of implementing command buffers is to scan the resulting graph to see how much parallelism is available. A number of internal (in-order) queues can then be created and, upon execution, kernels sent to these queues in a prescribed order. (This is how NVIDIA and AMD both implement their graph APIs under the hood, with the execution graph having some number of internal streams attached to it. The AMD implementation is open source, with https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/src/hip_graph_internal.cpp#L771 being a good starting point.) It is thus important that such a strategy can also be employed with command buffers.

A consequence, however, is that not all of the kernels in the buffer will go to the queue(s) specified in clEnqueueCommandBufferKHR. Additionally, this approach may have some interactions with future multi-queue applications of command buffers.
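
As an illustration of the strategy described above, the following is a minimal sketch (plain C, not taken from any vendor runtime) of the core scheduling step: compute each command's depth in the dependency DAG, then assign the commands at each depth round-robin across a fixed pool of internal in-order queues. In a real implementation each queue slot would be created with clCreateCommandQueueWithProperties and consecutive levels would be separated by events or barriers; the six-command DAG here is a hypothetical example.

```c
#include <stdio.h>

#define NCMD   6
#define NQUEUE 2

int main(void)
{
    /* dep[i][j] != 0 means command i depends on command j; commands are
     * assumed to be topologically ordered (j < i for every dependency) */
    const int dep[NCMD][NCMD] = {
        {0},                /* 0: no dependencies    */
        {0},                /* 1: no dependencies    */
        {1, 0, 0, 0, 0, 0}, /* 2: depends on 0       */
        {0, 1, 0, 0, 0, 0}, /* 3: depends on 1       */
        {0, 0, 1, 1, 0, 0}, /* 4: depends on 2 and 3 */
        {0, 0, 0, 0, 1, 0}, /* 5: depends on 4       */
    };
    int level[NCMD];

    /* Longest-path depth of each command in the DAG */
    for (int i = 0; i < NCMD; i++) {
        level[i] = 0;
        for (int j = 0; j < i; j++)
            if (dep[i][j] && level[j] + 1 > level[i])
                level[i] = level[j] + 1;
    }

    /* Commands within a level are independent, so spread each level
     * round-robin over the internal in-order queues */
    for (int i = 0; i < NCMD; i++) {
        int pos = 0;
        for (int j = 0; j < i; j++)
            if (level[j] == level[i])
                pos++;
        printf("command %d -> level %d, internal queue %d\n",
               i, level[i], pos % NQUEUE);
    }

    return 0;
}
```

Running this places commands 0/1 and 2/3 on different queues (they can execute concurrently) while 4 and 5 serialise behind them, which is exactly the parallelism a runtime could discover statically at record time.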

@EwanC
Contributor

EwanC commented May 17, 2022

I don't think there's anything in the current spec that rules out an implementation creating internal queues if it wanted to, so long as the visible effects to the user are as if the queue passed to clEnqueueCommandBufferKHR was used and the returned event behaves as expected.

However, this implementation detail sounds like something that's better suited to higher-level APIs like CUDA and HIP. Have you considered using SYCL instead of OpenCL in your application? It might be at a closer level of abstraction to your other backends for this feature. I believe there are multiple vendors (including Codeplay) looking at adding graph functionality to their implementations, e.g. intel/llvm#5626

These questions are still very relevant, though, as we'd like the OpenCL command-buffer extension to be able to support higher-level graph APIs that could be layered on top.

@FreddieWitherden
Author

FreddieWitherden commented May 19, 2022

However, this implementation detail sounds like something that's better suited to higher-level APIs like CUDA and HIP. Have you considered using SYCL instead of OpenCL in your application?

I would consider both HIP and CUDA to be at the same level as OpenCL, if not lower. (At least when using them as APIs rather than via nvcc/hipcc.) Indeed, I'll note that graphs are included as part of the lowest-level CUDA driver API. For me, SYCL is very much at the same level as Raja and Kokkos. Given that our application is written in Python and does runtime code generation, anything without a C API is not really on the cards. (We may in the future write a Level Zero backend, although that API, with its focus on command lists, appears to have been designed by graphics people rather than compute people.)

I think the key thing to remember here is that OpenCL already has support for graphs. By combining out-of-order queues with events, one can readily express most task graphs. The issue is that the graph is effectively rebuilt repeatedly at run time. This results in extra API overhead (function calls, locking, and parameter validation), prevents the runtime from doing up-front scheduling, and causes a large number of events to be minted (which later need to be released). For example, on an RTX 2060 running a simple benchmark with CUDA graphs vs OpenCL graphs via an out-of-order queue and events:

```
$ pyfr run -bcuda -p inc_cylinder_2d.pyfrm inc_cylinder_2d.ini
100.0% [===========================================================>] 75.00/75.00 ela: 00:01:41 rem: 00:00:00
$ pyfr run -bopencl -p inc_cylinder_2d.pyfrm inc_cylinder_2d.ini
100.0% [===========================================================>] 75.00/75.00 ela: 00:04:24 rem: 00:00:00
```

(The inner loops are https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/cuda/types.py#L102 and https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/opencl/types.py#L105 respectively.) Unfortunately, as things stand, the draft command-buffer spec, although tantalisingly close, will not enable me to make a dent in this gap.
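
For concreteness, the per-iteration pattern described above looks roughly like the following hedged sketch (kernA, kernB, and the work size are placeholders; this is not PyFR's actual code). Every run of the "graph" re-enqueues each kernel on the out-of-order queue, mints events to express the dependencies, and must release them afterwards:

```c
#include <CL/cl.h>

/* One iteration of a two-kernel "graph" where B depends on A, expressed
 * with an out-of-order queue and events */
void run_iteration(cl_command_queue ooo_queue, cl_kernel kernA,
                   cl_kernel kernB, size_t gws)
{
    cl_event evA, evB;

    /* A has no dependencies, but must still mint an event for B to wait on */
    clEnqueueNDRangeKernel(ooo_queue, kernA, 1, NULL, &gws, NULL,
                           0, NULL, &evA);

    /* B depends on A via the event wait list */
    clEnqueueNDRangeKernel(ooo_queue, kernB, 1, NULL, &gws, NULL,
                           1, &evA, &evB);

    clWaitForEvents(1, &evB);

    /* Every event minted this iteration must later be released */
    clReleaseEvent(evA);
    clReleaseEvent(evB);
}
```

Each of these calls pays the validation, locking, and event-management cost on every iteration, whereas a command buffer would pay it once at record time.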

@EwanC
Contributor

EwanC commented Jun 6, 2022

Unfortunately, as things stand, the draft command-buffer spec, although tantalisingly close, will not enable me to make a dent in this gap.

This is an eye-catching point. To make sure we haven't missed something, am I right in understanding that to close this gap you feel we need to address all the limitations you've flagged with cl_khr_command_buffer:

  1. Host-side synchronization
  2. Updating kernel arguments
  3. Enabling a command-buffer to submit another command-buffer
  4. Allowing implementations to parallelize execution of the graph

Or is there something in particular about the implementation detail of using internal queues which is crucial to closing the performance gap on the devices you are targeting? I suppose simple applications may not require features 1, 2, & 3, but will still benefit from 4.

@FreddieWitherden
Author

So 3. and 4. are the minimum that is needed.

  3. has a few uses, mostly around interacting with library code; it is cleaner for a library to mint a command buffer, such that you only need to make a single call into that library, rather than asking it to add its own kernels to your command buffers (of which you may have dozens). Our code uses the equivalent CUDA functionality (graph embedding) to handle cuBLAS: we set a stream into record mode, call cuBLAS, have the stream mint us a graph, and then embed this graph in our own master graph with dependencies (see the first sketch after this list).

  4. is very much a performance thing, given that vendors don't do a great job with out-of-order queues (and static scheduling to multiple in-order queues is more efficient). It would also allow for good emulation of command buffers: given a recording made for an out-of-order queue, emulation code can analyse the buffer, see how much parallelism there is, and from this mint some number of in-order queues to submit the kernels to when the buffer is run. The savings here are in events (if B depends on A and both are submitted to the same in-order queue, then no event is needed, whereas one would be for an out-of-order queue) and in making the parallelism explicit to the runtime (see the second sketch after this list).
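
As an illustration of the graph-embedding pattern in 3., the following is a minimal sketch using the CUDA runtime API (the cuBLAS handle and the d_A/d_B/d_C device buffers are assumed to exist already; this is not PyFR's actual code):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Capture a cuBLAS call into a graph, then embed that graph as a child
 * node of a master graph with an explicit dependency */
void embed_gemm(cudaGraph_t master, cudaGraphNode_t dep,
                cublasHandle_t blas, int n,
                const float *d_A, const float *d_B, float *d_C)
{
    cudaStream_t stream;
    cudaGraph_t child;
    cudaGraphNode_t node;
    const float one = 1.0f, zero = 0.0f;

    cudaStreamCreate(&stream);

    /* Put the stream into record (capture) mode and call cuBLAS */
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSetStream(blas, stream);
    cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);

    /* The stream mints us a graph... */
    cudaStreamEndCapture(stream, &child);

    /* ...which is embedded in the master graph after the node `dep` */
    cudaGraphAddChildGraphNode(&node, master, &dep, 1, child);

    /* The child graph is cloned into the master, so it can be freed */
    cudaGraphDestroy(child);
    cudaStreamDestroy(stream);
}
```

And for the event savings in 4., the same A-then-B dependency from the earlier out-of-order sketch needs no events at all on an in-order queue, since ordering is implicit:

```c
#include <CL/cl.h>

/* B depends on A, both submitted to the same in-order queue: the queue
 * serialises them, so no events are minted or released */
void run_iteration_inorder(cl_command_queue in_order_queue,
                           cl_kernel kernA, cl_kernel kernB, size_t gws)
{
    clEnqueueNDRangeKernel(in_order_queue, kernA, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
    clEnqueueNDRangeKernel(in_order_queue, kernB, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
}
```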

@EwanC added the cl_khr_command_buffer label Apr 27, 2023