
Permissibility of Internal Queues in Command Buffers #800

Open
FreddieWitherden opened this issue May 16, 2022 · 4 comments
Labels
cl_khr_command_buffer Relating to the command-buffer family of extensions

Comments

@FreddieWitherden

One nice aspect of command buffers is that they can help facilitate the simultaneous execution of kernels. Although this is theoretically possible with out-of-order queues, it tends to work poorly in practice, with vendor support for such queues being sub-par.

Given this, a natural means of implementing command buffers is to scan the resulting graph to see how much parallelism is available. A number of internal (in-order) queues can then be created and, upon execution, kernels sent to these queues in a prescribed order. (This is how NVIDIA and AMD both implement their graph APIs under the hood, with the execution graph having some number of internal streams attached to it. The AMD implementation is open source, with https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/src/hip_graph_internal.cpp#L771 being a good starting point.) It is thus important that such a strategy can also be employed with command buffers.

A consequence, however, is that not all of the kernels in the buffer will go to the queue(s) specified in clEnqueueCommandBufferKHR. Additionally, this approach may have some interactions with future multi-queue applications of command buffers.
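
As an illustration of the strategy described above, the following is a minimal sketch (plain C, not taken from any vendor runtime) of the core scheduling step: compute each command's depth in the dependency DAG, then assign the commands at each depth round-robin across a fixed pool of internal in-order queues. In a real implementation each queue slot would be created with clCreateCommandQueueWithProperties and consecutive levels would be separated by events or barriers; the six-command DAG here is a hypothetical example.

```c
#include <stdio.h>

#define NCMD   6
#define NQUEUE 2

int main(void)
{
    /* dep[i][j] != 0 means command i depends on command j; commands are
     * assumed to be topologically ordered (j < i for every dependency) */
    const int dep[NCMD][NCMD] = {
        {0},                /* 0: no dependencies    */
        {0},                /* 1: no dependencies    */
        {1, 0, 0, 0, 0, 0}, /* 2: depends on 0       */
        {0, 1, 0, 0, 0, 0}, /* 3: depends on 1       */
        {0, 0, 1, 1, 0, 0}, /* 4: depends on 2 and 3 */
        {0, 0, 0, 0, 1, 0}, /* 5: depends on 4       */
    };
    int level[NCMD];

    /* Longest-path depth of each command in the DAG */
    for (int i = 0; i < NCMD; i++) {
        level[i] = 0;
        for (int j = 0; j < i; j++)
            if (dep[i][j] && level[j] + 1 > level[i])
                level[i] = level[j] + 1;
    }

    /* Commands within a level are independent, so spread each level
     * round-robin over the internal in-order queues */
    for (int i = 0; i < NCMD; i++) {
        int pos = 0;
        for (int j = 0; j < i; j++)
            if (level[j] == level[i])
                pos++;
        printf("command %d -> level %d, internal queue %d\n",
               i, level[i], pos % NQUEUE);
    }

    return 0;
}
```

Running this places commands 0/1 and 2/3 on different queues (they can execute concurrently) while 4 and 5 serialise behind them, which is exactly the parallelism a runtime could discover statically at record time.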

@EwanC
Contributor

EwanC commented May 17, 2022

I don't think there's anything in the current spec that rules out an implementation creating internal queues if it wanted to, so long as the visible effects to the user are as if the queue passed to clEnqueueCommandBufferKHR was used and the returned event behaves as expected.

However, this implementation detail sounds like something that's better suited to higher-level APIs like CUDA and HIP. Have you considered using SYCL instead of OpenCL in your application? It might be at a closer level of abstraction to your other backends for this feature. I believe there are multiple vendors (including Codeplay) looking at adding graph functionality to their implementations, e.g. intel/llvm#5626

These questions are still very relevant, though, as we'd like the OpenCL command-buffer extension to be able to support higher-level graph APIs that could be layered on top.

@FreddieWitherden
Author

FreddieWitherden commented May 19, 2022

However, this implementation detail sounds like something that's better suited to higher-level APIs like CUDA and HIP. Have you considered using SYCL instead of OpenCL in your application?

I would consider both HIP and CUDA to be at the same level as OpenCL, if not lower. (At least when using them as APIs rather than via nvcc/hipcc.) Indeed, I'll note that graphs are included as part of the lowest-level CUDA driver API. For me, SYCL is very much at the same level as Raja and Kokkos. Given that our application is written in Python and does runtime code generation, anything without a C API is not really on the cards. (We may in the future write a Level Zero backend, although that API, with its focus on command lists, appears to have been designed by graphics people rather than compute people.)

I think the key thing to remember here is that OpenCL already has support for graphs. By combining out-of-order queues with events, one can readily express most task graphs. The issue is that the graph is effectively rebuilt repeatedly at run time. This results in extra API overhead (function calls, locking, and parameter validation), prevents the runtime from doing up-front scheduling, and causes a large number of events to be minted (which later need to be released). For example, on an RTX 2060 running a simple benchmark with CUDA graphs vs OpenCL graphs via an out-of-order queue and events:

```
$ pyfr run -bcuda -p inc_cylinder_2d.pyfrm inc_cylinder_2d.ini
100.0% [===========================================================>] 75.00/75.00 ela: 00:01:41 rem: 00:00:00
$ pyfr run -bopencl -p inc_cylinder_2d.pyfrm inc_cylinder_2d.ini
100.0% [===========================================================>] 75.00/75.00 ela: 00:04:24 rem: 00:00:00
```

(The inner loops are https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/cuda/types.py#L102 and https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/opencl/types.py#L105 respectively.) Unfortunately, as things stand, the draft command-buffer spec, although tantalisingly close, will not enable me to make a dent in this gap.
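
For concreteness, the per-iteration pattern described above looks roughly like the following hedged sketch (kernA, kernB, and the work size are placeholders; this is not PyFR's actual code). Every run of the "graph" re-enqueues each kernel on the out-of-order queue, mints events to express the dependencies, and must release them afterwards:

```c
#include <CL/cl.h>

/* One iteration of a two-kernel "graph" where B depends on A, expressed
 * with an out-of-order queue and events */
void run_iteration(cl_command_queue ooo_queue, cl_kernel kernA,
                   cl_kernel kernB, size_t gws)
{
    cl_event evA, evB;

    /* A has no dependencies, but must still mint an event for B to wait on */
    clEnqueueNDRangeKernel(ooo_queue, kernA, 1, NULL, &gws, NULL,
                           0, NULL, &evA);

    /* B depends on A via the event wait list */
    clEnqueueNDRangeKernel(ooo_queue, kernB, 1, NULL, &gws, NULL,
                           1, &evA, &evB);

    clWaitForEvents(1, &evB);

    /* Every event minted this iteration must later be released */
    clReleaseEvent(evA);
    clReleaseEvent(evB);
}
```

Each of these calls pays the validation, locking, and event-management cost on every iteration, whereas a command buffer would pay it once at record time.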

@EwanC
Contributor

EwanC commented Jun 6, 2022

Unfortunately, as things stand, the draft command-buffer spec, although tantalisingly close, will not enable me to make a dent in this gap.

This is an eye-catching point. To make sure we haven't missed something, am I right in understanding that to close this gap you feel we need to address all the limitations you've flagged with cl_khr_command_buffer:

  1. Host-side synchronization
  2. Updating kernel arguments
  3. Enabling a command-buffer to submit another command-buffer
  4. Allowing implementations to parallelize execution of the graph

Or is there something in particular about the implementation detail of using internal queues which is crucial to closing the performance gap on the devices you are targeting? I suppose simple applications may not require features 1, 2, & 3, but will still benefit from 4.

@FreddieWitherden
Author

So 3. and 4. are the minimum that is needed.

  3. has a few uses, mostly around interacting with library code; it is cleaner for a library to mint a command buffer, such that you only need to make a single call into that library, rather than asking it to add its own kernels to your command buffers (of which you may have dozens). Our code uses the equivalent CUDA functionality (graph embedding) to handle cuBLAS: we set a stream into record mode, call cuBLAS, have the stream mint us a graph, and then embed this graph in our own master graph with dependencies (see the first sketch after this list).

  4. is very much a performance thing, given that vendors don't do a great job with out-of-order queues (and static scheduling to multiple in-order queues is more efficient). It would also allow for good emulation of command buffers: given a recording made for an out-of-order queue, emulation code can analyse the buffer, see how much parallelism there is, and from this mint some number of in-order queues to submit the kernels to when the buffer is run. The savings here are in events (if B depends on A and both are submitted to the same in-order queue, then no event is needed, whereas one would be for an out-of-order queue) and in making the parallelism explicit to the runtime (see the second sketch after this list).
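
As an illustration of the graph-embedding pattern in 3., the following is a minimal sketch using the CUDA runtime API (the cuBLAS handle and the d_A/d_B/d_C device buffers are assumed to exist already; this is not PyFR's actual code):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Capture a cuBLAS call into a graph, then embed that graph as a child
 * node of a master graph with an explicit dependency */
void embed_gemm(cudaGraph_t master, cudaGraphNode_t dep,
                cublasHandle_t blas, int n,
                const float *d_A, const float *d_B, float *d_C)
{
    cudaStream_t stream;
    cudaGraph_t child;
    cudaGraphNode_t node;
    const float one = 1.0f, zero = 0.0f;

    cudaStreamCreate(&stream);

    /* Put the stream into record (capture) mode and call cuBLAS */
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cublasSetStream(blas, stream);
    cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);

    /* The stream mints us a graph... */
    cudaStreamEndCapture(stream, &child);

    /* ...which is embedded in the master graph after the node `dep` */
    cudaGraphAddChildGraphNode(&node, master, &dep, 1, child);

    /* The child graph is cloned into the master, so it can be freed */
    cudaGraphDestroy(child);
    cudaStreamDestroy(stream);
}
```

And for the event savings in 4., the same A-then-B dependency from the earlier out-of-order sketch needs no events at all on an in-order queue, since ordering is implicit:

```c
#include <CL/cl.h>

/* B depends on A, both submitted to the same in-order queue: the queue
 * serialises them, so no events are minted or released */
void run_iteration_inorder(cl_command_queue in_order_queue,
                           cl_kernel kernA, cl_kernel kernB, size_t gws)
{
    clEnqueueNDRangeKernel(in_order_queue, kernA, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
    clEnqueueNDRangeKernel(in_order_queue, kernB, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
}
```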

@EwanC added the cl_khr_command_buffer label Apr 27, 2023