Permissibility of Internal Queues in Command Buffers #800
I don't think there's anything in the current spec that rules out an implementer from creating internal queues if they wanted to, so long as the visible effects to the user were as if the specified queue had been used.

However, this implementation detail sounds like something that's better suited to higher-level APIs like CUDA and HIP. Have you considered using SYCL instead of OpenCL in your application? It might be at a closer level of abstraction to your other backends for this feature. I believe there are multiple vendors (including Codeplay) looking at adding graph functionality to their implementations, e.g. intel/llvm#5626.

These questions are still very relevant though, as we'd like the OpenCL command-buffer extension to be able to support higher-level graph APIs that could be layered on top.
I would consider both HIP and CUDA to be at the same level as OpenCL, if not lower (at least when using them as APIs rather than via nvcc/hipcc). Indeed, I'll note that graphs are included as part of the lowest-level CUDA driver API. For me, SYCL is very much at the same level as RAJA and Kokkos. Given that our application is written in Python and does runtime code generation, anything without a C API is not really on the cards. (We may in the future write a Level Zero backend, although the API, with its focus on command lists, appears to have been designed by graphics people rather than compute people.)

I think the key thing to remember here is that OpenCL already has support for graphs: by combining out-of-order queues with events one can readily express most task graphs. The issue is that the graph is effectively rebuilt repeatedly at run-time. This results in extra API overhead (function calls, locking, and parameter validation), prevents the runtime from doing up-front scheduling, and causes a large number of events to be minted (which later need to be released). For example, on an RTX 2060, running a simple benchmark with CUDA graphs vs OpenCL graphs via an out-of-order queue and events:
(The inner loops are https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/cuda/types.py#L102 and https://github.com/FreddieWitherden/PyFR/blob/feature/cugraph/pyfr/backends/opencl/types.py#L105 respectively.) Unfortunately, as things stand, the draft command buffer spec, although tantalisingly close, will not enable me to make a dent in this gap.
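For concreteness, here is a minimal C sketch of the out-of-order-queue-plus-events pattern being described. The three kernels and their dependency structure are placeholders, not the PyFR code linked above; the point is that every enqueue, wait list, and event release below has to be re-issued on every iteration, which is exactly the per-iteration overhead a recorded command buffer could eliminate.

```c
/* A minimal sketch, assuming a hypothetical three-kernel graph (k_a and k_b
 * independent, k_c depending on both); error handling is omitted. */
#include <CL/cl.h>

void run_iteration(cl_command_queue ooo_queue, /* created with
                                                  CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE */
                   cl_kernel k_a, cl_kernel k_b, cl_kernel k_c, size_t gws)
{
    cl_event e_a, e_b, e_c;

    /* Two independent kernels; the runtime is free to overlap them */
    clEnqueueNDRangeKernel(ooo_queue, k_a, 1, NULL, &gws, NULL, 0, NULL, &e_a);
    clEnqueueNDRangeKernel(ooo_queue, k_b, 1, NULL, &gws, NULL, 0, NULL, &e_b);

    /* A third kernel depending on both, expressed via its event wait list */
    cl_event deps[2] = { e_a, e_b };
    clEnqueueNDRangeKernel(ooo_queue, k_c, 1, NULL, &gws, NULL, 2, deps, &e_c);

    clWaitForEvents(1, &e_c);

    /* Every event minted above must also be released, every iteration */
    clReleaseEvent(e_a);
    clReleaseEvent(e_b);
    clReleaseEvent(e_c);
}
```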
This is an eye-catching point. To make sure we haven't missed something, am I right in understanding that to close this gap you feel we need to address all the limitations you've flagged with cl_khr_command_buffer:
Or is there something in particular about the implementation detail of using internal queues which is crucial to closing the performance gap on the devices you are targeting? I suppose simple applications may not require features 1, 2, and 3, but will still benefit from 4.
So 1. and 2. are the minimum that is needed.
One nice aspect of command buffers is that they can help facilitate simultaneous execution of kernels. Although this is theoretically possible with out-of-order queues, it tends to work poorly in practice, with vendor support for such queues being sub-par.
Given this, a natural means of implementing command buffers is to scan the resulting graph to see how much parallelism is available. Then, a number of internal (in-order) queues can be created and, upon execution, kernels sent to these queues in a prescribed order. (This is how NVIDIA and AMD both implement their graph APIs under the hood, with the execution graph having some number of internal streams attached to it. The AMD implementation is open source, with https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/src/hip_graph_internal.cpp#L771 being a good starting point.) It is thus important that such a strategy can also be employed with command buffers.
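To make that strategy concrete, here is a conceptual C sketch of the levelisation idea. The data structures and helpers are hypothetical and not taken from hipamd or NVIDIA's implementation: commands at the same depth of the DAG have no dependency path between them, so they can be distributed across a pool of internal in-order queues.

```c
/* Conceptual sketch: levelise the recorded command DAG and spread commands
 * across a small pool of internal in-order queues. */
#include <stddef.h>

#define MAX_DEPS   8
#define MAX_DEPTH 64

typedef struct cmd_node {
    struct cmd_node *deps[MAX_DEPS]; /* commands this one must wait on */
    size_t           ndeps;
    int              queue_idx;      /* internal in-order queue to use */
} cmd_node;

/* Depth of a node: the length of the longest dependency chain beneath it.
 * Nodes at the same depth can have no dependency path between them, so they
 * are free to run concurrently. */
static int node_depth(const cmd_node *n)
{
    int depth = 0;
    for (size_t i = 0; i < n->ndeps; i++) {
        int d = node_depth(n->deps[i]) + 1;
        if (d > depth)
            depth = d;
    }
    return depth;
}

/* Spread the commands at each depth round-robin across nqueues internal
 * in-order queues.  At dispatch time the queues still need events or barriers
 * whenever a command's predecessor was issued to a different queue, but that
 * bookkeeping can be done once, at finalisation, rather than per submission. */
void assign_internal_queues(cmd_node **nodes, size_t nnodes, int nqueues)
{
    int per_depth[MAX_DEPTH] = { 0 };

    for (size_t i = 0; i < nnodes; i++) {
        int d = node_depth(nodes[i]);
        nodes[i]->queue_idx = per_depth[d]++ % nqueues;
    }
}
```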
A consequence, however, is that not all of the kernels in the buffer will go to the queue(s) specified in clEnqueueCommandBufferKHR. Additionally, it may have some interactions with future multi-queue applications for command buffers.
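For comparison with the out-of-order-queue version above, here is a hedged sketch of the same hypothetical three-kernel graph recorded once with the provisional cl_khr_command_buffer entry points. The argument lists follow the draft specification under discussion and may differ between revisions; in real code these functions would be obtained via clGetExtensionFunctionAddressForPlatform and errors would be checked.

```c
#include <CL/cl_ext.h>

cl_command_buffer_khr record_graph(cl_command_queue queue, cl_kernel k_a,
                                   cl_kernel k_b, cl_kernel k_c, size_t gws)
{
    cl_int err;
    cl_command_buffer_khr cmdbuf = clCreateCommandBufferKHR(1, &queue, NULL, &err);
    cl_sync_point_khr sp_a, sp_b;

    /* Dependencies are expressed with sync points rather than events and are
     * recorded once, up front */
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, k_a, 1, NULL, &gws, NULL,
                              0, NULL, &sp_a, NULL);
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, k_b, 1, NULL, &gws, NULL,
                              0, NULL, &sp_b, NULL);

    cl_sync_point_khr deps[2] = { sp_a, sp_b };
    clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, k_c, 1, NULL, &gws, NULL,
                              2, deps, NULL, NULL);

    clFinalizeCommandBufferKHR(cmdbuf);
    return cmdbuf;
}

void replay_iteration(cl_command_queue queue, cl_command_buffer_khr cmdbuf)
{
    /* The only point at which a queue is named at execution time; the question
     * above is whether an implementation may route individual commands to its
     * own internal queues instead. */
    clEnqueueCommandBufferKHR(1, &queue, cmdbuf, 0, NULL, NULL);
}
```

Here the queue only appears at replay time, in clEnqueueCommandBufferKHR, which is the call whose semantics the internal-queue strategy would relax.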