Commit 8e21a1d

[SYCL][Graph] Update doc for UR PR moving reset commands to a dedicated cmd-list
Update the design doc. Update the UR tag.
1 parent 571834c commit 8e21a1d

4 files changed, +91 -49 lines changed

sycl/doc/design/CommandGraph.md

+89 -41
@@ -224,59 +224,107 @@ there are no parameters to take a wait-list, and the only sync primitive
224224
returned is blocking on host.
225225

226226
In order to achieve the expected UR command-buffer enqueue semantics with Level
227-
Zero, the adapter implementation adds extra commands to the Level Zero
228-
command-list representing a UR command-buffer.
229-
230-
* Prefix - Commands added to the start of the L0 command-list by L0 adapter.
231-
* Suffix - Commands added to the end of the L0 command-list by L0 adapter.
232-
233-
These extra commands operate on L0 event synchronisation primitives, used by the
234-
command-list to interact with the external UR wait-list and UR return event
235-
required for the enqueue interface.
236-
237-
The `ur_exp_command_buffer_handle_t` class for this adapter contains a
238-
*SignalEvent* which signals the completion of the command-list in the suffix,
239-
and is reset in the prefix. This signal is detected by a new UR return event
240-
created on UR command-buffer enqueue.
241-
242-
There is also a *WaitEvent* used by the `ur_exp_command_buffer_handle_t` class
243-
in the prefix to wait on any dependencies passed in the enqueue wait-list.
244-
This WaitEvent is reset in the suffix.
245-
246-
A command-buffer is expected to be submitted multiple times. Consequently,
227+
Zero, the adapter implementation needs extra commands.
228+
229+
* Prefix - Commands added **before** the graph workload.
230+
* Suffix - Commands added **after** the graph workload.
231+
232+
These extra commands operate on L0 event synchronisation primitives,
233+
used by the command-list to interact with the external UR wait-list
234+
and UR return event required for the enqueue interface.
235+
Unlike the graph workload (i.e. the commands needed to perform the graph workload),
236+
the external UR wait-list and UR return event are submission dependent,
237+
which means they can change from one submission to the next.
238+
239+
For performance reasons, the command-list that will execute the graph
240+
workload is built only once (during the command-buffer finalization stage).
241+
This allows the adapter to save time when submitting the command-buffer,
242+
by executing only this command-list (i.e. without enqueuing any commands
243+
of the graph workload).
244+
245+
#### Prefix
246+
247+
The prefix's commands aim to:
248+
1. Handle the list of events to wait on, which is passed by the runtime
249+
when the UR command-buffer enqueue function is called.
250+
As mentioned above, this list of events changes from one submission
251+
to the next.
252+
Consequently, managing this mutable dependency in the graph-workload
253+
command-list implies rebuilding the command-list for each submission
254+
(note that this may change with mutable command-lists).
255+
To avoid the significant time penalty of rebuilding this potentially large
256+
command-list each time, we prefer to add an extra command handling the
257+
wait-list into another command-list (*wait command-list*).
258+
This command-list consists of a single L0 command: a barrier that waits for
259+
dependencies passed by the wait-list and signals an event
260+
called *WaitEvent* when the barrier is complete.
261+
This *WaitEvent* is defined in the `ur_exp_command_buffer_handle_t` class.
262+
At the front of the graph-workload command-list, an extra barrier command
263+
waiting for this event is added (when the command-buffer is created).
264+
This ensures that the graph workload does not start running before
265+
the dependencies have completed.
266+
The *WaitEvent* event is reset in the suffix.
267+
268+
269+
2. Reset events associated with the command-buffer except the
270+
*WaitEvent* event.
271+
Indeed, L0 events need to be explicitly reset by an API call
272+
(L0 command in our case).
273+
Since a command-buffer is expected to be submitted multiple times,
247274
we need to ensure that L0 events associated with graph commands have not
248275
been signaled by a previous execution. These events are therefore reset to the
249-
non-signaled state before running the actual graph associated commands. Note
276+
non-signaled state before running the graph-workload command-list. Note
250277
that this reset is performed in the prefix and not in the suffix to avoid
251278
additional synchronization w.r.t profiling data extraction.
252-
253-
If a command-buffer is about to be submitted to a queue with the profiling
254-
property enabled, an extra command that copies timestamps of L0 events
255-
associated with graph commands into a dedicated memory which is attached to the
256-
returned UR event. This memory stores the profiling information that
257-
corresponds to the current submission of the command-buffer.
258-
259-
![L0 command-buffer diagram](images/L0_UR_command-buffer-v3.jpg)
279+
We use a new command-list (*reset command-list*) for performance reasons.
280+
Indeed:
281+
* This allows the *WaitEvent* to be signaled directly on the host if
282+
the wait-list is empty, thus avoiding the need to submit a command-list.
283+
* Enqueuing a reset L0 command for all events in the command-buffer is time
284+
consuming, especially for large graphs.
285+
However, this task is not needed for every submission, but only once, when the
286+
command-buffer is fixed, i.e. when the command-buffer is finalized. The
287+
decoupling of the reset command-list from the wait command-list allows us to
288+
create and enqueue the reset commands when finalizing the command-buffer,
289+
and only create the wait command-list at submission.
290+
291+
This command-list consists of a reset command for each of the graph commands
292+
and another reset command for the event we use to signal the completion
293+
of the graph workload. This event is called *SignalEvent* and is defined in
294+
the `ur_exp_command_buffer_handle_t` class.
295+
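The prefix described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the adapter's code: `Event`, `CommandList`, `CommandBuffer`, `finalize` and `runPrefix` are hypothetical names, events are reduced to a boolean, and command-lists run sequentially on the host. It shows the reset command-list being recorded once at finalization and the wait command-list being created only when the wait-list is non-empty.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an L0 event: only tracks the signaled state.
struct Event { bool signaled = false; };

// Hypothetical stand-in for an L0 command-list: an ordered list of closures.
using CommandList = std::vector<std::function<void()>>;

void execute(const CommandList &cl) {
  for (const auto &cmd : cl) cmd();
}

struct CommandBuffer {
  Event waitEvent;                 // *WaitEvent* from the text
  Event signalEvent;               // *SignalEvent* from the text
  std::vector<Event> graphEvents;  // one event per graph command
  CommandList resetList;           // *reset command-list*, built once
};

// Finalization: record one reset per graph event plus one for SignalEvent.
// WaitEvent is left out on purpose; the text resets it in the suffix.
// graphEvents must not be resized afterwards (the closures hold pointers).
void finalize(CommandBuffer &cb) {
  for (auto &e : cb.graphEvents) {
    Event *ev = &e;
    cb.resetList.push_back([ev] { ev->signaled = false; });
  }
  Event *sig = &cb.signalEvent;
  cb.resetList.push_back([sig] { sig->signaled = false; });
}

// Submission-time prefix. Returns how many command-lists were executed.
int runPrefix(CommandBuffer &cb, const std::vector<Event *> &waitList) {
  execute(cb.resetList);
  if (waitList.empty()) {
    cb.waitEvent.signaled = true;  // host-side signal, no wait command-list
    return 1;
  }
  // *wait command-list*: a single barrier on the wait-list that signals
  // WaitEvent. In this sequential model the dependencies must already be done.
  CommandList waitList1Cmd;
  waitList1Cmd.push_back([&] {
    for (const Event *dep : waitList) assert(dep->signaled);
    cb.waitEvent.signaled = true;
  });
  execute(waitList1Cmd);
  return 2;
}
```

In this model a resubmission only replays `resetList` and, if needed, builds a one-command wait list; nothing proportional to the graph size is re-recorded at enqueue time.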
296+
#### Suffix
297+
298+
The suffix's commands aim to:
299+
1) Handle the completion of the graph workload and signal
300+
a UR return event.
301+
Thus, at the end of the graph workload command-list a command, which
302+
signals the *SignalEvent*, is added (when the command-buffer is finalized).
303+
In an additional command-list (*signal command-list*), a barrier waiting for
304+
this event is also added.
305+
This barrier signals, in turn, the UR return event that has been defined by
306+
the runtime layer when calling the `urCommandBufferEnqueueExp` function.
307+
308+
2) Manage the profiling. If a command-buffer is about to be submitted to
309+
a queue with the profiling property enabled, an extra command is added that copies
310+
timestamps of L0 events associated with graph commands into a dedicated
311+
memory which is attached to the returned UR event.
312+
This memory stores the profiling information that corresponds to
313+
the current submission of the command-buffer.
314+
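The reason for copying timestamps into memory attached to the returned event can be illustrated with a small sketch. All names here (`Timestamps`, `ReturnEvent`, `copyTimestamps`) are hypothetical and the layout of the real profiling memory is not shown in this document; the point is only that each submission keeps its own snapshot, so a later submission that resets and reuses the graph events cannot overwrite it.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical timestamp pair recorded by one graph command's event.
struct Timestamps {
  std::uint64_t start;
  std::uint64_t end;
};

// Hypothetical UR return event that owns the profiling data of one submission.
struct ReturnEvent {
  std::shared_ptr<std::vector<Timestamps>> profiling;
};

// Models the extra suffix command: snapshot the current timestamps of the
// graph events into dedicated memory attached to the return event.
void copyTimestamps(const std::vector<Timestamps> &graphEventTimestamps,
                    ReturnEvent &re) {
  re.profiling =
      std::make_shared<std::vector<Timestamps>>(graphEventTimestamps);
}
```

A caller would take one snapshot per enqueue; the data read through an earlier return event then stays valid after the command-buffer is submitted again.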
315+
![L0 command-buffer diagram](images/L0_UR_command-buffer-v5.jpg)
260316

261317
For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*,
262-
command-buffer *CB*, and return event *RE* our implementation has to submit two
263-
new command-lists for the above approach to work. One before
318+
command-buffer *CB*, and return event *RE* our implementation has to submit
319+
three new command-lists for the above approach to work. Two before
264320
the command-list with extra commands associated with *CB*, and the other
265-
after *CB*. These two new command-lists are retrieved from the UR queue, which
321+
after *CB*. These new command-lists are retrieved from the UR queue, which
266322
will likely reuse existing command-lists and only create a new one in the worst
267323
case.
268324
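The ordering of the command-lists for one `urCommandBufferEnqueueExp` call can be sketched as a sequential toy model. This is an assumption-laden illustration rather than the adapter's implementation: `Event`, `CommandBuffer` and `enqueue` are hypothetical, events are plain booleans, and the exact place where *WaitEvent* is reset within the suffix is a guess made for the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an L0 or UR event.
struct Event { bool signaled = false; };

// Hypothetical command-buffer holding only the two events named in the text.
struct CommandBuffer {
  Event waitEvent;    // *WaitEvent*
  Event signalEvent;  // *SignalEvent*
};

// Models one enqueue of CB with event list EL and return event RE.
// Returns the names of the command-lists "executed", in order.
std::vector<std::string> enqueue(CommandBuffer &cb,
                                 const std::vector<Event *> &eventList,
                                 Event &returnEvent) {
  std::vector<std::string> trace;

  // Before CB (1/2): reset command-list, recorded at finalization.
  cb.signalEvent.signaled = false;
  trace.push_back("reset");

  // Before CB (2/2): wait command-list, a barrier on EL signaling WaitEvent.
  // With an empty EL, WaitEvent is signaled from the host instead.
  if (eventList.empty()) {
    cb.waitEvent.signaled = true;
  } else {
    for (const Event *dep : eventList) assert(dep->signaled);
    cb.waitEvent.signaled = true;
    trace.push_back("wait");
  }

  // CB: graph-workload command-list, gated on WaitEvent by its first barrier.
  assert(cb.waitEvent.signaled);
  trace.push_back("graph");
  cb.signalEvent.signaled = true;  // last command signals SignalEvent
  cb.waitEvent.signaled = false;   // WaitEvent reset (placement assumed)

  // After CB: signal command-list, a barrier on SignalEvent signaling RE.
  assert(cb.signalEvent.signaled);
  returnEvent.signaled = true;
  trace.push_back("signal");
  return trace;
}
```

Calling `enqueue` twice on the same `CommandBuffer` works because every event is back in the non-signaled state before it is waited on again.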

269-
The L0 command-list created on `urCommandBufferEnqueueExp` to execute **before**
270-
*CB* contains a single command. This command is a barrier on *EL* that signals
271-
*CB*'s *WaitEvent* when completed.
272-
273-
The L0 command-list created on `urCommandBufferEnqueueExp` to execute **after**
274-
*CB* also contains a single command. This command is a barrier on *CB*'s
275-
*SignalEvent* that signals *RE* when completed.
276-
277325
#### Drawbacks
278326

279-
There are two drawbacks of this approach to implementing UR command-buffers for
327+
There are three drawbacks of this approach to implementing UR command-buffers for
280328
Level Zero:
281329

282330
1. 3x the command-list resources are used, if there are many UR command-buffers in
Binary file not shown.

sycl/plugins/unified_runtime/CMakeLists.txt

+2 -8
@@ -56,14 +56,8 @@ endif()
5656
if(SYCL_PI_UR_USE_FETCH_CONTENT)
5757
include(FetchContent)
5858

59-
set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
60-
# commit cfba9f160528018055881f1ccf9ab98ec59c963f
61-
# Merge: 0bb2cad8 db5c33b2
62-
# Author: Kenneth Benzie (Benie) <[email protected]>
63-
# Date: Wed Feb 14 11:17:21 2024 +0100
64-
# Merge pull request #1216 from igchor/umf_standalone
65-
# [UR] Remove UMF sources and use standalone UMF repo instead
66-
set(UNIFIED_RUNTIME_TAG cfba9f160528018055881f1ccf9ab98ec59c963f)
59+
set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")
60+
set(UNIFIED_RUNTIME_TAG maxime/optim-command-buffer-submission)
6761

6862
if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO)
6963
set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}")
