Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Collection of L0 performance improvements #360

Draft
wants to merge 18 commits into
base: sycl
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
8e21a1d
[SYCL][Graph] Update doc for UR PR moving reset commands to a dedicat…
mfrancepillois Feb 20, 2024
ed730fe
Merge branch 'sycl' into maxime/UR-improve-ZE-enqueue-delay
mfrancepillois Feb 20, 2024
8c5dea5
[SYCL][Graph] Enable in-order cmd-list
mfrancepillois Feb 23, 2024
f671539
Add test for cehcking profiling when in-order command-list enabled
mfrancepillois Feb 26, 2024
15f02c0
Change property to `enable_profiling` + typo
mfrancepillois Feb 27, 2024
0527464
Propagate enableProfiling property to UR.
mfrancepillois Feb 27, 2024
b55a301
Update spec
mfrancepillois Feb 27, 2024
d885aa0
Update sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
mfrancepillois Feb 28, 2024
b922a77
Add exception throwing for all backends if enable_profiling property …
mfrancepillois Feb 28, 2024
c084e19
Typos
mfrancepillois Feb 28, 2024
49630d8
Update sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
mfrancepillois Feb 29, 2024
e4ee57e
Move test to unitest + typo
mfrancepillois Feb 29, 2024
0318696
Pass prop-list to `executable_command_graph` constructor + typos
mfrancepillois Mar 6, 2024
3fba2c7
Merge branch 'maxime/UR-improve-ZE-enqueue-delay' into maxime/l0-perf…
mfrancepillois Mar 6, 2024
94186aa
Merge branch 'maxime/single-path-graph-optimization' into maxime/l0-p…
mfrancepillois Mar 6, 2024
b233136
Update UR tag
mfrancepillois Mar 6, 2024
6079ab1
Add define to force imm-cmd-list
mfrancepillois Mar 7, 2024
3ea8d02
Merge branch 'sycl' into maxime/l0-perf-improvement
mfrancepillois Mar 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 89 additions & 41 deletions sycl/doc/design/CommandGraph.md
Original file line number Diff line number Diff line change
Expand Up @@ -250,59 +250,107 @@ there are no parameters to take a wait-list, and the only sync primitive
returned is blocking on host.

In order to achieve the expected UR command-buffer enqueue semantics with Level
Zero, the adapter implementation adds extra commands to the Level Zero
command-list representing a UR command-buffer.

* Prefix - Commands added to the start of the L0 command-list by L0 adapter.
* Suffix - Commands added to the end of the L0 command-list by L0 adapter.

These extra commands operate on L0 event synchronisation primitives, used by the
command-list to interact with the external UR wait-list and UR return event
required for the enqueue interface.

The `ur_exp_command_buffer_handle_t` class for this adapter contains a
*SignalEvent* which signals the completion of the command-list in the suffix,
and is reset in the prefix. This signal is detected by a new UR return event
created on UR command-buffer enqueue.

There is also a *WaitEvent* used by the `ur_exp_command_buffer_handle_t` class
in the prefix to wait on any dependencies passed in the enqueue wait-list.
This WaitEvent is reset in the suffix.

A command-buffer is expected to be submitted multiple times. Consequently,
Zero, the adapter implementation needs extra commands.

* Prefix - Commands added **before** the graph workload.
* Suffix - Commands added **after** the graph workload.

These extra commands operate on L0 event synchronisation primitives,
used by the command-list to interact with the external UR wait-list
and UR return event required for the enqueue interface.
Unlike the graph workload (i.e. commands needed to perform the graph workload)
the external UR wait-list and UR return event are submission dependent,
which mean they can change from one submission to the next.

For performance concerns, the command-list that will execute the graph
workload is made only once (during the command-buffer finalization stage).
This allows the adapter to save time when submitting the command-buffer,
by executing only this command-list (i.e. without enqueuing any commands
of the graph workload).

#### Prefix

The prefix's commands aim to:
1. Handle the the list on events to wait on, which is passed by the runtime
when the UR command-buffer enqueue function is called.
As mentioned above, this list of events changes from one submission
to the next.
Consequently, managing this mutable dependency in the graph-workload
command-list implies rebuilding the command-list for each submission
(note that this can change with mutable command-list).
To avoid the signifiant time penalty of rebuilding this potentially large
command-list each time, we prefer to add an extra command handling the
wait list into another command-list (*wait command-list*).
This command-list consists of a single L0 command: a barrier that waits for
dependencies passed by the wait-list and signals a signal
called *WaitEvent* when the barrier is complete.
This *WaitEvent* is defined in the `ur_exp_command_buffer_handle_t` class.
In the front of the graph workload command list, an extra barrier command
waiting for this event is added (when the command-buffer is created).
This ensures that the graph workload does not start running before
the dependencies to be completed.
The *WaitEvent* event is reset in the suffix.


2. Reset events associated with the command-buffer except the
*WaitEvent* event.
Indeed, L0 events needs to be explicitly reset by an API call
(L0 command in our case).
Since a command-buffer is expected to be submitted multiple times,
we need to ensure that L0 events associated with graph commands have not
been signaled by a previous execution. These events are therefore reset to the
non-signaled state before running the actual graph associated commands. Note
non-signaled state before running the graph-workload command-list. Note
that this reset is performed in the prefix and not in the suffix to avoid
additional synchronization w.r.t profiling data extraction.

If a command-buffer is about to be submitted to a queue with the profiling
property enabled, an extra command that copies timestamps of L0 events
associated with graph commands into a dedicated memory which is attached to the
returned UR event. This memory stores the profiling information that
corresponds to the current submission of the command-buffer.

![L0 command-buffer diagram](images/L0_UR_command-buffer-v3.jpg)
We use a new command list (*reset command-list*) for performance concerns.
Indeed:
* This allows the *WaitEvent* to be signaled directly on the host if
the waiting list is empty, thus avoiding the need to submit a command list.
* Enqueuing a reset L0 command for all events in the command-buffer is time
consumming, especially for large graphs.
However, this task is not needed for every submission, but only once, when the
command-buffer is fixed, i.e. when the command-buffer is finalized. The
decorellation between the reset command-list and the wait command-list allow us to
create and enqueue the reset commands when finalizing the command-buffer,
and only create the wait command-list at submission.

This command list is consist of a reset command for each of the graph commands
and another reset command for resetting the signal we use to signal the completion
of the graph workload. This signal is called *SignalEvent* and is defined in
in the `ur_exp_command_buffer_handle_t` class.

#### Suffix

The suffix's commands aim to:
1) Handle the completion of the graph workload and signal
an UR return event.
Thus, at the end of the graph workload command-list a command, which
signals the *SignalEvent*, is added (when the command-buffer is finalized).
In an additional command-list (*signal command-list*), a barrier waiting for
this event is also added.
This barrier signals, in turn, the UR return event that has be defined by
the runtime layer when calling the `urCommandBufferEnqueueExp` function.

2) Manage the profiling. If a command-buffer is about to be submitted to
a queue with the profiling property enabled, an extra command that copies
timestamps of L0 events associated with graph commands into a dedicated
memory which is attached to the returned UR event.
This memory stores the profiling information that corresponds to
the current submission of the command-buffer.

![L0 command-buffer diagram](images/L0_UR_command-buffer-v5.jpg)

For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*,
command-buffer *CB*, and return event *RE* our implementation has to submit two
new command-lists for the above approach to work. One before
command-buffer *CB*, and return event *RE* our implementation has to submit
three new command-lists for the above approach to work. Two before
the command-list with extra commands associated with *CB*, and the other
after *CB*. These two new command-lists are retrieved from the UR queue, which
after *CB*. These new command-lists are retrieved from the UR queue, which
will likely reuse existing command-lists and only create a new one in the worst
case.

The L0 command-list created on `urCommandBufferEnqueueExp` to execute **before**
*CB* contains a single command. This command is a barrier on *EL* that signals
*CB*'s *WaitEvent* when completed.

The L0 command-list created on `urCommandBufferEnqueueExp` to execute **after**
*CB* also contains a single command. This command is a barrier on *CB*'s
*SignalEvent* that signals *RE* when completed.

#### Drawbacks

There are two drawbacks of this approach to implementing UR command-buffers for
There are three drawbacks of this approach to implementing UR command-buffers for
Level Zero:

1. 3x the command-list resources are used, if there are many UR command-buffers in
Expand Down
Binary file removed sycl/doc/design/images/L0_UR_command-buffer-v3.jpg
Binary file not shown.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
27 changes: 25 additions & 2 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -601,6 +601,28 @@ which is used in a graph will be kept alive on the host for the lifetime of the
graph. Destroying that buffer during the lifetime of a `command_graph`
constructed with this property results in undefined behavior.

===== Enable-Profiling Property [[enable-profiling]]

[source,c++]
----
namespace sycl::ext::oneapi::experimental::property::graph {
class enable_profiling {
public:
enable_profiling() = default;
};
}
----

The `property::graph::enable_profiling` property can be passed to the
`command_graph::finalize()` function and enables profiling support
for the returned `command_graph<graph_state::executable>`.
Passing this property to the finalize function implies disabling
certain optimizations to enable graph profiling.
As a result, the execution time of a graph finalized with profiling enabled
is longer than that of a graph without profiling capability.
An error will be thrown when attempting to profile an event
from a graph submission that was created without this property.

==== Graph Member Functions

Table {counter: tableNumber}. Constructor of the `command_graph` class.
Expand Down Expand Up @@ -804,8 +826,9 @@ Constraints:

Parameters:

* `propList` - Optional parameter for passing properties. No finalization
properties are defined by this extension.
* `propList` - Optional parameter for passing properties. The only defined
property is `enable_profiling`. See <<enable-profiling, Enable-Profiling>>
for more details.

Returns: A new executable graph object which can be submitted to a queue.

Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -2324,7 +2324,8 @@ typedef enum {
struct pi_ext_command_buffer_desc final {
pi_ext_structure_type stype;
const void *pNext;
pi_queue_properties *properties;
pi_bool is_in_order;
pi_bool enable_profiling;
};

/// API to create a command-buffer.
Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/detail/property_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -47,8 +47,9 @@ enum DataLessPropKind {
GraphAssumeDataOutlivesBuffer = 22,
GraphAssumeBufferOutlivesGraph = 23,
GraphDependOnAllLeaves = 24,
GraphEnableProfiling = 25,
// Indicates the last known dataless property.
LastKnownDataLessPropKind = 24,
LastKnownDataLessPropKind = 25,
// Exceeding 32 may cause ABI breaking change on some of OSes.
DataLessPropKindSize = 32
};
Expand Down
14 changes: 13 additions & 1 deletion sycl/include/sycl/ext/oneapi/experimental/graph.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,15 @@ class assume_buffer_outlives_graph
public:
assume_buffer_outlives_graph() = default;
};

/// Property used to enable graph profiling.
/// Passing this property to the `command_graph::finalize()` function
/// ensures that profiling can be used on the generated graph.
class enable_profiling : public ::sycl::detail::DataLessProperty<
::sycl::detail::GraphEnableProfiling> {
public:
enable_profiling() = default;
};
} // namespace graph

namespace node {
Expand Down Expand Up @@ -340,8 +349,11 @@ class __SYCL_EXPORT executable_command_graph {
/// Constructor used by internal runtime.
/// @param Graph Detail implementation class to construct with.
/// @param Ctx Context to use for graph.
/// @param PropList Optional list of properties to pass.
executable_command_graph(const std::shared_ptr<detail::graph_impl> &Graph,
const sycl::context &Ctx);
const sycl::context &Ctx,
const property_list &PropList = {});


template <class Obj>
friend decltype(Obj::impl)
Expand Down
12 changes: 3 additions & 9 deletions sycl/plugins/unified_runtime/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -56,15 +56,9 @@ endif()
if(SYCL_PI_UR_USE_FETCH_CONTENT)
include(FetchContent)

set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
# commit 93902794c09f1c8cb5163089f0eacfe55c1f1b86
# Merge: cc268e58 5518b489
# Author: Kenneth Benzie (Benie) <[email protected]>
# Date: Fri Mar 8 09:48:44 2024 +0000
# Merge pull request #1419 from nrspruit/main_l0_adapter_release_lib
# [L0] Create/Destroy Adapter Handle during lib init
set(UNIFIED_RUNTIME_TAG 93902794c09f1c8cb5163089f0eacfe55c1f1b86)

set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")
set(UNIFIED_RUNTIME_TAG maxime/L0-optimizations)

if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO)
set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}")
endif()
Expand Down
8 changes: 5 additions & 3 deletions sycl/plugins/unified_runtime/pi2ur.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -4462,13 +4462,15 @@ piextCommandBufferCreate(pi_context Context, pi_device Device,
ur_context_handle_t UrContext =
reinterpret_cast<ur_context_handle_t>(Context);
ur_device_handle_t UrDevice = reinterpret_cast<ur_device_handle_t>(Device);
const ur_exp_command_buffer_desc_t *UrDesc =
reinterpret_cast<const ur_exp_command_buffer_desc_t *>(Desc);
ur_exp_command_buffer_desc_t UrDesc;
UrDesc.stype = UR_STRUCTURE_TYPE_EXP_COMMAND_BUFFER_DESC;
UrDesc.isInOrder = Desc->is_in_order;
UrDesc.enableProfiling = Desc->enable_profiling;
ur_exp_command_buffer_handle_t *UrCommandBuffer =
reinterpret_cast<ur_exp_command_buffer_handle_t *>(RetCommandBuffer);

HANDLE_ERRORS(
urCommandBufferCreateExp(UrContext, UrDevice, UrDesc, UrCommandBuffer));
urCommandBufferCreateExp(UrContext, UrDevice, &UrDesc, UrCommandBuffer));

return PI_SUCCESS;
}
Expand Down
4 changes: 3 additions & 1 deletion sycl/source/detail/event_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -293,6 +293,8 @@ class event_impl {
return MPostCompleteEvents;
}

void setProfilingEnabled(bool Value) { MIsProfilingEnabled = Value; }

protected:
// When instrumentation is enabled emits trace event for event wait begin and
// returns the telemetry event generated for the wait
Expand All @@ -316,7 +318,7 @@ class event_impl {
std::unique_ptr<HostProfilingInfo> MHostProfilingInfo;
void *MCommand = nullptr;
std::weak_ptr<queue_impl> MQueue;
const bool MIsProfilingEnabled = false;
bool MIsProfilingEnabled = false;
const bool MFallbackProfiling = false;

std::weak_ptr<queue_impl> MWorkerQueue;
Expand Down
25 changes: 19 additions & 6 deletions sycl/source/detail/graph_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
//
//===----------------------------------------------------------------------===//

#define FORCE_IMM_CMD_LIST 0

#include <detail/graph_impl.hpp>
#include <detail/handler_impl.hpp>
#include <detail/kernel_arg_mask.hpp>
Expand Down Expand Up @@ -265,6 +267,7 @@ void exec_graph_impl::makePartitions() {
}
if (Partition->MRoots.size() > 0) {
Partition->schedule();
Partition->checkIfGraphIsSinglePath();
MPartitions.push_back(Partition);
PartitionFinalNum++;
}
Expand Down Expand Up @@ -680,7 +683,13 @@ sycl::detail::pi::PiExtSyncPoint exec_graph_impl::enqueueNode(
void exec_graph_impl::createCommandBuffers(
sycl::device Device, std::shared_ptr<partition> &Partition) {
sycl::detail::pi::PiExtCommandBuffer OutCommandBuffer;
sycl::detail::pi::PiExtCommandBufferDesc Desc{};
auto imm_cmd_list = (Partition->MIsInOrderGraph && !MEnableProfiling);
#if FORCE_IMM_CMD_LIST == 1
imm_cmd_list = true;
#endif
sycl::detail::pi::PiExtCommandBufferDesc Desc{
pi_ext_structure_type::PI_EXT_STRUCTURE_TYPE_COMMAND_BUFFER_DESC, nullptr,
pi_bool(imm_cmd_list), pi_bool(MEnableProfiling)};
auto ContextImpl = sycl::detail::getSyclObjImpl(MContext);
const sycl::detail::PluginPtr &Plugin = ContextImpl->getPlugin();
auto DeviceImpl = sycl::detail::getSyclObjImpl(Device);
Expand Down Expand Up @@ -950,6 +959,9 @@ exec_graph_impl::enqueue(const std::shared_ptr<sycl::detail::queue_impl> &Queue,
NewEvent->attachEventToComplete(Elem.second);
}
}
if (!MEnableProfiling) {
NewEvent->setProfilingEnabled(false);
}
sycl::event QueueEvent =
sycl::detail::createSyclObjFromImpl<sycl::event>(NewEvent);
return QueueEvent;
Expand Down Expand Up @@ -1160,12 +1172,12 @@ void modifiable_command_graph::make_edge(node &Src, node &Dest) {
}

command_graph<graph_state::executable>
modifiable_command_graph::finalize(const sycl::property_list &) const {
modifiable_command_graph::finalize(const sycl::property_list &PropList) const {
// Graph is read and written in this scope so we lock
// this graph with full priviledges.
graph_impl::WriteLock Lock(impl->MMutex);
return command_graph<graph_state::executable>{this->impl,
this->impl->getContext()};
return command_graph<graph_state::executable>{
this->impl, this->impl->getContext(), PropList};
}

bool modifiable_command_graph::begin_recording(queue &RecordingQueue) {
Expand Down Expand Up @@ -1279,8 +1291,9 @@ std::vector<node> modifiable_command_graph::get_root_nodes() const {
}

executable_command_graph::executable_command_graph(
const std::shared_ptr<detail::graph_impl> &Graph, const sycl::context &Ctx)
: impl(std::make_shared<detail::exec_graph_impl>(Ctx, Graph)) {
const std::shared_ptr<detail::graph_impl> &Graph, const sycl::context &Ctx,
const property_list &PropList)
: impl(std::make_shared<detail::exec_graph_impl>(Ctx, Graph, PropList)) {
finalizeImpl(); // Create backend representation for executable graph
}

Expand Down
Loading
Loading