Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Graph] Enable L0 optimizations (no profiling mode) #358

Closed
wants to merge 12 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 42 additions & 4 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -341,6 +341,7 @@ class assume_buffer_outlives_graph {
public:
assume_buffer_outlives_graph() = default;
};

EwanC marked this conversation as resolved.
Show resolved Hide resolved
} // namespace graph

namespace node {
Expand All @@ -356,6 +357,11 @@ class depends_on_all_leaves {
depends_on_all_leaves() = default;
};

class enable_profiling {
public:
enable_profiling() = default;
};

} // namespace node
} // namespace property

Expand Down Expand Up @@ -580,6 +586,31 @@ class depends_on_all_leaves {
}
----

===== Enable-Profiling Property [[enable-profiling]]

The `property::graph::enable_profiling` property can be passed to a
`command_graph::add()` function and enables profiling support
for the node in the `command_graph<graph_state::executable>`.
Passing this property implies disabling certain optimizations.
This is why profiling is by default disabled on graphs, unless users
explicitly require it using either the `property::graph::enable_profiling`
property in building mode or the `property::queue::enable_profiling` on
the recorded queue (Record&Replay API).
As a result, the execution time of a graph finalized with profiling enabled
is longer than that of a graph without profiling capability.
An error will be thrown when attempting to profile an event
from a graph submission that was created without this property.

[source,c++]
----
namespace sycl::ext::oneapi::experimental::property::node {
class enable_profiling {
public:
enable_profiling() = default;
};
}
----

=== Graph

This extension adds a new `command_graph` object which follows the
Expand Down Expand Up @@ -744,6 +775,8 @@ Parameters:
* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.
The `enable_profiling` property enables the profiling of this node.
See <<enable-profiling, Enable-Profiling>> for more details.


Returns: The empty node which has been added to the graph.
Expand Down Expand Up @@ -781,6 +814,9 @@ Parameters:
* `propList` - Zero or more properties can be provided to the constructed node
via an instance of `property_list`. The `property::node::depends_on` property
can be passed here with a list of nodes to create dependency edges on.
The `enable_profiling` property enables the profiling of this node.
See <<enable-profiling, Enable-Profiling>> for more details.


Returns: The command-group function object node which has been added to the graph.

Expand Down Expand Up @@ -1073,10 +1109,12 @@ ways:
an implicit dependency before and after the graph execution, as if the graph
execution is one command-group submitted to the in-order queue.

2. `property::queue::enable_profiling` - This property has no effect on graph
recording. When set on the queue a graph is submitted to however, it allows
profiling information to be obtained from the event returned by a graph
submission. As it is not defined how a submitted graph will be split up for
2. `property::queue::enable_profiling` - This property must be set on the queue
in recording mode if users want to profile the commands recorded to
the graph.This property must also be set on the queue the queue a graph is
submitted to. It allows profiling information to be obtained from the event
returned by a graph submission.
As it is not defined how a submitted graph will be split up for
scheduling at runtime, the `uint64_t` timestamp reported from a profiling
query on a graph execution event has the following semantics, which may be
pessimistic about execution time on device.
Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -2304,7 +2304,8 @@ typedef enum {
struct pi_ext_command_buffer_desc final {
pi_ext_structure_type stype;
const void *pNext;
pi_queue_properties *properties;
pi_bool is_in_order;
pi_bool enable_profiling;
};

/// API to create a command-buffer.
Expand Down
3 changes: 2 additions & 1 deletion sycl/include/sycl/detail/property_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -47,8 +47,9 @@ enum DataLessPropKind {
GraphAssumeDataOutlivesBuffer = 22,
GraphAssumeBufferOutlivesGraph = 23,
GraphDependOnAllLeaves = 24,
GraphEnableProfiling = 25,
// Indicates the last known dataless property.
LastKnownDataLessPropKind = 24,
LastKnownDataLessPropKind = 25,
// Exceeding 32 may cause ABI breaking change on some of OSes.
DataLessPropKindSize = 32
};
Expand Down
33 changes: 25 additions & 8 deletions sycl/include/sycl/ext/oneapi/experimental/graph.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,15 @@ class depends_on_all_leaves : public ::sycl::detail::DataLessProperty<
depends_on_all_leaves() = default;
};

/// Property used to enable node profiling.
/// Passing this property to the `command_graph::add()` function
/// ensures that profiling can be queried on this node.
class enable_profiling : public ::sycl::detail::DataLessProperty<
::sycl::detail::GraphEnableProfiling> {
public:
enable_profiling() = default;
};

} // namespace node
} // namespace property

Expand All @@ -200,15 +209,17 @@ class __SYCL_EXPORT modifiable_command_graph {
/// @param PropList Property list used to pass [0..n] predecessor nodes.
/// @return Constructed empty node which has been added to the graph.
node add(const property_list &PropList = {}) {
bool EnableProfiling =
PropList.has_property<property::node::enable_profiling>();
if (PropList.has_property<property::node::depends_on>()) {
auto Deps = PropList.get_property<property::node::depends_on>();
node Node = addImpl(Deps.get_dependencies());
node Node = addImpl(Deps.get_dependencies(), EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
return Node;
}
node Node = addImpl({});
node Node = addImpl({}, EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
Expand All @@ -220,15 +231,17 @@ class __SYCL_EXPORT modifiable_command_graph {
/// @param PropList Property list used to pass [0..n] predecessor nodes.
/// @return Constructed node which has been added to the graph.
template <typename T> node add(T CGF, const property_list &PropList = {}) {
bool EnableProfiling =
PropList.has_property<property::node::enable_profiling>();
if (PropList.has_property<property::node::depends_on>()) {
auto Deps = PropList.get_property<property::node::depends_on>();
node Node = addImpl(CGF, Deps.get_dependencies());
node Node = addImpl(CGF, Deps.get_dependencies(), EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
return Node;
}
node Node = addImpl(CGF, {});
node Node = addImpl(CGF, {}, EnableProfiling);
if (PropList.has_property<property::node::depends_on_all_leaves>()) {
addGraphLeafDependencies(Node);
}
Expand Down Expand Up @@ -303,14 +316,16 @@ class __SYCL_EXPORT modifiable_command_graph {
/// Template-less implementation of add() for CGF nodes.
/// @param CGF Command-group function to add.
/// @param Dep List of predecessor nodes.
/// @param EnableProfiling Enable node profiling.
/// @return Node added to the graph.
node addImpl(std::function<void(handler &)> CGF,
const std::vector<node> &Dep);
node addImpl(std::function<void(handler &)> CGF, const std::vector<node> &Dep,
const bool EnableProfiling);

/// Template-less implementation of add() for empty nodes.
/// @param Dep List of predecessor nodes.
/// @param EnableProfiling Enable node profiling.
/// @return Node added to the graph.
node addImpl(const std::vector<node> &Dep);
node addImpl(const std::vector<node> &Dep, const bool EnableProfiling);

/// Adds all graph leaves as dependencies
/// @param Node Destination node to which the leaves of the graph will be
Expand Down Expand Up @@ -340,8 +355,10 @@ class __SYCL_EXPORT executable_command_graph {
/// Constructor used by internal runtime.
/// @param Graph Detail implementation class to construct with.
/// @param Ctx Context to use for graph.
/// @param PropList Optional list of properties to pass.
executable_command_graph(const std::shared_ptr<detail::graph_impl> &Graph,
const sycl::context &Ctx);
const sycl::context &Ctx,
const property_list &PropList = {});

template <class Obj>
friend decltype(Obj::impl)
Expand Down
11 changes: 3 additions & 8 deletions sycl/plugins/unified_runtime/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -56,14 +56,9 @@ endif()
if(SYCL_PI_UR_USE_FETCH_CONTENT)
include(FetchContent)

set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git")
# commit 3487672ceba0fd3575b5f3f15a832b100dc5fbad
# Author: Artur Gainullin <[email protected]>
# Date: Fri Feb 16 09:59:50 2024 -0800
#
# [UR] Provide flexibility to replace unified-memory-framework repo and tag
set(UNIFIED_RUNTIME_TAG 3487672ceba0fd3575b5f3f15a832b100dc5fbad)

set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git")
set(UNIFIED_RUNTIME_TAG maxime/in-order-cmd-list)

if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO)
set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}")
endif()
Expand Down
8 changes: 5 additions & 3 deletions sycl/plugins/unified_runtime/pi2ur.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -4425,13 +4425,15 @@ piextCommandBufferCreate(pi_context Context, pi_device Device,
ur_context_handle_t UrContext =
reinterpret_cast<ur_context_handle_t>(Context);
ur_device_handle_t UrDevice = reinterpret_cast<ur_device_handle_t>(Device);
const ur_exp_command_buffer_desc_t *UrDesc =
reinterpret_cast<const ur_exp_command_buffer_desc_t *>(Desc);
ur_exp_command_buffer_desc_t UrDesc;
UrDesc.stype = UR_STRUCTURE_TYPE_EXP_COMMAND_BUFFER_DESC;
UrDesc.isInOrder = Desc->is_in_order;
UrDesc.enableProfiling = Desc->enable_profiling;
ur_exp_command_buffer_handle_t *UrCommandBuffer =
reinterpret_cast<ur_exp_command_buffer_handle_t *>(RetCommandBuffer);

HANDLE_ERRORS(
urCommandBufferCreateExp(UrContext, UrDevice, UrDesc, UrCommandBuffer));
urCommandBufferCreateExp(UrContext, UrDevice, &UrDesc, UrCommandBuffer));

return PI_SUCCESS;
}
Expand Down
4 changes: 3 additions & 1 deletion sycl/source/detail/event_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -290,6 +290,8 @@ class event_impl {
return MEventFromSubmittedExecCommandBuffer;
}

void setProfilingEnabled(bool Value) { MIsProfilingEnabled = Value; }

protected:
// When instrumentation is enabled emits trace event for event wait begin and
// returns the telemetry event generated for the wait
Expand All @@ -313,7 +315,7 @@ class event_impl {
std::unique_ptr<HostProfilingInfo> MHostProfilingInfo;
void *MCommand = nullptr;
std::weak_ptr<queue_impl> MQueue;
const bool MIsProfilingEnabled = false;
bool MIsProfilingEnabled = false;
EwanC marked this conversation as resolved.
Show resolved Hide resolved
const bool MFallbackProfiling = false;

std::weak_ptr<queue_impl> MWorkerQueue;
Expand Down
33 changes: 25 additions & 8 deletions sycl/source/detail/graph_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,13 @@ void exec_graph_impl::makePartitions() {
if (Node->MCGType == sycl::detail::CG::CodeplayHostTask) {
HostTaskList.push_back(Node);
}
// Next line is supposed to be temporary.
// Nodes are not profiled individually, but the profiling of the whole graph
// is enabled if at least one node has profiling enabled. This should be
// changed once the PR https://github.com/intel/llvm/pull/12592 on node
// profiling is merged. This also will involve updating all the UR enqueue
// cmd functions to add a new parameter containing the profiling status.
MEnableProfiling |= Node->MProfilingEnabled;
}

// Annotate nodes
Expand Down Expand Up @@ -265,6 +272,7 @@ void exec_graph_impl::makePartitions() {
}
if (Partition->MRoots.size() > 0) {
Partition->schedule();
Partition->checkIfGraphIsSinglePath();
MPartitions.push_back(Partition);
PartitionFinalNum++;
}
Expand Down Expand Up @@ -680,7 +688,10 @@ sycl::detail::pi::PiExtSyncPoint exec_graph_impl::enqueueNode(
void exec_graph_impl::createCommandBuffers(
sycl::device Device, std::shared_ptr<partition> &Partition) {
sycl::detail::pi::PiExtCommandBuffer OutCommandBuffer;
sycl::detail::pi::PiExtCommandBufferDesc Desc{};
sycl::detail::pi::PiExtCommandBufferDesc Desc{
pi_ext_structure_type::PI_EXT_STRUCTURE_TYPE_COMMAND_BUFFER_DESC, nullptr,
pi_bool(Partition->MIsInOrderGraph && !MEnableProfiling),
pi_bool(MEnableProfiling)};
auto ContextImpl = sycl::detail::getSyclObjImpl(MContext);
const sycl::detail::PluginPtr &Plugin = ContextImpl->getPlugin();
auto DeviceImpl = sycl::detail::getSyclObjImpl(Device);
Expand Down Expand Up @@ -942,6 +953,7 @@ exec_graph_impl::enqueue(const std::shared_ptr<sycl::detail::queue_impl> &Queue,
NewEvent->attachEventToComplete(Elem.second);
}
}
NewEvent->setProfilingEnabled(MEnableProfiling);
sycl::event QueueEvent =
sycl::detail::createSyclObjFromImpl<sycl::event>(NewEvent);
return QueueEvent;
Expand Down Expand Up @@ -1102,7 +1114,8 @@ modifiable_command_graph::modifiable_command_graph(
: impl(std::make_shared<detail::graph_impl>(
SyclQueue.get_context(), SyclQueue.get_device(), PropList)) {}

node modifiable_command_graph::addImpl(const std::vector<node> &Deps) {
node modifiable_command_graph::addImpl(const std::vector<node> &Deps,
const bool EnableProfiling) {
impl->throwIfGraphRecordingQueue("Explicit API \"Add()\" function");
std::vector<std::shared_ptr<detail::node_impl>> DepImpls;
for (auto &D : Deps) {
Expand All @@ -1111,11 +1124,13 @@ node modifiable_command_graph::addImpl(const std::vector<node> &Deps) {

graph_impl::WriteLock Lock(impl->MMutex);
std::shared_ptr<detail::node_impl> NodeImpl = impl->add(impl, DepImpls);
NodeImpl->MProfilingEnabled = EnableProfiling;
return sycl::detail::createSyclObjFromImpl<node>(NodeImpl);
}

node modifiable_command_graph::addImpl(std::function<void(handler &)> CGF,
const std::vector<node> &Deps) {
const std::vector<node> &Deps,
const bool EnableProfiling) {
impl->throwIfGraphRecordingQueue("Explicit API \"Add()\" function");
std::vector<std::shared_ptr<detail::node_impl>> DepImpls;
for (auto &D : Deps) {
Expand All @@ -1125,6 +1140,7 @@ node modifiable_command_graph::addImpl(std::function<void(handler &)> CGF,
graph_impl::WriteLock Lock(impl->MMutex);
std::shared_ptr<detail::node_impl> NodeImpl =
impl->add(impl, CGF, {}, DepImpls);
NodeImpl->MProfilingEnabled = EnableProfiling;
return sycl::detail::createSyclObjFromImpl<node>(NodeImpl);
}

Expand Down Expand Up @@ -1152,12 +1168,12 @@ void modifiable_command_graph::make_edge(node &Src, node &Dest) {
}

command_graph<graph_state::executable>
modifiable_command_graph::finalize(const sycl::property_list &) const {
modifiable_command_graph::finalize(const sycl::property_list &PropList) const {
// Graph is read and written in this scope so we lock
// this graph with full priviledges.
graph_impl::WriteLock Lock(impl->MMutex);
return command_graph<graph_state::executable>{this->impl,
this->impl->getContext()};
return command_graph<graph_state::executable>{
this->impl, this->impl->getContext(), PropList};
}

bool modifiable_command_graph::begin_recording(queue &RecordingQueue) {
Expand Down Expand Up @@ -1271,8 +1287,9 @@ std::vector<node> modifiable_command_graph::get_root_nodes() const {
}

executable_command_graph::executable_command_graph(
const std::shared_ptr<detail::graph_impl> &Graph, const sycl::context &Ctx)
: impl(std::make_shared<detail::exec_graph_impl>(Ctx, Graph)) {
const std::shared_ptr<detail::graph_impl> &Graph, const sycl::context &Ctx,
const property_list &PropList)
: impl(std::make_shared<detail::exec_graph_impl>(Ctx, Graph, PropList)) {
finalizeImpl(); // Create backend representation for executable graph
}

Expand Down
Loading