diff --git a/sycl/doc/design/CommandGraph.md b/sycl/doc/design/CommandGraph.md index 56bee306aa9cc..8f5d715a32925 100644 --- a/sycl/doc/design/CommandGraph.md +++ b/sycl/doc/design/CommandGraph.md @@ -250,59 +250,107 @@ there are no parameters to take a wait-list, and the only sync primitive returned is blocking on host. In order to achieve the expected UR command-buffer enqueue semantics with Level -Zero, the adapter implementation adds extra commands to the Level Zero -command-list representing a UR command-buffer. - -* Prefix - Commands added to the start of the L0 command-list by L0 adapter. -* Suffix - Commands added to the end of the L0 command-list by L0 adapter. - -These extra commands operate on L0 event synchronisation primitives, used by the -command-list to interact with the external UR wait-list and UR return event -required for the enqueue interface. - -The `ur_exp_command_buffer_handle_t` class for this adapter contains a -*SignalEvent* which signals the completion of the command-list in the suffix, -and is reset in the prefix. This signal is detected by a new UR return event -created on UR command-buffer enqueue. - -There is also a *WaitEvent* used by the `ur_exp_command_buffer_handle_t` class -in the prefix to wait on any dependencies passed in the enqueue wait-list. -This WaitEvent is reset in the suffix. - -A command-buffer is expected to be submitted multiple times. Consequently, +Zero, the adapter implementation needs extra commands. + +* Prefix - Commands added **before** the graph workload. +* Suffix - Commands added **after** the graph workload. + +These extra commands operate on L0 event synchronisation primitives, +used by the command-list to interact with the external UR wait-list +and UR return event required for the enqueue interface. +Unlike the graph workload (i.e. commands needed to perform the graph workload) +the external UR wait-list and UR return event are submission dependent, +which mean they can change from one submission to the next. + +For performance concerns, the command-list that will execute the graph +workload is made only once (during the command-buffer finalization stage). +This allows the adapter to save time when submitting the command-buffer, +by executing only this command-list (i.e. without enqueuing any commands +of the graph workload). + +#### Prefix + +The prefix's commands aim to: +1. Handle the the list on events to wait on, which is passed by the runtime +when the UR command-buffer enqueue function is called. +As mentioned above, this list of events changes from one submission +to the next. +Consequently, managing this mutable dependency in the graph-workload +command-list implies rebuilding the command-list for each submission +(note that this can change with mutable command-list). +To avoid the signifiant time penalty of rebuilding this potentially large +command-list each time, we prefer to add an extra command handling the +wait list into another command-list (*wait command-list*). +This command-list consists of a single L0 command: a barrier that waits for +dependencies passed by the wait-list and signals a signal +called *WaitEvent* when the barrier is complete. +This *WaitEvent* is defined in the `ur_exp_command_buffer_handle_t` class. +In the front of the graph workload command list, an extra barrier command +waiting for this event is added (when the command-buffer is created). +This ensures that the graph workload does not start running before +the dependencies to be completed. +The *WaitEvent* event is reset in the suffix. + + +2. Reset events associated with the command-buffer except the +*WaitEvent* event. +Indeed, L0 events needs to be explicitly reset by an API call +(L0 command in our case). +Since a command-buffer is expected to be submitted multiple times, we need to ensure that L0 events associated with graph commands have not been signaled by a previous execution. These events are therefore reset to the -non-signaled state before running the actual graph associated commands. Note +non-signaled state before running the graph-workload command-list. Note that this reset is performed in the prefix and not in the suffix to avoid additional synchronization w.r.t profiling data extraction. - -If a command-buffer is about to be submitted to a queue with the profiling -property enabled, an extra command that copies timestamps of L0 events -associated with graph commands into a dedicated memory which is attached to the -returned UR event. This memory stores the profiling information that -corresponds to the current submission of the command-buffer. - -![L0 command-buffer diagram](images/L0_UR_command-buffer-v3.jpg) +We use a new command list (*reset command-list*) for performance concerns. +Indeed: + * This allows the *WaitEvent* to be signaled directly on the host if + the waiting list is empty, thus avoiding the need to submit a command list. + * Enqueuing a reset L0 command for all events in the command-buffer is time + consumming, especially for large graphs. + However, this task is not needed for every submission, but only once, when the + command-buffer is fixed, i.e. when the command-buffer is finalized. The + decorellation between the reset command-list and the wait command-list allow us to + create and enqueue the reset commands when finalizing the command-buffer, + and only create the wait command-list at submission. + +This command list is consist of a reset command for each of the graph commands +and another reset command for resetting the signal we use to signal the completion +of the graph workload. This signal is called *SignalEvent* and is defined in +in the `ur_exp_command_buffer_handle_t` class. + +#### Suffix + +The suffix's commands aim to: +1) Handle the completion of the graph workload and signal +an UR return event. +Thus, at the end of the graph workload command-list a command, which +signals the *SignalEvent*, is added (when the command-buffer is finalized). +In an additional command-list (*signal command-list*), a barrier waiting for +this event is also added. +This barrier signals, in turn, the UR return event that has be defined by +the runtime layer when calling the `urCommandBufferEnqueueExp` function. + +2) Manage the profiling. If a command-buffer is about to be submitted to +a queue with the profiling property enabled, an extra command that copies +timestamps of L0 events associated with graph commands into a dedicated +memory which is attached to the returned UR event. +This memory stores the profiling information that corresponds to +the current submission of the command-buffer. + +![L0 command-buffer diagram](images/L0_UR_command-buffer-v5.jpg) For a call to `urCommandBufferEnqueueExp` with an `event_list` *EL*, -command-buffer *CB*, and return event *RE* our implementation has to submit two -new command-lists for the above approach to work. One before +command-buffer *CB*, and return event *RE* our implementation has to submit +three new command-lists for the above approach to work. Two before the command-list with extra commands associated with *CB*, and the other -after *CB*. These two new command-lists are retrieved from the UR queue, which +after *CB*. These new command-lists are retrieved from the UR queue, which will likely reuse existing command-lists and only create a new one in the worst case. -The L0 command-list created on `urCommandBufferEnqueueExp` to execute **before** -*CB* contains a single command. This command is a barrier on *EL* that signals -*CB*'s *WaitEvent* when completed. - -The L0 command-list created on `urCommandBufferEnqueueExp` to execute **after** -*CB* also contains a single command. This command is a barrier on *CB*'s -*SignalEvent* that signals *RE* when completed. - #### Drawbacks -There are two drawbacks of this approach to implementing UR command-buffers for +There are three drawbacks of this approach to implementing UR command-buffers for Level Zero: 1. 3x the command-list resources are used, if there are many UR command-buffers in diff --git a/sycl/doc/design/images/L0_UR_command-buffer-v3.jpg b/sycl/doc/design/images/L0_UR_command-buffer-v3.jpg deleted file mode 100644 index 5b4ff1c3e9aab..0000000000000 Binary files a/sycl/doc/design/images/L0_UR_command-buffer-v3.jpg and /dev/null differ diff --git a/sycl/doc/design/images/L0_UR_command-buffer-v5.jpg b/sycl/doc/design/images/L0_UR_command-buffer-v5.jpg new file mode 100644 index 0000000000000..81319b2bead42 Binary files /dev/null and b/sycl/doc/design/images/L0_UR_command-buffer-v5.jpg differ diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc index f2a4532b2c81c..d2b4e0e112681 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc @@ -601,6 +601,28 @@ which is used in a graph will be kept alive on the host for the lifetime of the graph. Destroying that buffer during the lifetime of a `command_graph` constructed with this property results in undefined behavior. +===== Enable-Profiling Property [[enable-profiling]] + +[source,c++] +---- +namespace sycl::ext::oneapi::experimental::property::graph { +class enable_profiling { + public: + enable_profiling() = default; +}; +} +---- + +The `property::graph::enable_profiling` property can be passed to the +`command_graph::finalize()` function and enables profiling support +for the returned `command_graph`. +Passing this property to the finalize function implies disabling +certain optimizations to enable graph profiling. +As a result, the execution time of a graph finalized with profiling enabled +is longer than that of a graph without profiling capability. +An error will be thrown when attempting to profile an event +from a graph submission that was created without this property. + ==== Graph Member Functions Table {counter: tableNumber}. Constructor of the `command_graph` class. @@ -804,8 +826,9 @@ Constraints: Parameters: -* `propList` - Optional parameter for passing properties. No finalization - properties are defined by this extension. +* `propList` - Optional parameter for passing properties. The only defined +property is `enable_profiling`. See <> +for more details. Returns: A new executable graph object which can be submitted to a queue. diff --git a/sycl/include/sycl/detail/pi.h b/sycl/include/sycl/detail/pi.h index 56fdeb7a1051b..039cf1ccd9973 100644 --- a/sycl/include/sycl/detail/pi.h +++ b/sycl/include/sycl/detail/pi.h @@ -2324,7 +2324,8 @@ typedef enum { struct pi_ext_command_buffer_desc final { pi_ext_structure_type stype; const void *pNext; - pi_queue_properties *properties; + pi_bool is_in_order; + pi_bool enable_profiling; }; /// API to create a command-buffer. diff --git a/sycl/include/sycl/detail/property_helper.hpp b/sycl/include/sycl/detail/property_helper.hpp index 3009af8ee2890..7743f1e5a370a 100644 --- a/sycl/include/sycl/detail/property_helper.hpp +++ b/sycl/include/sycl/detail/property_helper.hpp @@ -47,8 +47,9 @@ enum DataLessPropKind { GraphAssumeDataOutlivesBuffer = 22, GraphAssumeBufferOutlivesGraph = 23, GraphDependOnAllLeaves = 24, + GraphEnableProfiling = 25, // Indicates the last known dataless property. - LastKnownDataLessPropKind = 24, + LastKnownDataLessPropKind = 25, // Exceeding 32 may cause ABI breaking change on some of OSes. DataLessPropKindSize = 32 }; diff --git a/sycl/include/sycl/ext/oneapi/experimental/graph.hpp b/sycl/include/sycl/ext/oneapi/experimental/graph.hpp index 209a0ed25f72f..1ffbf3c6d83c3 100644 --- a/sycl/include/sycl/ext/oneapi/experimental/graph.hpp +++ b/sycl/include/sycl/ext/oneapi/experimental/graph.hpp @@ -146,6 +146,15 @@ class assume_buffer_outlives_graph public: assume_buffer_outlives_graph() = default; }; + +/// Property used to enable graph profiling. +/// Passing this property to the `command_graph::finalize()` function +/// ensures that profiling can be used on the generated graph. +class enable_profiling : public ::sycl::detail::DataLessProperty< + ::sycl::detail::GraphEnableProfiling> { +public: + enable_profiling() = default; +}; } // namespace graph namespace node { @@ -340,8 +349,11 @@ class __SYCL_EXPORT executable_command_graph { /// Constructor used by internal runtime. /// @param Graph Detail implementation class to construct with. /// @param Ctx Context to use for graph. + /// @param PropList Optional list of properties to pass. executable_command_graph(const std::shared_ptr &Graph, - const sycl::context &Ctx); + const sycl::context &Ctx, + const property_list &PropList = {}); + template friend decltype(Obj::impl) diff --git a/sycl/plugins/unified_runtime/CMakeLists.txt b/sycl/plugins/unified_runtime/CMakeLists.txt index de561b5304014..29e4dfb13127f 100644 --- a/sycl/plugins/unified_runtime/CMakeLists.txt +++ b/sycl/plugins/unified_runtime/CMakeLists.txt @@ -56,15 +56,9 @@ endif() if(SYCL_PI_UR_USE_FETCH_CONTENT) include(FetchContent) - set(UNIFIED_RUNTIME_REPO "https://github.com/oneapi-src/unified-runtime.git") - # commit 93902794c09f1c8cb5163089f0eacfe55c1f1b86 - # Merge: cc268e58 5518b489 - # Author: Kenneth Benzie (Benie) - # Date: Fri Mar 8 09:48:44 2024 +0000 - # Merge pull request #1419 from nrspruit/main_l0_adapter_release_lib - # [L0] Create/Destroy Adapter Handle during lib init - set(UNIFIED_RUNTIME_TAG 93902794c09f1c8cb5163089f0eacfe55c1f1b86) - + set(UNIFIED_RUNTIME_REPO "https://github.com/bensuo/unified-runtime.git") + set(UNIFIED_RUNTIME_TAG maxime/L0-optimizations) + if(SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO) set(UNIFIED_RUNTIME_REPO "${SYCL_PI_UR_OVERRIDE_FETCH_CONTENT_REPO}") endif() diff --git a/sycl/plugins/unified_runtime/pi2ur.hpp b/sycl/plugins/unified_runtime/pi2ur.hpp index 87ee60f41e2da..0fc65f68813cd 100644 --- a/sycl/plugins/unified_runtime/pi2ur.hpp +++ b/sycl/plugins/unified_runtime/pi2ur.hpp @@ -4462,13 +4462,15 @@ piextCommandBufferCreate(pi_context Context, pi_device Device, ur_context_handle_t UrContext = reinterpret_cast(Context); ur_device_handle_t UrDevice = reinterpret_cast(Device); - const ur_exp_command_buffer_desc_t *UrDesc = - reinterpret_cast(Desc); + ur_exp_command_buffer_desc_t UrDesc; + UrDesc.stype = UR_STRUCTURE_TYPE_EXP_COMMAND_BUFFER_DESC; + UrDesc.isInOrder = Desc->is_in_order; + UrDesc.enableProfiling = Desc->enable_profiling; ur_exp_command_buffer_handle_t *UrCommandBuffer = reinterpret_cast(RetCommandBuffer); HANDLE_ERRORS( - urCommandBufferCreateExp(UrContext, UrDevice, UrDesc, UrCommandBuffer)); + urCommandBufferCreateExp(UrContext, UrDevice, &UrDesc, UrCommandBuffer)); return PI_SUCCESS; } diff --git a/sycl/source/detail/event_impl.hpp b/sycl/source/detail/event_impl.hpp index ed5b9ef07ef91..5178eee29f7f6 100644 --- a/sycl/source/detail/event_impl.hpp +++ b/sycl/source/detail/event_impl.hpp @@ -293,6 +293,8 @@ class event_impl { return MPostCompleteEvents; } + void setProfilingEnabled(bool Value) { MIsProfilingEnabled = Value; } + protected: // When instrumentation is enabled emits trace event for event wait begin and // returns the telemetry event generated for the wait @@ -316,7 +318,7 @@ class event_impl { std::unique_ptr MHostProfilingInfo; void *MCommand = nullptr; std::weak_ptr MQueue; - const bool MIsProfilingEnabled = false; + bool MIsProfilingEnabled = false; const bool MFallbackProfiling = false; std::weak_ptr MWorkerQueue; diff --git a/sycl/source/detail/graph_impl.cpp b/sycl/source/detail/graph_impl.cpp index 5b0bc3b8324e9..efdb208421d60 100644 --- a/sycl/source/detail/graph_impl.cpp +++ b/sycl/source/detail/graph_impl.cpp @@ -6,6 +6,8 @@ // //===----------------------------------------------------------------------===// +#define FORCE_IMM_CMD_LIST 0 + #include #include #include @@ -265,6 +267,7 @@ void exec_graph_impl::makePartitions() { } if (Partition->MRoots.size() > 0) { Partition->schedule(); + Partition->checkIfGraphIsSinglePath(); MPartitions.push_back(Partition); PartitionFinalNum++; } @@ -680,7 +683,13 @@ sycl::detail::pi::PiExtSyncPoint exec_graph_impl::enqueueNode( void exec_graph_impl::createCommandBuffers( sycl::device Device, std::shared_ptr &Partition) { sycl::detail::pi::PiExtCommandBuffer OutCommandBuffer; - sycl::detail::pi::PiExtCommandBufferDesc Desc{}; + auto imm_cmd_list = (Partition->MIsInOrderGraph && !MEnableProfiling); +#if FORCE_IMM_CMD_LIST == 1 + imm_cmd_list = true; +#endif + sycl::detail::pi::PiExtCommandBufferDesc Desc{ + pi_ext_structure_type::PI_EXT_STRUCTURE_TYPE_COMMAND_BUFFER_DESC, nullptr, + pi_bool(imm_cmd_list), pi_bool(MEnableProfiling)}; auto ContextImpl = sycl::detail::getSyclObjImpl(MContext); const sycl::detail::PluginPtr &Plugin = ContextImpl->getPlugin(); auto DeviceImpl = sycl::detail::getSyclObjImpl(Device); @@ -950,6 +959,9 @@ exec_graph_impl::enqueue(const std::shared_ptr &Queue, NewEvent->attachEventToComplete(Elem.second); } } + if (!MEnableProfiling) { + NewEvent->setProfilingEnabled(false); + } sycl::event QueueEvent = sycl::detail::createSyclObjFromImpl(NewEvent); return QueueEvent; @@ -1160,12 +1172,12 @@ void modifiable_command_graph::make_edge(node &Src, node &Dest) { } command_graph -modifiable_command_graph::finalize(const sycl::property_list &) const { +modifiable_command_graph::finalize(const sycl::property_list &PropList) const { // Graph is read and written in this scope so we lock // this graph with full priviledges. graph_impl::WriteLock Lock(impl->MMutex); - return command_graph{this->impl, - this->impl->getContext()}; + return command_graph{ + this->impl, this->impl->getContext(), PropList}; } bool modifiable_command_graph::begin_recording(queue &RecordingQueue) { @@ -1279,8 +1291,9 @@ std::vector modifiable_command_graph::get_root_nodes() const { } executable_command_graph::executable_command_graph( - const std::shared_ptr &Graph, const sycl::context &Ctx) - : impl(std::make_shared(Ctx, Graph)) { + const std::shared_ptr &Graph, const sycl::context &Ctx, + const property_list &PropList) + : impl(std::make_shared(Ctx, Graph, PropList)) { finalizeImpl(); // Create backend representation for executable graph } diff --git a/sycl/source/detail/graph_impl.hpp b/sycl/source/detail/graph_impl.hpp index 6793ab0b2229f..e12e50cd34476 100644 --- a/sycl/source/detail/graph_impl.hpp +++ b/sycl/source/detail/graph_impl.hpp @@ -331,6 +331,24 @@ class node_impl { } } + /// Test if the node contains a N-D copy + /// @return true if the op is a N-D copy + bool isNDCopyNode() const { + if ((MCGType == sycl::detail::CG::CGTYPE::CopyAccToAcc) || + (MCGType == sycl::detail::CG::CGTYPE::CopyAccToPtr) || + (MCGType == sycl::detail::CG::CGTYPE::CopyPtrToAcc)) { + sycl::detail::CGCopy *Copy = (sycl::detail::CGCopy *)MCommandGroup.get(); + sycl::detail::Requirement *ReqSrc = + (sycl::detail::Requirement *)(Copy->getSrc()); + sycl::detail::Requirement *ReqDst = + (sycl::detail::Requirement *)(Copy->getDst()); + if ((ReqSrc->MDims > 1) || (ReqDst->MDims > 1)) { + return true; + } + } + return false; + } + private: /// Prints Node information to Stream. /// @param Stream Where to print the Node information @@ -550,6 +568,9 @@ class partition { MPiCommandBuffers; /// List of predecessors to this partition. std::vector> MPredecessors; + /// True if the graph of this partition is a single path graph + /// and in-order optmization can be applied on it. + bool MIsInOrderGraph = false; /// @return True if the partition contains a host task bool isHostTask() const { @@ -557,6 +578,25 @@ class partition { sycl::detail::CG::CGTYPE::CodeplayHostTask)); } + /// Checks if the graph is single path, i.e. each node has a single successor. + /// If so, the MIsInOrderGraph flag is set. + void checkIfGraphIsSinglePath() { + MIsInOrderGraph = true; + if (MRoots.size() > 1) { + MIsInOrderGraph = false; + return; + } + for (const auto &Node : MSchedule) { + // In version 1.3.28454 of the L0 driver, 2D Copy ops cannot not + // be enqueued in an in-order cmd-list (causing execution to stall). + // The 2D Copy test should be removed from here when the bug is fixed. + if ((Node->MSuccessors.size() > 1) || (Node->isNDCopyNode())) { + MIsInOrderGraph = false; + return; + } + } + } + /// Add nodes to MSchedule. void schedule(); }; @@ -1021,10 +1061,14 @@ class exec_graph_impl { /// nodes). /// @param Context Context to create graph with. /// @param GraphImpl Modifiable graph implementation to create with. + /// @param PropList List of properties for constructing this object. exec_graph_impl(sycl::context Context, - const std::shared_ptr &GraphImpl) + const std::shared_ptr &GraphImpl, + const property_list &PropList) : MSchedule(), MGraphImpl(GraphImpl), MPiSyncPoints(), MContext(Context), - MRequirements(), MExecutionEvents() { + MRequirements(), MExecutionEvents(), + MEnableProfiling( + PropList.has_property()) { // Copy nodes from GraphImpl and merge any subgraph nodes into this graph. duplicateNodes(); } @@ -1192,6 +1236,8 @@ class exec_graph_impl { std::vector> MPartitions; /// Storage for copies of nodes from the original modifiable graph. std::vector> MNodeStorage; + /// If true, the graph profiling is enabled. + bool MEnableProfiling = false; }; } // namespace detail diff --git a/sycl/test-e2e/Graph/event_profiling_info.cpp b/sycl/test-e2e/Graph/event_profiling_info.cpp index c94117ea655ec..24fc478223caa 100644 --- a/sycl/test-e2e/Graph/event_profiling_info.cpp +++ b/sycl/test-e2e/Graph/event_profiling_info.cpp @@ -130,8 +130,12 @@ int main() { KernelGraph.end_recording(Queue); - auto CopyGraphExec = CopyGraph.finalize(); - auto KernelGraphExec = KernelGraph.finalize(); + // The `enable_profiling` property must be passed to the finalize function + // in order to query profiling information. + auto CopyGraphExec = + CopyGraph.finalize(exp_ext::property::graph::enable_profiling{}); + auto KernelGraphExec = + KernelGraph.finalize(exp_ext::property::graph::enable_profiling{}); event CopyEvent, KernelEvent1, KernelEvent2; // Run graphs diff --git a/sycl/unittests/Extensions/CommandGraph/Exceptions.cpp b/sycl/unittests/Extensions/CommandGraph/Exceptions.cpp index dd687551355e1..51fc44466691c 100644 --- a/sycl/unittests/Extensions/CommandGraph/Exceptions.cpp +++ b/sycl/unittests/Extensions/CommandGraph/Exceptions.cpp @@ -534,3 +534,31 @@ TEST_F(CommandGraphTest, ProfilingException) { std::string::npos); } } + +TEST_F(CommandGraphTest, ProfilingExceptionProperty) { + Graph.begin_recording(Queue); + auto Event1 = Queue.submit( + [&](sycl::handler &cgh) { cgh.single_task>([]() {}); }); + Graph.end_recording(Queue); + + // Checks exception thrown if profiling is requested while the + // enable_profiling property has not been passed to `finalize()`. + auto GraphExecInOrder = Graph.finalize(); + queue QueueProfile{Dev, {sycl::property::queue::enable_profiling()}}; + auto EventInOrder = QueueProfile.submit( + [&](handler &CGH) { CGH.ext_oneapi_graph(GraphExecInOrder); }); + QueueProfile.wait_and_throw(); + bool Success = true; + try { + EventInOrder + .get_profiling_info(); + } catch (sycl::exception &Exception) { + ASSERT_FALSE(std::string(Exception.what()) + .find("Profiling information is unavailable as the queue " + "associated with the event does not have the " + "'enable_profiling' property.") == + std::string::npos); + Success = false; + } + ASSERT_EQ(Success, false); +}