From f81588d88f706685ef8a0f8ef642f6834ea56a27 Mon Sep 17 00:00:00 2001
From: Istvan Kiss <neon60@gmail.com>
Date: Mon, 2 Dec 2024 09:46:33 +0100
Subject: [PATCH] Rewrite some sections

WIP

WIP
---
 docs/how-to/hip_runtime_api/asynchronous.rst | 327 ++++---------------
 1 file changed, 67 insertions(+), 260 deletions(-)

diff --git a/docs/how-to/hip_runtime_api/asynchronous.rst b/docs/how-to/hip_runtime_api/asynchronous.rst
index 62791361de..6572624ffd 100644
--- a/docs/how-to/hip_runtime_api/asynchronous.rst
+++ b/docs/how-to/hip_runtime_api/asynchronous.rst
@@ -8,36 +8,43 @@
 Asynchronous concurrent execution
 *******************************************************************************
 
-Asynchronous concurrent execution empowers developers to achieve efficient
-parallelism and resource utilization. By understanding and implementing key
-concepts and best practices, significant performance improvements are within
-reach. Techniques such as overlapping computation and data transfer, managing
-concurrent kernel execution, and utilizing graphs offer a robust framework for
-optimizing GPU performance. As GPU technology evolves, the principles of
-asynchronous execution remain critical for achieving high throughput and low
-latency. Developers are encouraged to explore and experiment with these
-techniques to fully harness their potential.
+Asynchronous concurrent execution important for efficient parallelism and
+resource utilization, with techniques such as overlapping computation and data
+transfer, managing concurrent kernel execution with streams on single or
+multiple devices or using HIP graphs.
 
 Streams and concurrent execution
 ===============================================================================
 
-Streams play a crucial role in managing the execution order of kernels and
-memory operations on the GPU. By utilizing streams, developers can ensure
-efficient execution of tasks, leading to improved performance and resource
-utilization.
+All asynchronous APIs, such as kernel execution, data movement and potentially
+data allocation/freeing all happen in the context of device streams.
+
+Streams are FIFO buffers of commands to execute in order on a given device.
+Commands which enqueue tasks on a stream all return promptly and the command is
+executed asynchronously. Multiple streams may point to the same device and those
+streams may be fed from multiple concurrent host-side threads. Execution on
+multiple streams may be concurrent but isn't required to be.
 
 Managing streams
 -------------------------------------------------------------------------------
 
-A stream is a sequence of commands that execute in order on the GPU, but
-commands in different streams can run concurrently. Streams enable the overlap
-of computation and data transfer, ensuring continuous GPU activity. To create a
-stream, the function :cpp:func:`hipStreamCreate` is used, returning a handle to
-the newly created stream. Assigning different operations to different streams
-allows multiple tasks to run simultaneously, improving overall performance.
-Proper management of streams is crucial for effective asynchrony. Using streams
-wisely can significantly reduce idle times and enhance the efficiency of GPU
-applications.
+Streams enable the overlap of computation and data transfer, ensuring continuous
+GPU activity. To create a stream, the :cpp:func:`hipStreamCreate`, 
+:cpp:func:`hipStreamCreateWithFlags` and :cpp:func:`hipStreamCreateWithPriority`
+functions are used, returning a handle to the newly created stream. Assigning
+different operations to different streams allows multiple tasks to run
+simultaneously, improving overall performance.
+
+The :cpp:func:`hipStreamSynchronize` function in the HIP API is used to block
+the calling host thread until all previously submitted tasks in a specified HIP
+stream have completed. It ensures that all operations in the given stream, such
+as kernel executions or memory transfers, are finished before the host thread
+proceeds.
+
+.. note::
+
+  If the :cpp:func:`hipStreamSynchronize` function input stream is 0 (or the
+  default stream), it waits for all operations in the default stream to complete.
 
 Concurrent execution between host and device
 -------------------------------------------------------------------------------
@@ -48,25 +55,23 @@ asynchronously using ``hipLaunchKernelDefault`` with a stream, enabling the CPU
 to continue executing other code while the GPU processes the kernel. Similarly,
 memory operations like :cpp:func:`hipMemcpyAsync` can be performed
 asynchronously, allowing data transfers between the host and device without
-blocking the CPU. This concurrent execution model is vital for achieving high
-performance, as it ensures that both the CPU and GPU are utilized efficiently.
-By distributing workloads between the host and device, developers can
-significantly reduce execution time and improve application responsiveness.
+blocking the CPU.
 
 Concurrent kernel execution
 -------------------------------------------------------------------------------
 
 Concurrent execution of multiple kernels on the GPU allows different kernels to
-run simultaneously, leveraging the parallel processing capabilities of the GPU.
-Utilizing multiple streams enables developers to launch kernels concurrently,
-maximizing GPU resource usage. Managing dependencies between kernels is crucial
-for ensuring correct execution order. This can be achieved using
-:cpp:func:`hipStreamWaitEvent`, which allows a kernel to wait for a specific
-event before starting execution. Proper management of concurrent kernel
-execution can lead to significant performance gains, particularly in
-applications with independent tasks that can be parallelized. By maximizing the
-utilization of GPU cores, developers can achieve higher throughput and
-efficiency.
+run simultaneously to maximize GPU resource usage. Managing dependencies between
+kernels is crucial for ensuring correct execution order. This can be achieved
+using :cpp:func:`hipStreamWaitEvent`, which allows a kernel to wait for a
+specific event before starting execution. 
+
+Independent kernels can only run concurrently, if there are enough registers and
+share memories for the kernels. To reach concurrent kernel executions, the
+developer may have to reduce the block size of the kernels. The kernel runtimes
+can be misleading at concurrent kernel runs, that's why during optimization it's
+better to check the trace files, to see if a kernel is blocking another kernel,
+while they are running parallel.
 
 Overlap of data transfer and kernel execution
 ===============================================================================
@@ -94,9 +99,8 @@ and device while kernels are being executed on the GPU. Using operations like
 waiting for the previous operation to complete. This overlap of computation and
 data transfer ensures that the GPU is not idle while waiting for data. Examples
 include launching kernels in one stream while performing data transfers in
-another. By carefully orchestrating these operations, developers can achieve
-significant performance improvements. This technique is especially useful in
-applications with large data sets that need to be processed quickly.
+another. This technique is especially useful in applications with large data
+sets that need to be processed quickly.
 
 Concurrent data transfers
 -------------------------------------------------------------------------------
@@ -108,10 +112,9 @@ without blocking other operations. :cpp:func:`hipMemcpyPeerAsync` enables data
 transfers between different GPUs, facilitating multi-GPU communication.
 Concurrent data transfers are important for applications that require frequent
 and large data movements. By overlapping data transfers with computation,
-developers can minimize idle times and enhance performance. Proper management
-of data transfers can lead to efficient utilization of the memory bandwidth and
-reduce bottlenecks. This is particularly important for applications that need
-to handle large volumes of data efficiently.
+developers can minimize idle times and enhance performance. This is particularly
+important for applications that need to handle large volumes of data
+efficiently.
 
 Concurrent data transfers with intra-device copies
 -------------------------------------------------------------------------------
@@ -126,39 +129,30 @@ the same device.
 Synchronization, event management and synchronous calls
 ===============================================================================
 
-Synchronization and event management are crucial for coordinating tasks and
-ensuring correct execution order, while synchronous calls are sometimes
+Synchronization and event management are important for coordinating tasks and
+ensuring correct execution order, and synchronous calls are
 necessary for maintaining data consistency.
 
 Synchronous calls
 -------------------------------------------------------------------------------
 
-Despite the benefits of asynchronous operations, there are scenarios where
-synchronous calls are necessary. Synchronous calls ensure task completion
-before moving to the next operation, crucial for data consistency and correct
-execution order. For example, :cpp:func:`hipMemcpy` for data transfers waits
-for completion before returning control to the host. Similarly, synchronous
-kernel launches are used when immediate completion is required. When a
-synchronous function is called, control is not returned to the host thread
-before the device has completed the requested task. The behavior of the host
-thread—whether to yield, block, or spin—can be specified using
-``hipSetDeviceFlags`` with specific flags. Understanding when to use
-synchronous calls is crucial for managing execution flow and avoiding data
-races.
+Synchronous calls ensure task completion before moving to the next operation.
+For example, :cpp:func:`hipMemcpy` for data transfers waits for completion
+before returning control to the host. Similarly, synchronous kernel launches are
+used when immediate completion is required. When a synchronous function is
+called, control is not returned to the host thread before the device has
+completed the requested task. The behavior of the host thread—whether to yield,
+block, or spin—can be specified using ``hipSetDeviceFlags`` with specific flags.
+Understanding when to use synchronous calls is important for managing execution
+flow and avoiding data races.
 
 Events for synchronization
 -------------------------------------------------------------------------------
 
-Events are critical for synchronization and tracking asynchronous operation
-progress. By creating an event with :cpp:func:`hipEventCreate` and recording it
-with ``hipEventRecord``, developers can synchronize operations across streams,
+By creating an event with :cpp:func:`hipEventCreate` and recording it with
+``hipEventRecord``, developers can synchronize operations across streams,
 ensuring correct task execution order. :cpp:func:`hipEventSynchronize` allows
 waiting for an event to complete before proceeding with the next operation.
-Events are vital for managing dependencies and maintaining data consistency.
-Proper event usage avoids data races and ensures efficient task execution.
-Leveraging events effectively can optimize performance and enhance the overall
-efficiency of GPU applications.
-
 
 Programmatic dependent launch and synchronization
 -------------------------------------------------------------------------------
@@ -168,10 +162,7 @@ to start before the primary kernel finishes, HIP achieves similar functionality
 using streams and events. By employing :cpp:func:`hipStreamWaitEvent`, it is
 possible to manage the execution order without explicit hardware support. This
 mechanism allows a secondary kernel to launch as soon as the necessary
-conditions are met, even if the primary kernel is still running. Such an
-approach optimizes resource utilization and improves performance by efficiently
-overlapping operations, especially in complex applications with interdependent
-tasks.
+conditions are met, even if the primary kernel is still running. 
 
 Example
 -------------------------------------------------------------------------------
@@ -423,196 +414,12 @@ Example
 HIP Graphs
 ===============================================================================
 
-HIP Graphs provide a way to represent complex workflows as a series of
-interconnected tasks. By creating and managing graphs, developers can optimize
-dependent task execution. Graphs reduce the overhead associated with launching
-individual kernels and memory operations, providing a high-level abstraction
-for managing dependencies and synchronizing tasks. Examples include
-representing a sequence of kernels and memory operations as a single graph.
-Using graphs enhances performance and simplifies complex workflow management.
-This technique is particularly useful for applications with intricate
-dependencies and multiple execution stages.
+HIP graphs offer an efficient alternative to the standard method of launching
+GPU tasks via streams. Comprising nodes for operations and edges for
+dependencies, HIP graphs reduce kernel launch overhead and provide a high-level
+abstraction for managing dependencies and synchronization. By representing
+sequences of kernels and memory operations as a single graph, they simplify
+complex workflows and enhance performance, particularly for applications with
+intricate dependencies and multiple execution stages.
 
 For more details, see the :ref:`how_to_HIP_graph` documentation.
-
-Example
--------------------------------------------------------------------------------
-
-This example demonstrates the use of HIP Graphs to manage asynchronous
-concurrent execution of two kernels. It creates a graph with nodes for the
-kernel executions and memory copies, which are then instantiated and launched
-in two separate streams. This setup ensures efficient and concurrent execution,
-leveraging the high-level abstraction of HIP Graphs to simplify the workflow
-and improve performance.
-
-.. code-block:: cpp
-
-  #include <hip/hip_runtime.h>
-  #include <iostream>
-
-  __global__ void kernel(int *data, int value)
-  {
-      int idx = threadIdx.x + blockIdx.x * blockDim.x;
-      data[idx] = value;
-  }
-
-  int main()
-  {
-      constexpr int N = 1024;
-
-      int *d_data1, *d_data2;
-      int h_data1[N], h_data2[N];
-
-      hipGraph_t graph;
-      hipGraphExec_t graphExec;
-      hipStream_t stream1, stream2;
-      hipGraphNode_t kernelNode1, kernelNode2, memcpyNode1, memcpyNode2;
-      hipKernelNodeParams kernelNodeParams1 = {0};
-      hipKernelNodeParams kernelNodeParams2 = {0};
-      hipMemcpy3DParms memcpyParams1        = {0};
-      hipMemcpy3DParms memcpyParams2        = {0};
-
-      // Allocate device memory
-      hipMalloc(&d_data1, N * sizeof(*d_data1));
-      hipMalloc(&d_data2, N * sizeof(*d_data2));
-
-      // Create streams
-      hipStreamCreate(&stream1);
-      hipStreamCreate(&stream2);
-
-      // Create an empty graph
-      hipGraphCreate(&graph, 0);
-
-      // Define kernel1 node parameters
-      void *kernelArgs1[]              = {&d_data1, &N};
-      kernelNodeParams1.func           = reinterpret_cast<void *>(kernel);
-      kernelNodeParams1.gridDim        = dim3(N / 256);
-      kernelNodeParams1.blockDim       = dim3(256);
-      kernelNodeParams1.sharedMemBytes = 0;
-      kernelNodeParams1.kernelParams   = kernelArgs1;
-      kernelNodeParams1.extra          = nullptr;
-
-      // Define kernel2 node parameters
-      void *kernelArgs2[]              = {&d_data2, &N};
-      kernelNodeParams2.func           = reinterpret_cast<void *>(kernel);
-      kernelNodeParams2.gridDim        = dim3(N / 256);
-      kernelNodeParams2.blockDim       = dim3(256);
-      kernelNodeParams2.sharedMemBytes = 0;
-      kernelNodeParams2.kernelParams   = kernelArgs2;
-      kernelNodeParams2.extra          = nullptr;
-
-      // Add kernel nodes to graph
-      hipGraphAddKernelNode(&kernelNode1, graph, nullptr, 0, &kernelNodeParams1);
-      hipGraphAddKernelNode(&kernelNode2, graph, nullptr, 0, &kernelNodeParams2);
-
-      // Define memcpy node parameters for stream1
-      memcpyParams1.srcArray = nullptr;
-      memcpyParams1.srcPos   = make_hipPos(0, 0, 0);
-      memcpyParams1.dstArray = nullptr;
-      memcpyParams1.dstPos   = make_hipPos(0, 0, 0);
-      memcpyParams1.extent   = make_hipExtent(N * sizeof(*d_data1), 1, 1);
-      memcpyParams1.kind     = hipMemcpyDeviceToHost;
-      memcpyParams1.srcPtr   = make_hipPitchedPtr(d_data1, N * sizeof(*d_data1), N, 1);
-      memcpyParams1.dstPtr   = make_hipPitchedPtr(h_data1, N * sizeof(*d_data1), N, 1);
-
-      // Define memcpy node parameters for stream2
-      memcpyParams2.srcArray = nullptr;
-      memcpyParams2.srcPos   = make_hipPos(0, 0, 0);
-      memcpyParams2.dstArray = nullptr;
-      memcpyParams2.dstPos   = make_hipPos(0, 0, 0);
-      memcpyParams2.extent   = make_hipExtent(N * sizeof(*d_data2), 1, 1);
-      memcpyParams2.kind     = hipMemcpyDeviceToHost;
-      memcpyParams2.srcPtr   = make_hipPitchedPtr(d_data2, N * sizeof(*d_data2), N, 1);
-      memcpyParams2.dstPtr   = make_hipPitchedPtr(h_data2, N * sizeof(*d_data2), N, 1);
-
-      // Add memcpy nodes to graph
-      hipGraphAddMemcpyNode(&memcpyNode1, graph, &kernelNode1, 1, &memcpyParams1);
-      hipGraphAddMemcpyNode(&memcpyNode2, graph, &kernelNode2, 1, &memcpyParams2);
-
-      // Instantiate the graph
-      hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
-
-      // Launch the graph asynchronously in different streams
-      hipGraphLaunch(graphExec, stream1);
-      hipGraphLaunch(graphExec, stream2);
-
-      // Wait for all operations in both streams to complete
-      hipStreamSynchronize(stream1);
-      hipStreamSynchronize(stream2);
-
-      // Cleanup
-      hipGraphExecDestroy(graphExec);
-      hipGraphDestroy(graph);
-      hipStreamDestroy(stream1);
-      hipStreamDestroy(stream2);
-      hipFree(d_data1);
-      hipFree(d_data2);
-
-      std::cout << "Graph executed with asynchronous concurrent execution." << std::endl;
-
-      return 0;
-  }
-
-Best practices and performance optimization
-===============================================================================
-
-Achieving optimal performance in GPU-accelerated applications involves adhering
-to best practices and continuously tuning the code to ensure efficient resource
-utilization.
-
-Implementing best practices
--------------------------------------------------------------------------------
-
-Following best practices for managing asynchronous operations is crucial for
-achieving optimal performance. Here are some key strategies to consider:
-
-- minimize synchronization overhead: Synchronize only when necessary to avoid
-  stalling the GPU and hindering parallelism.
-
-- leverage asynchronous operations: Use asynchronous memory transfers and
-  kernel launches to overlap computation and data transfer, maximizing resource
-  utilization.
-
-- balance workloads: Distribute tasks efficiently between the host and device
-  to ensure both are fully utilized. This can significantly enhance application
-  responsiveness and performance.
-
-- utilize multiple streams: Create and manage multiple streams to run commands
-  concurrently, reducing idle times and improving overall efficiency.
-
-By implementing these strategies, developers can significantly enhance
-application responsiveness and overall performance. These best practices are
-essential for effective asynchronous operation management and for fully
-leveraging the capabilities of modern GPUs.
-
-Balancing and profiling
--------------------------------------------------------------------------------
-
-Profiling tools help identify bottlenecks by providing detailed insights into
-the execution of GPU-accelerated applications. These tools allow developers to
-visualize how computational tasks and memory operations are distributed across
-different hardware resources. By analyzing these visualizations, developers can
-pinpoint areas where the application may be spending excessive time, such as
-during synchronization points or data transfers.
-
-Key profiling metrics include:
-
-- kernel execution time: Measuring the time spent executing each kernel helps
-  identify which kernels are taking longer than expected and may need
-  optimization.
-
-- memory transfer time: Assessing the duration of data transfers between the
-  host and device can highlight inefficiencies or bottlenecks in memory
-  operations.
-
-- stream utilization: Evaluating how streams are utilized can reveal whether
-  resources are being used effectively or if some streams are underutilized.
-
-- concurrency: Analyzing the overlap of computation and data transfers helps
-  identify opportunities to improve concurrency and reduce idle times.
-
-Using profiling tools, developers gain a comprehensive understanding of their
-application's performance characteristics, making informed decisions about
-where to focus optimization efforts. Regular profiling and adjustments ensure
-that applications run at their best, maintaining high efficiency and
-performance.
\ No newline at end of file