diff --git a/.wordlist.txt b/.wordlist.txt
index b3b8686678..a7955394f8 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -7,7 +7,7 @@ APUs
 AQL
 AXPY
 asm
-Asynchrony
+asynchrony
 backtrace
 Bitcode
 bitcode
@@ -118,6 +118,7 @@ overindexing
 oversubscription
 overutilized
 parallelizable
+parallelized
 pixelated
 pragmas
 preallocated
diff --git a/docs/how-to/hip_runtime_api/asynchronous.rst b/docs/how-to/hip_runtime_api/asynchronous.rst
index 384d43e688..e7cc62849b 100644
--- a/docs/how-to/hip_runtime_api/asynchronous.rst
+++ b/docs/how-to/hip_runtime_api/asynchronous.rst
@@ -1,6 +1,6 @@
 .. meta::
   :description: This topic describes asynchronous concurrent execution in HIP
-  :keywords: AMD, ROCm, HIP, asynchronous concurrent execution, asynchronous, async
+  :keywords: AMD, ROCm, HIP, asynchronous concurrent execution, asynchronous, async, concurrent, concurrency
 
 .. _asynchronous_how-to:
 
@@ -71,25 +71,25 @@ efficiency.
 Overlap of data transfer and kernel execution
 ===============================================================================
 
-One of the primary benefits of asynchronous operations is the ability to overlap
-data transfer with kernel execution, leading to better resource utilization and
-improved performance.
+One of the primary benefits of asynchronous operations is the ability to
+overlap data transfer with kernel execution, leading to better resource
+utilization and improved performance.
 
 Querying device capabilities
 -------------------------------------------------------------------------------
 
 Some AMD HIP-enabled devices can perform asynchronous memory copy operations to
 or from the GPU concurrently with kernel execution. Applications can query this
-capability by checking the ``asyncEngineCount`` device property. Devices with an
-``asyncEngineCount`` greater than zero support concurrent data transfers.
+capability by checking the ``asyncEngineCount`` device property. Devices with
+an ``asyncEngineCount`` greater than zero support concurrent data transfers.
 Additionally, if host memory is involved in the copy, it should be page-locked
 to ensure optimal performance.
 
 Asynchronous memory operations
 -------------------------------------------------------------------------------
 
-Asynchronous memory operations allow data to be transferred between the host and
-device while kernels are being executed on the GPU. Using operations like
+Asynchronous memory operations allow data to be transferred between the host
+and device while kernels are being executed on the GPU. Using operations like
 :cpp:func:`hipMemcpyAsync`, developers can initiate data transfers without
 waiting for the previous operation to complete. This overlap of computation and
 data transfer ensures that the GPU is not idle while waiting for data. Examples
@@ -108,10 +108,10 @@ without blocking other operations.
 :cpp:func:`hipMemcpyPeerAsync` enables data transfers between different GPUs,
 facilitating multi-GPU communication. Concurrent data transfers are important
 for applications that require frequent and large data movements. By
 overlapping data transfers with computation,
-developers can minimize idle times and enhance performance. Proper management of
-data transfers can lead to efficient utilization of the memory bandwidth and
-reduce bottlenecks. This is particularly important for applications that need to
-handle large volumes of data efficiently.
+developers can minimize idle times and enhance performance. Proper management
+of data transfers can lead to efficient utilization of the memory bandwidth and
+reduce bottlenecks. This is particularly important for applications that need
+to handle large volumes of data efficiently.
 
 Concurrent data transfers with intra-device copies
 -------------------------------------------------------------------------------
@@ -134,16 +134,17 @@ Synchronous calls
 -------------------------------------------------------------------------------
 
 Despite the benefits of asynchronous operations, there are scenarios where
-synchronous calls are necessary. Synchronous calls ensure task completion before
-moving to the next operation, crucial for data consistency and correct execution
-order. For example, :cpp:func:`hipMemcpy` for data transfers waits for
-completion before returning control to the host. Similarly, synchronous kernel
-launches are used when immediate completion is required. When a synchronous
-function is called, control is not returned to the host thread before the device
-has completed the requested task. The behavior of the host thread—whether to
-yield, block, or spin—can be specified using ``hipSetDeviceFlags`` with specific
-flags. Understanding when to use synchronous calls is crucial for managing
-execution flow and avoiding data races.
+synchronous calls are necessary. Synchronous calls ensure task completion
+before moving to the next operation, crucial for data consistency and correct
+execution order. For example, :cpp:func:`hipMemcpy` for data transfers waits
+for completion before returning control to the host. Similarly, synchronous
+kernel launches are used when immediate completion is required. When a
+synchronous function is called, control is not returned to the host thread
+before the device has completed the requested task. The behavior of the host
+thread—whether to yield, block, or spin—can be specified using
+``hipSetDeviceFlags`` with specific flags. Understanding when to use
+synchronous calls is crucial for managing execution flow and avoiding data
+races.
 
 Events for synchronization
 -------------------------------------------------------------------------------
@@ -253,8 +254,6 @@ Example
             return 0;
          }
 
-.. tab-set::
-
    .. tab-item:: hipStreamWaitEvent
 
       .. code-block:: cpp
@@ -346,8 +345,6 @@ Example
             return 0;
          }
 
-.. tab-set::
-
    .. tab-item:: sequential
 
       .. code-block:: cpp
@@ -416,12 +413,12 @@ HIP Graphs
 HIP Graphs provide a way to represent complex workflows as a series of
 interconnected tasks. By creating and managing graphs, developers can optimize
 dependent task execution. Graphs reduce the overhead associated with launching
-individual kernels and memory operations, providing a high-level abstraction for
-managing dependencies and synchronizing tasks. Examples include representing a
-sequence of kernels and memory operations as a single graph. Using graphs
-enhances performance and simplifies complex workflow management. This technique
-is particularly useful for applications with intricate dependencies and multiple
-execution stages.
+individual kernels and memory operations, providing a high-level abstraction
+for managing dependencies and synchronizing tasks. Examples include
+representing a sequence of kernels and memory operations as a single graph.
+Using graphs enhances performance and simplifies complex workflow management.
+This technique is particularly useful for applications with intricate
+dependencies and multiple execution stages.
 
 For more details, see the :ref:`how_to_HIP_graph` documentation.
 
@@ -430,10 +427,10 @@ Example
 
 This example demonstrates the use of HIP Graphs to manage asynchronous
 concurrent execution of two kernels. It creates a graph with nodes for the
-kernel executions and memory copies, which are then instantiated and launched in
-two separate streams. This setup ensures efficient and concurrent execution,
-leveraging the high-level abstraction of HIP Graphs to simplify the workflow and
-improve performance.
+kernel executions and memory copies, which are then instantiated and launched
+in two separate streams. This setup ensures efficient and concurrent execution,
+leveraging the high-level abstraction of HIP Graphs to simplify the workflow
+and improve performance.
 
 .. code-block:: cpp
 
@@ -559,12 +556,12 @@ achieving optimal performance. Here are some key strategies to consider:
 - minimize synchronization overhead: Synchronize only when necessary to avoid
   stalling the GPU and hindering parallelism.
 
-- leverage asynchronous operations: Use asynchronous memory transfers and kernel
-  launches to overlap computation and data transfer, maximizing resource
+- leverage asynchronous operations: Use asynchronous memory transfers and
+  kernel launches to overlap computation and data transfer, maximizing resource
   utilization.
 
-- balance workloads: Distribute tasks efficiently between the host and device to
-  ensure both are fully utilized. This can significantly enhance application
+- balance workloads: Distribute tasks efficiently between the host and device
+  to ensure both are fully utilized. This can significantly enhance application
   responsiveness and performance.
 
 - utilize multiple streams: Create and manage multiple streams to run commands
@@ -602,7 +599,7 @@ Key profiling metrics include:
   identify opportunities to improve concurrency and reduce idle times.
 
 Using profiling tools, developers gain a comprehensive understanding of their
-application's performance characteristics, making informed decisions about where
-to focus optimization efforts. Regular profiling and adjustments ensure that
-applications run
-at their best, maintaining high efficiency and performance.
\ No newline at end of file
+application's performance characteristics, making informed decisions about
+where to focus optimization efforts. Regular profiling and adjustments ensure
+that applications run at their best, maintaining high efficiency and
+performance.
\ No newline at end of file