
Commit

WIP
matyas-streamhpc authored and neon60 committed Nov 28, 2024
1 parent 3b1bc91 commit b1fa2f9
Showing 2 changed files with 43 additions and 45 deletions.
3 changes: 2 additions & 1 deletion .wordlist.txt
APUs
AQL
AXPY
asm
asynchrony
backtrace
Bitcode
bitcode
overindexing
oversubscription
overutilized
parallelizable
parallelized
pixelated
pragmas
preallocated
85 changes: 41 additions & 44 deletions docs/how-to/hip_runtime_api/asynchronous.rst
.. meta::
:description: This topic describes asynchronous concurrent execution in HIP
:keywords: AMD, ROCm, HIP, asynchronous concurrent execution, asynchronous, async, concurrent, concurrency

.. _asynchronous_how-to:

Overlap of data transfer and kernel execution
===============================================================================

One of the primary benefits of asynchronous operations is the ability to
overlap data transfer with kernel execution, leading to better resource
utilization and improved performance.

Querying device capabilities
-------------------------------------------------------------------------------

Some AMD HIP-enabled devices can perform asynchronous memory copy operations to
or from the GPU concurrently with kernel execution. Applications can query this
capability by checking the ``asyncEngineCount`` device property. Devices with
an ``asyncEngineCount`` greater than zero support concurrent data transfers.
Additionally, if host memory is involved in the copy, it should be page-locked
to ensure optimal performance.
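
A minimal sketch of this query (device 0 assumed, most error handling elided):

```cpp
#include <hip/hip_runtime.h>
#include <iostream>

int main() {
    int deviceId = 0;
    hipDeviceProp_t props;
    if (hipGetDeviceProperties(&props, deviceId) != hipSuccess) {
        std::cerr << "Failed to query device properties\n";
        return 1;
    }

    // Devices with asyncEngineCount > 0 can overlap kernel execution
    // with asynchronous memory copies.
    std::cout << "asyncEngineCount: " << props.asyncEngineCount << '\n';
    if (props.asyncEngineCount > 0) {
        std::cout << "Concurrent data transfers are supported\n";
    }
    return 0;
}
```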

Asynchronous memory operations
-------------------------------------------------------------------------------

Asynchronous memory operations allow data to be transferred between the host
and device while kernels are being executed on the GPU. Using operations like
:cpp:func:`hipMemcpyAsync`, developers can initiate data transfers without
waiting for the previous operation to complete. This overlap of computation and
data transfer ensures that the GPU is not idle while waiting for data. Examples
without blocking other operations. :cpp:func:`hipMemcpyPeerAsync` enables data
transfers between different GPUs, facilitating multi-GPU communication.
Concurrent data transfers are important for applications that require frequent
and large data movements. By overlapping data transfers with computation,
developers can minimize idle times and enhance performance. Careful management
of data transfers makes efficient use of memory bandwidth and reduces
bottlenecks. This is particularly important for applications that must handle
large volumes of data efficiently.
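
The pattern can be sketched as follows. This is an illustrative example, not
taken from the original: the kernel ``scale``, the buffer size, and the launch
configuration are assumptions. A page-locked host buffer is copied to the
device with :cpp:func:`hipMemcpyAsync`, a kernel runs in the same stream, and
the host only waits when the result is needed:

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    constexpr size_t n = 1 << 20;

    // Page-locked host memory allows truly asynchronous transfers.
    float* hostData;
    hipHostMalloc(&hostData, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) hostData[i] = 1.0f;

    float* deviceData;
    hipMalloc(&deviceData, n * sizeof(float));

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Enqueue copy, kernel, and copy-back; hipMemcpyAsync returns
    // immediately, so the host thread is free to do other work here.
    hipMemcpyAsync(deviceData, hostData, n * sizeof(float),
                   hipMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(deviceData, 2.0f, n);
    hipMemcpyAsync(hostData, deviceData, n * sizeof(float),
                   hipMemcpyDeviceToHost, stream);

    // Wait only when the result is actually needed.
    hipStreamSynchronize(stream);

    hipStreamDestroy(stream);
    hipFree(deviceData);
    hipHostFree(hostData);
    return 0;
}
```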

Concurrent data transfers with intra-device copies
-------------------------------------------------------------------------------
Synchronous calls
-------------------------------------------------------------------------------

Despite the benefits of asynchronous operations, there are scenarios where
synchronous calls are necessary. Synchronous calls ensure that a task
completes before the next operation begins, which is crucial for data
consistency and correct execution order. For example, :cpp:func:`hipMemcpy`
waits for the data transfer to finish before returning control to the host.
Similarly, synchronous kernel launches are used when immediate completion is
required. When a synchronous function is called, control is not returned to
the host thread until the device has completed the requested task. How the
host thread waits, by yielding, blocking, or spinning, can be specified using
``hipSetDeviceFlags`` with specific flags. Understanding when to use
synchronous calls is essential for managing execution flow and avoiding data
races.
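
A sketch of selecting the host-thread wait behavior; the choice of
``hipDeviceScheduleYield`` here is illustrative, and the flags must be set
before the device is used by the thread:

```cpp
#include <hip/hip_runtime.h>

int main() {
    // hipDeviceScheduleYield: a waiting host thread yields its time slice.
    // Alternatives include hipDeviceScheduleSpin and
    // hipDeviceScheduleBlockingSync.
    hipSetDeviceFlags(hipDeviceScheduleYield);
    hipSetDevice(0);

    float *src, *dst;
    hipHostMalloc(&src, sizeof(float));
    hipMalloc(&dst, sizeof(float));

    // Synchronous copy: does not return until the transfer is complete.
    hipMemcpy(dst, src, sizeof(float), hipMemcpyHostToDevice);

    hipFree(dst);
    hipHostFree(src);
    return 0;
}
```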

Events for synchronization
-------------------------------------------------------------------------------
Example
return 0;
}
.. tab-item:: hipStreamWaitEvent

.. code-block:: cpp
Example
return 0;
}
.. tab-item:: sequential

.. code-block:: cpp
HIP Graphs
HIP Graphs provide a way to represent complex workflows as a series of
interconnected tasks. By creating and managing graphs, developers can optimize
dependent task execution. Graphs reduce the overhead associated with launching
individual kernels and memory operations, providing a high-level abstraction
for managing dependencies and synchronizing tasks. Examples include
representing a sequence of kernels and memory operations as a single graph.
Using graphs enhances performance and simplifies complex workflow management.
This technique is particularly useful for applications with intricate
dependencies and multiple execution stages.

For more details, see the :ref:`how_to_HIP_graph` documentation.
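
Besides building a graph from explicit nodes, an existing stream workload can
be recorded into a graph with stream capture. The following is a hedged
sketch: the kernel ``step``, the buffer size, and the iteration count are
illustrative assumptions, not from the original:

```cpp
#include <hip/hip_runtime.h>

__global__ void step(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    constexpr size_t n = 1 << 16;
    float* deviceData;
    hipMalloc(&deviceData, n * sizeof(float));

    hipStream_t stream;
    hipStreamCreate(&stream);

    // Record the work submitted to the stream into a graph instead of
    // executing it immediately.
    hipGraph_t graph;
    hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
    for (int i = 0; i < 4; ++i) {
        step<<<(n + 255) / 256, 256, 0, stream>>>(deviceData, n);
    }
    hipStreamEndCapture(stream, &graph);

    // Instantiate once, then launch the whole sequence with a single call,
    // avoiding per-kernel launch overhead on replay.
    hipGraphExec_t graphExec;
    hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    hipGraphLaunch(graphExec, stream);
    hipStreamSynchronize(stream);

    hipGraphExecDestroy(graphExec);
    hipGraphDestroy(graph);
    hipStreamDestroy(stream);
    hipFree(deviceData);
    return 0;
}
```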

Example

This example demonstrates the use of HIP Graphs to manage asynchronous
concurrent execution of two kernels. It creates a graph with nodes for the
kernel executions and memory copies, which are then instantiated and launched
in two separate streams. This setup ensures efficient and concurrent execution,
leveraging the high-level abstraction of HIP Graphs to simplify the workflow
and improve performance.

.. code-block:: cpp
achieving optimal performance. Here are some key strategies to consider:
- minimize synchronization overhead: Synchronize only when necessary to avoid
stalling the GPU and hindering parallelism.

- leverage asynchronous operations: Use asynchronous memory transfers and
kernel launches to overlap computation and data transfer, maximizing resource
utilization.

- balance workloads: Distribute tasks efficiently between the host and device
to ensure both are fully utilized. This can significantly enhance application
responsiveness and performance.

- utilize multiple streams: Create and manage multiple streams to run commands
Key profiling metrics include:
identify opportunities to improve concurrency and reduce idle times.

Using profiling tools, developers gain a comprehensive understanding of their
application's performance characteristics, making informed decisions about
where to focus optimization efforts. Regular profiling and adjustments ensure
that applications run at their best, maintaining high efficiency and
performance.
