Threading Model #388
Conversation
I think the most interesting question still to answer is whether or not we serialize kernels on all backends for the same execution space instance.
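For concreteness, here is a minimal sketch of the scenario I read this as (the view, labels, and two-kernel structure are made up for illustration): two kernels submitted back-to-back to the same execution space instance. On CUDA/HIP they end up in the same stream and therefore serialize; the open question is whether that ordering is promised for every backend.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::DefaultExecutionSpace exec;  // one execution space instance
    Kokkos::View<double*> a("a", 1000);

    // Both kernels are dispatched to the *same* instance. The question under
    // discussion: is the second kernel guaranteed not to start before the
    // first completes, on every backend?
    Kokkos::parallel_for(
        "fill", Kokkos::RangePolicy<>(exec, 0, a.extent(0)),
        KOKKOS_LAMBDA(int i) { a(i) = 1.0; });
    Kokkos::parallel_for(
        "scale", Kokkos::RangePolicy<>(exec, 0, a.extent(0)),
        KOKKOS_LAMBDA(int i) { a(i) *= 2.0; });

    exec.fence();  // make both kernels happen-before the host continuing
  }
  return 0;
}
```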
A multi-threaded program structured such that there is a *happens-before* relationship between each call to perform a *Fundamental Operation* will behave equivalently to a single-threaded program that performs the same sequence of *Fundamental Operations*. (Note: This is analogous to ``MPI_THREAD_SERIALIZED``)

.. Do we actually want to guarantee that every Fundamental Operation is serializing? Should that just mean that we don't require call sites to have *happens-before* relationships, or should they also internally create such *happens-before* relationships? I.e. that the calling threads *synchronize-with* each other at those points?
That's a key question. My understanding is that we want to serialize parallel dispatch to the same execution space instance but I don't think we want to promise anything with respect to data access outside of kernels.
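As an illustration of what the quoted wording would require of the user (the mutex/thread structure here is an assumption for the sketch, not something the text mandates): every call into Kokkos gets a happens-before edge with every other call, analogous to MPI_THREAD_SERIALIZED, while nothing is promised about data access outside of kernels.

```cpp
#include <Kokkos_Core.hpp>
#include <mutex>
#include <thread>

// Illustrative only: the mutex gives every Fundamental Operation call site a
// happens-before relationship with every other one, which is what the quoted
// wording asks of the *user*. Whether Kokkos additionally serializes the
// submitted kernels on the same instance is the open question above.
std::mutex kokkos_mutex;

void worker(Kokkos::View<double*> v, double value) {
  std::lock_guard<std::mutex> lock(kokkos_mutex);  // serialize the dispatch call
  Kokkos::parallel_for(
      "assign", v.extent(0), KOKKOS_LAMBDA(int i) { v(i) = value; });
}

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::View<double*> a("a", 1 << 20);
    Kokkos::View<double*> b("b", 1 << 20);
    std::thread t0(worker, a, 1.0);
    std::thread t1(worker, b, 2.0);
    t0.join();
    t1.join();
    Kokkos::fence();  // all dispatched work happens-before this point
  }
  return 0;
}
```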
*Global Synchronization* creates a *happens-before* relationship between the completion of every *Fundamental Operation* on any *Execution Space Instance* that *happens-before* the *Global Synchronization* and the thread that performs the *Global Synchronization*.

.. Should the above actually be *synchronizes-with*?
Is there really much of a difference when we are talking about a fence?
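For reference, the two fences we have today in sketch form (the kernel and view names are made up): exec.fence() only covers work previously submitted to that instance, while Kokkos::fence() covers outstanding work on all instances.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::DefaultExecutionSpace exec;
    Kokkos::View<double*> v("v", 1000);

    Kokkos::parallel_for(
        "init", Kokkos::RangePolicy<>(exec, 0, v.extent(0)),
        KOKKOS_LAMBDA(int i) { v(i) = i; });

    // Local synchronization: waits only on work submitted to `exec`.
    exec.fence("wait for work on this instance");

    // Global synchronization: waits on outstanding work on *all* instances.
    Kokkos::fence("wait for everything");
  }
  return 0;
}
```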
* Managed Construction
Managed construction of a Kokkos View performs a *Memory Allocation*, potentially followed by a *Parallel Dispatch* to initialize the memory (depending on whether ``WithoutInitializing`` was passed), potentially followed by a *Synchronization* (if no execution space instance was passed, so that allocation and initialization *happen-before* any subsequent operation that may reference the ``View``'s memory).

.. Do we want that to be *Global Synchronization* or *Local Synchronization*?
We effectively do a device-wide (or at least execution space-wide) synchronization at the moment, see https://github.com/kokkos/kokkos/blob/5d81422daea73f5a2a69771cc0dfafc19f785003/core/src/Cuda/Kokkos_CudaSpace.cpp#L160-L205. The intent is to make sure that memory can't be accessed before allocation is complete and thus it should be (IMHO) enough to fence the active execution space instance on the current thread.
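To make the three cases in the quoted paragraph concrete (a sketch; the labels are made up): default construction allocates, initializes, and synchronizes; WithoutInitializing skips the initializing dispatch; passing an execution space instance keeps allocation and initialization on that instance, so only an instance-local fence should be needed, as you say.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::DefaultExecutionSpace exec;

    // Allocation + initializing parallel dispatch + synchronization.
    Kokkos::View<double*> a("a", 1000);

    // Allocation only: no initializing dispatch is performed.
    Kokkos::View<double*> b(
        Kokkos::view_alloc(Kokkos::WithoutInitializing, "b"), 1000);

    // Allocation + initialization submitted to `exec`; per the quoted text no
    // global synchronization is implied, so uses on other instances must be
    // ordered explicitly, e.g. by fencing the instance.
    Kokkos::View<double*> c(Kokkos::view_alloc(exec, "c"), 1000);
    exec.fence();
  }
  return 0;
}
```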
* *Initialization*

.. Not just Kokkos::init, but also whatever device-specific or thread-specific stuff we have Legion doing now

* *Finalization*

.. Ditto Initialization
Backends can still only be initialized or finalized once. I'm not quite sure it's worth mentioning initialization/finalization then. At the very least, we need to clarify what we mean here (execution space instance initialization/finalization may be sensible).
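For context, the once-only constraint in sketch form (ScopeGuard is used here just for illustration): initialization and finalization happen exactly once, on one thread, and any threads the program spawns must not call them again.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Initialization happens exactly once, on this thread.
  Kokkos::ScopeGuard guard(argc, argv);
  // ... spawn threads, do work; no thread may re-initialize Kokkos ...
  return 0;  // guard's destructor finalizes Kokkos exactly once
}
```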
* *Data Access*
``View::operator()``, to memory that is accessible from the host.
Not quite sure if we want to promise anything about data access outside of kernels.
I think we have to; otherwise we can't suitably address either the use of unmanaged views or UVM.
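A sketch of the unmanaged-view case this is about (the std::vector buffer and names are made up): host-side View::operator() on memory the host can reach, which is exactly where whatever ordering promise we make (or don't make) relative to kernels matters.

```cpp
#include <Kokkos_Core.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    // Unmanaged view wrapping user-owned host memory.
    std::vector<double> buffer(1000, 0.0);
    Kokkos::View<double*, Kokkos::HostSpace,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>
        h(buffer.data(), buffer.size());

    // Kernel on a host execution space touching the same memory.
    Kokkos::parallel_for(
        "touch",
        Kokkos::RangePolicy<Kokkos::DefaultHostExecutionSpace>(0, h.extent(0)),
        KOKKOS_LAMBDA(int i) { h(i) = 2.0 * i; });

    // Host Data Access through operator(): whether this needs an explicit
    // fence is what the threading model has to pin down; fencing here is the
    // conservative choice.
    Kokkos::fence();
    double first = h(0);
    (void)first;
  }
  return 0;
}
```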
* Metadata Query
* Element Access
Element Access performs a Data Access operation.
Not quite sure if we need these.
Backend-Specific Details
------------------------

.. Local or Global synchronizations below?
It might be enough to group backends into synchronous and asynchronous backends, clarifying that kernels submitted by multiple threads are serialized (if we decide to make that promise).
* ``CUDA`` and ``HIP``

* ``HPX``
We should talk more about parallel dispatch and the behavior of independent threads (without a happens-before relationship between them) accessing the same data.
Possibly also clarifying where we promise that dispatch implies fences (linking to API for parallel_for, parallel_reduce, parallel_scan).
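One thing we can already point at: the parallel_reduce documentation distinguishes reducing into a scalar reference (blocking) from reducing into a View (potentially asynchronous). A sketch of that distinction (the sum kernel is made up):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    const int n = 1000;

    // Reducing into a scalar reference: the call blocks and `sum` is valid
    // on return.
    double sum = 0.0;
    Kokkos::parallel_reduce(
        "sum", n, KOKKOS_LAMBDA(int i, double& lsum) { lsum += i; }, sum);

    // Reducing into a View: the call may return before the result has been
    // written, so the host must fence (or deep_copy) before using it.
    Kokkos::View<double> result("result");
    Kokkos::parallel_reduce(
        "sum_view", n, KOKKOS_LAMBDA(int i, double& lsum) { lsum += i; },
        result);
    Kokkos::fence();
  }
  return 0;
}
```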
TODO:

Document what semantics we actually have around use of multiple threads calling Kokkos.

The foundational principles I think we have are that View::operator() from host, and equivalent memory access in buffers that we deep_copy to/from)
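A sketch of the two access paths named above (view names made up): View::operator() from host on a mirror, and the buffers we deep_copy to/from, contrasting the blocking two-argument deep_copy with the asynchronous execution-space-instance overload.

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::DefaultExecutionSpace exec;
    Kokkos::View<double*> d("d", 1000);
    auto h = Kokkos::create_mirror_view(d);

    Kokkos::parallel_for(
        "fill", Kokkos::RangePolicy<>(exec, 0, d.extent(0)),
        KOKKOS_LAMBDA(int i) { d(i) = i; });

    // Two-argument deep_copy fences: the data is host-visible on return, so
    // View::operator() on the mirror is ordered after the kernel.
    Kokkos::deep_copy(h, d);
    double x = h(0);

    // The instance overload is asynchronous with respect to the host: the
    // host must not touch `h` through operator() until the instance is fenced.
    Kokkos::deep_copy(exec, h, d);
    exec.fence();
    double y = h(h.extent(0) - 1);
    (void)x;
    (void)y;
  }
  return 0;
}
```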