Threading Model #388 (Draft)

PhilMiller wants to merge 1 commit into base: main
Conversation

@PhilMiller (Contributor) commented May 24, 2023

Document what semantics we actually have around use of multiple threads calling Kokkos

The foundational principles I think we have are that

  • We should build on the C++ memory model, since we're directly interacting with it at points (particularly, View::operator() from host, and equivalent memory access in buffers that we deep_copy to/from)
  • An execution space instance is very similar to a C++ thread in terms of relevant semantics
  • All the stuff Kokkos does can be decomposed into a few fundamental operations that play key semantic roles (a rough sketch of how these pieces fit together follows this list)
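
A minimal sketch of how those pieces line up in code (the example is mine, not part of the PR; the partitioning call and the specific operations are just illustrative):

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);  // Initialization / Finalization
  using Exec = Kokkos::DefaultExecutionSpace;

  // Two execution space instances: semantically much like two C++ threads,
  // each executing its own ordered stream of submitted work.
  auto inst = Kokkos::Experimental::partition_space(Exec(), 1, 1);

  Kokkos::View<double*> d("d", 100);       // Managed Construction (Memory Allocation)
  auto h = Kokkos::create_mirror_view(d);  // host-accessible buffer

  Kokkos::parallel_for(                    // Parallel Dispatch on inst[0]
      "fill", Kokkos::RangePolicy<Exec>(inst[0], 0, 100),
      KOKKOS_LAMBDA(int i) { d(i) = i; });

  Kokkos::deep_copy(inst[0], h, d);        // Deep Copy enqueued on the same instance
  inst[0].fence();                         // Synchronization: the above happens-before here

  double x = h(0);  // Data Access via View::operator(); ordinary C++ memory
  (void)x;          // model rules govern this host access
  return 0;
}
```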

@PhilMiller (Contributor, Author) commented
Tagged @msimberg because the HPX backend exposes fairly unique behaviors that I definitely don't understand, so it will probably require its own treatment here.

Tagged @keitaTN and @nmm0 because of their work (as I recall it) on describing an operational semantics for Kokkos

@masterleinad (Contributor) left a comment


I think the most interesting question still to answer is whether we serialize kernels on all backends for the same execution space instance or not.


A multi-threaded program structured such that there is a *happens-before* relationship between each call to perform a *Fundamental Operation* will behave equivalently to a single-threaded program that performs the same sequence of *Fundamental Operations*. (Note: This is analogous to ``MPI_THREAD_SERIALIZED``)

.. Do we actually want to guarantee that every Fundamental Operation is serializing? Should that just mean that we don't require call sites to have *happens-before* relationships, or should they also internally create such *happens-before* relationships? I.e. that the calling threads *synchronize-with* each other at those points?
Contributor:


That's a key question. My understanding is that we want to serialize parallel dispatch to the same execution space instance, but I don't think we want to promise anything with respect to data access outside of kernels.
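
For concreteness, a hypothetical sketch (mine, not from the PR) of the case being discussed: two host threads submitting kernels to the same execution space instance, with a *happens-before* relationship (here via a mutex) between the two submission calls:

```c++
#include <Kokkos_Core.hpp>
#include <mutex>
#include <thread>

using Exec = Kokkos::DefaultExecutionSpace;

void submit(const Exec& exec, Kokkos::View<int*> v, int value) {
  // Parallel Dispatch to the shared instance; whether such kernels are
  // serialized on the instance by every backend is the open question above.
  Kokkos::parallel_for("submit",
                       Kokkos::RangePolicy<Exec>(exec, 0, v.extent(0)),
                       KOKKOS_LAMBDA(int i) { v(i) = value; });
}

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  Exec exec;  // one instance shared by both host threads
  Kokkos::View<int*> v("v", 1 << 20);
  std::mutex m;

  auto worker = [&](int value) {
    std::lock_guard<std::mutex> lock(m);  // happens-before between the two calls
    submit(exec, v, value);
  };
  std::thread t1(worker, 1), t2(worker, 2);
  t1.join();
  t2.join();

  exec.fence();  // Synchronization before the host inspects v
  return 0;
}
```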


*Global Synchronization* creates a *happens-before* relationship between the completion of every *Fundamental Operation* on any *Execution Space Instance* that *happens-before* the *Global Synchronization* and the thread that performs the *Global Synchronization*.

.. Should the above actually be *synchronizes-with*?
Contributor:


Is there really much of a difference when we talk about fence?
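
An illustrative sketch (mine, not from the PR) of the local vs. global flavor of fence that the question is about:

```c++
#include <Kokkos_Core.hpp>

void fence_flavors() {
  using Exec = Kokkos::DefaultExecutionSpace;
  auto inst = Kokkos::Experimental::partition_space(Exec(), 1, 1);

  Kokkos::View<double*> a("a", 1000), b("b", 1000);
  Kokkos::parallel_for("ka", Kokkos::RangePolicy<Exec>(inst[0], 0, 1000),
                       KOKKOS_LAMBDA(int i) { a(i) = 1.0; });
  Kokkos::parallel_for("kb", Kokkos::RangePolicy<Exec>(inst[1], 0, 1000),
                       KOKKOS_LAMBDA(int i) { b(i) = 2.0; });

  inst[0].fence();  // local: only work submitted to inst[0] happens-before here
  Kokkos::fence();  // global: work on *every* instance happens-before the
                    // calling thread continuing past this point
}
```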


* Managed Construction
Managed construction of a Kokkos View performs a *Memory Allocation*, potentially followed by a *Parallel Dispatch* to initialize the memory (depending on whether ``WithoutInitializing`` was passed), potentially followed by a *Synchronization* (if no execution space instance was passed, so that allocation and initialization *happen-before* any subsequent operation that may reference the ``View``'s memory).
.. Do we want that to be *Global Synchronization* or *Local Synchronization*?
Contributor:


We effectively do a device-wide (or at least execution space-wide) synchronization at the moment, see https://github.com/kokkos/kokkos/blob/5d81422daea73f5a2a69771cc0dfafc19f785003/core/src/Cuda/Kokkos_CudaSpace.cpp#L160-L205. The intent is to make sure that memory can't be accessed before allocation is complete and thus it should be (IMHO) enough to fence the active execution space instance on the current thread.
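
For reference, a sketch (mine, not from the PR) of the construction variants the quoted passage distinguishes, as I understand the current behavior:

```c++
#include <Kokkos_Core.hpp>

void construction_variants(const Kokkos::DefaultExecutionSpace& exec) {
  // Allocation + initializing Parallel Dispatch + a fence, since no
  // execution space instance was passed:
  Kokkos::View<double*> a("a", 1000);

  // Allocation only; no initializing kernel is launched:
  Kokkos::View<double*> b(
      Kokkos::view_alloc(Kokkos::WithoutInitializing, "b"), 1000);

  // Allocation + initialization submitted to 'exec' with no blanket fence;
  // later work touching 'c' must be ordered after 'exec' by the caller:
  Kokkos::View<double*> c(Kokkos::view_alloc(exec, "c"), 1000);
}
```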

Comment on lines +24 to +30
* *Initialization*

.. Not just Kokkos::init, but also whatever device-specific or thread-specific stuff we have Legion doing now

* *Finalization*

.. Ditto Initialization
Contributor:


Backends can still only be initialized or finalized once. I'm not quite sure if it's worth mentioning initialization/finalization then. At the very least, we need to clarify what we mean here (execution space instance initialization/finalization may be sensible).
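
A minimal sketch (mine) of the process-lifetime vs. instance-lifetime distinction, assuming the usual once-per-program initialize/finalize:

```c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // Initialization/Finalization happen exactly once per program run...
  Kokkos::ScopeGuard guard(argc, argv);
  {
    // ...whereas execution space instances can be created and destroyed
    // freely in between, which may be the granularity this section needs.
    Kokkos::DefaultExecutionSpace exec;
    exec.fence();
  }
  return 0;
}
```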

Comment on lines +53 to +55
* *Data Access*
``View::operator()``, to memory that is accessible from the host.

Contributor:


Not quite sure if we want to promise anything about data access outside of kernels.

@PhilMiller (Contributor, Author):


I think we have to, or else we can't suitably address either the usage of unmanaged views or UVM.
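
A sketch (mine, not from the PR) of the two cases mentioned, an unmanaged View over user-owned memory and host-visible shared memory, where host-side *Data Access* semantics have to mean something:

```c++
#include <Kokkos_Core.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);

  // Unmanaged View wrapping memory that Kokkos did not allocate:
  std::vector<double> buf(100, 0.0);
  Kokkos::View<double*, Kokkos::HostSpace,
               Kokkos::MemoryTraits<Kokkos::Unmanaged>>
      u(buf.data(), buf.size());
  u(0) = 1.0;  // Data Access through View::operator() on the host

  // UVM/shared allocations (a View in a host-and-device-accessible memory
  // space) raise the same issue: a host read of an element written by a
  // kernel is only well-defined once that kernel happens-before the read
  // (fence or equivalent).
  return 0;
}
```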

Comment on lines +108 to +110
* Metadata Query
* Element Access
Element Access performs a Data Access operation.
Contributor:


Not quite sure if we need these.
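
For what it's worth, a tiny sketch (mine) of the distinction as I read it:

```c++
#include <Kokkos_Core.hpp>

void query_vs_access(Kokkos::View<double*, Kokkos::HostSpace> v) {
  auto n = v.extent(0);    // Metadata Query: touches no element data
  auto name = v.label();   // Metadata Query
  if (n > 0) v(0) = 1.0;   // Element Access: performs a Data Access
  (void)name;
}
```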

Backend-Specific Details
------------------------

.. Local or Global synchronizations below?
Contributor:


It might be enough to group backends into synchronous and asynchronous backends, clarifying that kernels submitted by multiple threads are serialized (if we decide to make that promise).

* ``CUDA`` and ``HIP``

* ``HPX``

@masterleinad (Contributor) commented May 25, 2023


We should talk more about parallel dispatch and the behavior of independent threads (without a happens-before relationship between them) accessing the same data.
Possibly also clarifying where we promise that dispatch implies fences (linking to API for parallel_for, parallel_reduce, parallel_scan).
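
To make the fencing point concrete, here is a sketch (mine, not from the PR) of where dispatch does and does not imply blocking today, as I understand the parallel_for/parallel_reduce API docs:

```c++
#include <Kokkos_Core.hpp>

void dispatch_and_fences(const Kokkos::DefaultExecutionSpace& exec) {
  Kokkos::View<double*> v("v", 1000);
  Kokkos::View<double> sum_d("sum_d");  // device-resident result
  double sum_h = 0.0;                   // host scalar result

  // parallel_for: may return before the kernel finishes (no implied fence).
  Kokkos::parallel_for("fill", Kokkos::RangePolicy<>(exec, 0, 1000),
                       KOKKOS_LAMBDA(int i) { v(i) = 1.0; });

  // parallel_reduce into a plain scalar: blocks until the result is ready.
  Kokkos::parallel_reduce("sum_blocking", Kokkos::RangePolicy<>(exec, 0, 1000),
                          KOKKOS_LAMBDA(int i, double& lsum) { lsum += v(i); },
                          sum_h);

  // parallel_reduce into a View: potentially asynchronous; the host must
  // synchronize before reading the result.
  Kokkos::parallel_reduce("sum_async", Kokkos::RangePolicy<>(exec, 0, 1000),
                          KOKKOS_LAMBDA(int i, double& lsum) { lsum += v(i); },
                          sum_d);
  exec.fence();
}
```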

@ajpowelsnl (Contributor) commented

TODO:

  • Champions: @masterleinad, @ajpowelsnl
  • Request reviews for:
    • #4385
    • #6051
    • #6151
  • Discuss Kokkos "thread safety" concepts with team
  • Decide where to document "thread safety" behavior, using this PR as a starting point
