
Missing synchronization primitive dart_sync #323

Closed
devreal opened this issue Mar 14, 2017 · 2 comments

@devreal
Member

devreal commented Mar 14, 2017

While going through the MPI standard I found the following notable example: http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf#section.11.7 (MPI standard 3.1, Example 11.6)

Process A:             Process B:
                       window location X
                       store X /* update to private & public copy of B */
MPI_Barrier            MPI_Barrier
MPI_Win_lock_all
MPI_Get(X) /* ok, read from window */
MPI_Win_flush_local(B)
/* read value in X */
MPI_Win_unlock_all

The accompanying text states:

Example 11.7
In the RMA unified model, although the public and private copies of the windows are synchronized, caution must be used when combining load/stores and multi-process synchronization. Although the following example appears correct, the compiler or hardware may delay the store to X after the barrier, possibly resulting in the MPI_GET returning an incorrect value of X.

MPI_BARRIER provides process synchronization, but not memory synchronization. The example could potentially be made safe through the use of compiler- and hardware-specific notations to ensure the store to X occurs before process B enters the MPI_BARRIER. The use of one-sided synchronization calls, as shown in Example 11.6, also ensures the correct result.

I think this is a common use-case for us:

  1. Create a window and share-lock it.
  2. Fill the memory locally.
  3. Synchronize units through barriers and start issuing RMA operations.

Thus, we may run into synchronization issues at some point (although I haven't seen them so far). Two options I see:

  1. Introduce an explicit dart_sync, which calls MPI_Win_sync (originally: MPI_Win_fence) as suggested on page 456 of the MPI standard (Advice to users under point U5) to make store updates visible to all other threads and processes; see the sketch after this list.
  2. Define our barrier to serve as a memory barrier as well. This won't cover threads though.
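To make option 1 concrete, here is a minimal sketch of what such a call could map to on the MPI level. This is not DART code; dart_sync is just the proposed name, and the window handle and setup are assumptions:

#include <mpi.h>

/* Sketch only: the proposed dart_sync mapped to MPI_Win_sync.
 * Assumes the window is already in a passive-target epoch
 * (e.g. shared-locked via MPI_Win_lock_all), as in step 1 above. */
void dart_sync(MPI_Win win)
{
  /* Synchronize the private and public copies of the window; acts as
   * a memory barrier for preceding local stores to window memory. */
  MPI_Win_sync(win);
}

On the target side of the example above, the store to X would then be followed by dart_sync(win) before MPI_Barrier, so that MPI_Get on process A observes the updated value.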
@rkowalewski

You are right about the above-mentioned problem; however, it is actually not an issue in DASH. Just to confirm that I got it right: in the unified memory model it is not guaranteed that local updates "happen before" subsequently issued MPI barriers. And this makes sense, as barriers deal only with process synchronization, not memory synchronization.

Suggestion 1 (fence) is not an option, as already discussed in #313. The only alternative is MPI_Win_sync, but this does not help in this case anyway.

Now in DASH, given the following algorithm:

dash::Array<int> array(1000);
auto const val = ...;
// initializes the local range on each unit
dash::fill(array.begin(), array.end(), val);
// synchronize all units
array.barrier();

// continue processing ...

After the barrier has returned, the memory semantics are well defined, since we execute a dart_flush in the array's barrier, and a flush guarantees that the public and private copies of the MPI window are synchronized. This corresponds to option 2 as described above.

Now suppose we replace array.barrier() with array.team().barrier() or dash::barrier(). This does NOT apply any memory synchronization, only a dart_barrier. And this makes sense, as we do not synchronize the container but only the team. In my opinion, we should give the user explicit control to distinguish between memory and process synchronization.
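Conceptually, the difference looks roughly like this (only a sketch with simplified signatures and an assumed header path, not the actual DART implementation):

#include <dash/dart/if/dart.h>  /* assumed header path, for illustration only */

/* array.barrier(): memory synchronization followed by process
 * synchronization of the array's team. */
void container_barrier(dart_gptr_t gptr, dart_team_t team)
{
  dart_flush(gptr);    /* synchronize public/private copies of the window */
  dart_barrier(team);  /* synchronize the processes of the team           */
}

/* array.team().barrier() / dash::barrier(): process synchronization only,
 * no memory synchronization implied. */
void team_barrier(dart_team_t team)
{
  dart_barrier(team);
}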

So finally, I think the DART interface is well defined. Implicitly applying memory synchronization in a dart barrier is not really required, and it is even a bad solution if we care about separation of concerns. Moreover, it can degrade performance, as memory synchronization is expensive.

@devreal
Member Author

devreal commented Mar 14, 2017

Suggestion 1 (fence) is not an option, as already discussed in #313. The only alternative is MPI_Win_sync, but this does not help in this case anyway.

Dang, I actually meant to call MPI_Win_sync in dart_sync, my mistake.

After the barrier has returned, the memory semantics are well defined, since we execute a dart_flush in the array's barrier, and a flush guarantees that the public and private copies of the MPI window are synchronized. This corresponds to option 2 as described above.

I wasn't aware of the flush in the array barrier. However, the flush is missing in Matrix::barrier().

Also, the standard does not mention anything about synchronizing private and public copies (or memory barriers) in the description of MPI_Win_flush*. My understanding is that the synchronization of stores is not covered by these functions; only locally issued RMA operations are guaranteed to be completed. Maybe I am missing something here?
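To illustrate my reading with a sketch (unified model; the window is assumed to be shared-locked via MPI_Win_lock_all, and X lives in this unit's window memory; names are made up):

#include <mpi.h>

/* Sketch of my reading: flush completes RMA operations, sync covers stores. */
void example(MPI_Win win, int *X, int target)
{
  int buf = 0;
  MPI_Get(&buf, 1, MPI_INT, target, 0, 1, MPI_INT, win);
  MPI_Win_flush_local(target, win); /* covered: the Get has completed locally */

  *X = 42;                          /* plain store, not an RMA operation      */
  MPI_Win_flush(target, win);       /* completes RMA operations to 'target',
                                       but says nothing about the store to *X */
  MPI_Win_sync(win);                /* this is the call that synchronizes the
                                       private and public window copies       */
}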

And yes I agree, a memory barrier in dart_barrier is over the top.

@fuchsto
Member

fuchsto commented Mar 16, 2017

@devreal The missing flush in dash::NArray / dash::Matrix is addressed in #312

Apart from that, are there missing features or broken use cases to be resolved, or can we close this issue?

@devreal
Member Author

devreal commented Mar 16, 2017

The missing flush in dash::NArray / dash::Matrix is addressed in #312

Thanks.

Apart from that, are there missing features or broken use cases to be resolved, or can we close this issue?

No broken use cases at the moment, but a cautious warning that this may eventually backfire on us. Afaics, we do not adhere to the MPI standard here, but there seems to be a discrepancy between @rkowalewski's understanding of it and mine, which I would like to discuss in Garching before closing this issue.

@devreal
Member Author

devreal commented Mar 24, 2017

Closing this after discussions in Garching and realizing that dart_flush calls MPI_Win_sync.
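For the record, a minimal sketch of the behaviour we rely on (hypothetical helper name; not the actual DART-MPI source):

#include <mpi.h>

/* Sketch: a dart_flush-style operation that, besides completing outstanding
 * RMA operations, also synchronizes the private and public window copies. */
void flush_and_sync(int target, MPI_Win win)
{
  MPI_Win_flush(target, win); /* complete outstanding RMA operations        */
  MPI_Win_sync(win);          /* memory barrier: sync private/public copies */
}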

@devreal devreal closed this as completed Mar 24, 2017