Missing progress in barriers #313
Comments
I totally agree. https://github.com/dash-project/dash/blame/development/dart-impl/mpi/src/dart_communication.c#L874 Back then, there was discussion about remote completion in the context of the dart-gaspi implementation without a final resolution, so I didn't integrate it in DASH. But by now, I think we agree that it should be done. EDIT: Sorry, you were referring to …
I thought about flushes as well. However, you have no guarantee that a flush hits all outstanding (remotely issued) RMA operations. A fence is the only viable option there. The question is just how we integrate it 😄 Oh, and as far as I know, …
Well, let's implement … We should not use MPI lingo in DASH, though. I will post a concept proposal here.
... actually, this nicely fits the point-to-point synchronization of memory spaces I'm currently specifying. So I will extend these specs with fencing and report back.
Just preparing a PR for that. The question is rather how this is integrated into containers (see the alternatives above).
Let's create another (dependent) PR for the integration in containers.
Uhm... wait! We are mixing apples and oranges. Applying MPI fences in our synchronization model is actually undefined behavior and erroneous. First, MPI fences are a concept of active target synchronization, while we only use passive target synchronization. Second, progress and completion are two independent concepts, and MPI does not specify anything regarding progress; that is purely implementation-specific. So the problem mentioned above will probably only occur with specific MPI libraries, and I would say that you used Open MPI.
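For reference, a minimal sketch (not DART code; buffer sizes, ranks, and window handling are illustrative) contrasting the two RMA synchronization modes in question: collective fences in active target mode versus lock/flush in passive target mode.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int *base;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  *base = 0;
  int value  = rank;
  int target = (rank + 1) % size;

  /* Active target: fences are collective over the window's group and
   * open/close access and exposure epochs on every process. */
  MPI_Win_fence(0, win);
  MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
  MPI_Win_fence(0, win);            /* completes the put at origin and target */

  /* Passive target (what DART uses): the target takes no part in the
   * synchronization; the origin completes its operations via flushes. */
  MPI_Win_lock_all(0, win);
  MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
  MPI_Win_flush(target, win);
  MPI_Win_unlock_all(win);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```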
Aaaah, of course!
Yes, and it has different consistency guarantees compared to a flush.
Yes, sorry, didn't pay attention here.
There are some workarounds, and the "portable" one without any additional progress threads that I know of so far is an MPI probe like the following:
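(A sketch of what such a probe-based workaround can look like, for illustration only; the flag-in-window-memory setup is an assumption, as is the unified memory model.)

```c
#include <mpi.h>
#include <stdio.h>

/* Run with 2 ranks: rank 0 puts a flag into rank 1's window; rank 1 spins
 * on MPI_Iprobe purely to keep the MPI progress engine turning until the
 * remotely written flag becomes visible. */
int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int *flag;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &flag, &win);
  *flag = 0;
  MPI_Win_lock_all(0, win);                 /* passive target, as in DART */

  if (rank == 0) {
    int one = 1;
    MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_flush(1, win);                  /* complete the put */
  } else if (rank == 1) {
    int probe_flag;
    while (*flag == 0) {
      /* Matches nothing; it only gives the library the chance to progress
       * the incoming one-sided message. */
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                 &probe_flag, MPI_STATUS_IGNORE);
      MPI_Win_sync(win);                    /* refresh the view of window memory */
    }
    printf("flag arrived\n");
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```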
It does nothing but trigger the progress engine and wait for whatever one-sided message arrives.
@rkowalewski Shoot, I forgot that we are only working in passive mode. I guess I was too excited, so thanks for the correction. However, your solution is not a real one here because it is unknown how many RMA requests we have to wait for. Unless, of course, we use conventional messages during a custom …
How about …
Actually, MPI_Iprobe is asynchronous and does not semantically enforce anything regarding one-sided communication. It is only a dirty hack, but it works. It does not even have to match the number of one-sided messages. It only triggers the progress engine periodically, and as soon as one message arrives you can assume that all messages arrive.
@devreal Well, if I understand the standard correctly, there is no specification on when the progress engine is triggered, or that a progress engine exists.
Any experience in combining …
Correct, and that is why I call it a workaround and a dirty hack.
In terms of the MPI standard, we conform to the semantics in DART. It is an MPI problem in specific libraries, and as the implementations get better it should be fixed in future versions.
Well, the problem is that MPI does not guarantee progress but our programming model relies on progress. So it is the other way around: we need to ensure progress and cannot rely on MPI to provide it. Hence, we need to come up with a general solution. I'm experimenting with …
Yes, you are right, and this is the original purpose of issue #54. Integrating that will solve many problems. :)
I'm not sure we should make that our default, as it adds quite some complexity, but we should discuss this in Garching. The solution using …
OK, sounds good. It is more or less the same solution as the Iprobe one. I assume that if you move the local flush out behind the while loop it works as well, doesn't it?
But anyway, it is a workaround. So let's keep it as is and discuss it in Garching. :)
No. The Iprobe solution and my solution are not comparable for another reason: The solution I proposed has the same semantics as …
I pushed a version of … There is another issue I realized while working on this: all RMA operations in one team are issued on the same dynamic window, to which the allocated segments are attached. The same is true for synchronization operations, i.e., … @fuerlinger @HuanZhou2 Do you remember why we use a dynamic window instead of individual windows for each allocation/registration?
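For context, a sketch of the dynamic-window scheme described here (illustrative, not the DART implementation): one window per communicator, each allocated segment attached to it, and targets addressed via absolute addresses.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* One dynamic window for the whole team... */
  MPI_Win win;
  MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* ...to which every new segment is attached. */
  int *segment = calloc(size, sizeof(int));
  MPI_Win_attach(win, segment, size * sizeof(int));

  /* Dynamic windows address targets by absolute address, so the base
   * addresses have to be exchanged among the units. */
  MPI_Aint my_addr, *addrs = malloc(size * sizeof(MPI_Aint));
  MPI_Get_address(segment, &my_addr);
  MPI_Allgather(&my_addr, 1, MPI_AINT, addrs, 1, MPI_AINT, MPI_COMM_WORLD);

  MPI_Win_lock_all(0, win);
  int value = rank, target = 0;
  MPI_Put(&value, 1, MPI_INT, target,
          addrs[target] + (MPI_Aint)rank * sizeof(int), 1, MPI_INT, win);
  MPI_Win_flush(target, win);
  MPI_Win_unlock_all(win);

  MPI_Barrier(MPI_COMM_WORLD);     /* all puts are flushed before detaching */
  MPI_Win_detach(win, segment);
  MPI_Win_free(&win);
  free(segment);
  free(addrs);
  MPI_Finalize();
  return 0;
}
```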
Sorry, just saw the issue. I think Roger has already somehow explained the progress thing. Here I'd like to complement it a little bit more: normally, the put-plus-MPI_Test method is a typical way to progress an outstanding MPI operation if there is no separate progress thread (just like what @devreal has proposed). Regarding this method, you guys can refer to the paper "Implementation and performance analysis of non-blocking collective operations for MPI". What I have applied for progress is to adopt independent progress (instead of frequent checking) for the non-blocking one-sided communication operations.
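A sketch of that put-plus-MPI_Test pattern (illustrative; the helper name is made up, and the caller is assumed to already hold a passive-target lock on `win`):

```c
#include <mpi.h>

/* Hypothetical helper: issue a request-based put and poll it to completion.
 * Each MPI_Test call gives the library a chance to make progress; note that
 * completion of the request only means the origin buffer is reusable, not
 * that the data is already visible at the target. */
static void put_with_polling(const int *value, int target, MPI_Aint disp,
                             MPI_Win win)
{
  MPI_Request req;
  int done = 0;
  MPI_Rput(value, 1, MPI_INT, target, disp, 1, MPI_INT, win, &req);
  while (!done) {
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* drives the progress engine */
    /* ... local work can be overlapped here ... */
  }
}
```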
@devreal, the reason I used the dynamic window instead of individual windows is to maximally amortize the window-creation overhead.
@HuanZhou2 Thanks for the explanation. However, with the shared-memory optimization this does not hold anymore because we create a new window on each allocation anyway, right?
While I have not participated in this design decision, it is clear that we cannot really model one-sided communication with active target synchronization, as it always requires collective fences. This model may be a good fit for applications with regular access patterns (e.g., stencils). However, DASH provides data structures which should enable irregular random access patterns. Yes, we provide algorithms as well, but theoretically programmers can apply whatever they want without using our algorithms. Active target would not allow that because of its "static" synchronization model.
Window creation, communication and synchronization are decoupled in MPI-3 RMA: windows are created with MPI_Win_create, MPI_Win_allocate, MPI_Win_allocate_shared or MPI_Win_create_dynamic; communication happens through MPI_Put, MPI_Get, MPI_Accumulate and their request-based variants; and synchronization is done either via active target (MPI_Win_fence, post/start/complete/wait) or via passive target (MPI_Win_lock/MPI_Win_unlock with MPI_Win_flush).
In summary, active target synchronization is not really a good fit for our programming model, and progress for non-blocking operations cannot be guaranteed even with MPI fences. There may be a situation where units apply an MPI fence and wait for one latecomer unit which is occupied by local computation and does not call any MPI routine.
@devreal, I think you may have misunderstood the mixed usage of the shared-memory window and the dynamic window. Here the shared-memory and dynamic windows actually span the same region of memory but serve different kinds of data transfers: the shared-memory window serves intra-node transfers and the dynamic window serves inter-node data transfers.
I totally understand that. My point was that with the use of shared-memory windows the argument of using dynamic windows to reduce the number of allocated windows is not valid anymore because we allocate a window on every allocation (using …).
@devreal, below is the logic: …
This is the scheme you are in favour of, if I understand it correctly.
Comparing these two methods, we can see that the operation of MPI_Win_attach is much lighter than … I'd like to add that the usage of MPI_Win_allocate_shared is to create the shared-memory window object, which only serves intra-node data transfers. However, when there are inter-node data transfers, we can't simply use the shmem_win but need another "remote-access" window, like a normally created window or a dynamic window.
However, note that the allocation of a shared-memory segment does not correspond to a team but to a sub-team which contains all units on a shared-memory node. And as @HuanZhou2 explains, MPI requires different windows for shared memory and distributed memory. So in order to make it visible to the global team, we have to attach the shared-memory segment to the global memory. And that does not mean that one optimization renders the other useless, as the communication is much more efficient.
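To illustrate the combination being described (a sketch only; sizes and naming are placeholders): the segment is allocated through a shared-memory window on the node-local sub-team and, in addition, attached to the team-wide dynamic window so that remote units can reach it.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* Sub-team of all units located on the same shared-memory node. */
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);

  /* Intra-node: shared-memory window, enabling direct load/store access
   * between units on the same node. */
  int *baseptr;
  MPI_Win shmem_win;
  MPI_Win_allocate_shared(100 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                          node_comm, &baseptr, &shmem_win);

  /* Inter-node: the same memory is attached to the global dynamic window,
   * so units on other nodes can reach it with MPI_Put/MPI_Get. */
  MPI_Win dyn_win;
  MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dyn_win);
  MPI_Win_attach(dyn_win, baseptr, 100 * sizeof(int));

  /* ... intra-node traffic goes through shmem_win (or plain load/store),
   *     inter-node traffic goes through dyn_win with absolute addresses ... */

  MPI_Win_detach(dyn_win, baseptr);
  MPI_Win_free(&dyn_win);
  MPI_Win_free(&shmem_win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```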
Ahh, of course. The collective … On another note: I wasn't able to reproduce the deadlock in a standalone test case. After poking around with my original test, I realized that one process was stuck elsewhere (…).
@HuanZhou2 Thank you for engaging in this discussion, much appreciated! So, long story short: we should put effort into integrating Huan's approach. There already is a dedicated issue for this task (#54), so we should continue further discussion there.
Always welcome :). For sure, I will keep the task in mind and try to make progress together with you guys.
The following code may deadlock:
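(A minimal MPI-level sketch of the kind of pattern in question, for illustration only; it is not the snippet originally attached to this issue, and it assumes the unified memory model for the polling loop.)

```c
#include <mpi.h>

/* Run with 2 ranks: unit 0 writes a flag into unit 1's segment, both units
 * enter a barrier, and unit 1 then spins on the flag without calling MPI.
 * If neither the barrier nor the spin loop drives progress for the
 * passive-target put, unit 1 may never observe the flag with some MPI
 * implementations. */
int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int *flag;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &flag, &win);
  *flag = 0;
  MPI_Win_lock_all(0, win);          /* passive target, as in DART */

  if (rank == 0) {
    int one = 1;
    MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_flush_local(1, win);     /* local completion only */
  }

  MPI_Barrier(MPI_COMM_WORLD);       /* stands in for arr.barrier() */

  if (rank == 1) {
    while (*flag == 0) {
      /* no MPI call here: nothing guarantees progress on the put */
    }
  }

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```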
The reason here is that the underlying dart_barrier (implemented in terms of MPI_Barrier) does not guarantee progress on any segment of the team.

I thus propose to implement dart_fence(dart_gptr_t) in terms of MPI_Win_fence, which behaves like a barrier but also guarantees that all outstanding RMA requests are completed upon return, and to implement dash::Array::barrier() in terms of dart_fence. My guess would be that this behavior of arr.barrier() is what users actually expect (providing progress on the underlying allocation).

Alternative 1: Replace dash::Array::barrier() with dash::Array::fence(), which would require changes to existing code, which relies on progress in barrier().

Alternative 2: Have barrier and fence side by side. However, a barrier without progress is actually an operation on a team, not a container.

Of course, not only dash::Array is affected by this but all DASH containers.