
Missing progress in barriers #313

Closed

devreal opened this issue Mar 3, 2017 · 36 comments

Comments

@devreal
Member

devreal commented Mar 3, 2017

The following code may deadlock:

dash::Array<int> arr(N);
if (dash::myid() == 0) {
  for (int i = 0; i < N; ++i) {
    arr[i] = 0; // <- potential deadlock
  }
}
arr.barrier();

The reason here is that the underlying dart_barrier (implemented in terms of MPI_Barrier) does not guarantee progress on any segment of the team.

I thus propose to implement dart_fence(dart_gptr_t) in terms of MPI_Win_fence, which behaves like a barrier but also guarantees that all outstanding RMA requests are completed upon return, and implement dash::Array::barrier() in terms of dart_fence. My guess would be that this behavior of arr.barrier() is what users actually expect (providing progress on the underlying allocation).

Alternative 1: Replace dash::Array::barrier() with dash::Array::fence(). This would require changes to existing code that relies on progress in barrier().
Alternative 2: Have barrier and fence side-by-side. However, a barrier without progress is actually an operation on a team, not a container.

Of course, not only dash::Array is affected by this but all DASH containers.
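
For illustration, here is a rough sketch of such a fence in plain MPI, using a collectively allocated window (names and sizes are illustrative only; as discussed further down in this thread, MPI_Win_fence belongs to active-target synchronization, so it only applies if the window is actually used in fence epochs):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank, nunits;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nunits);

  /* one int per unit, analogous to one block of the dash::Array */
  int *block;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &block, &win);
  *block = -1;

  /* MPI_Win_fence acts like a barrier on the window and separates RMA
   * epochs: all operations of the previous epoch are complete at origin
   * and target when it returns. */
  MPI_Win_fence(0, win);
  if (rank == 0) {
    int zero = 0;
    for (int target = 0; target < nunits; ++target)
      MPI_Put(&zero, 1, MPI_INT, target, 0, 1, MPI_INT, win);
  }
  MPI_Win_fence(0, win);  /* remote writes are now visible everywhere */

  printf("unit %d: value = %d\n", rank, *block);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}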

@fuchsto
Member

fuchsto commented Mar 3, 2017

I totally agree.
In fact, I added the dart_flush* functions to the DART interface and implemented them for dart-mpi some time ago for exactly this reason:

https://github.com/dash-project/dash/blame/development/dart-impl/mpi/src/dart_communication.c#L874

Back then, there was discussion about remote completion in the context of the dart-gaspi implementation without a final resolution, so I didn't integrate it in DASH.

But by now, I think we agree that it should be done.

EDIT: Sorry, you were referring to MPI_Win_fence. Yes, I didn't use MPI_Win_fence because there is no corresponding concept in GASPI.
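
For context, a minimal sketch of the passive-target flush semantics behind such calls (window handling simplified, not the DART sources): a flush completes only the caller's own operations to a target; it is not a collective barrier.

#include <mpi.h>

/* Sketch of what a dart_flush-style call provides under passive-target
 * MPI RMA (simplified, not the DART sources):
 *  - MPI_Win_flush_local: the caller's operations to 'target' are
 *    locally complete (origin buffers reusable).
 *  - MPI_Win_flush: they are also remotely complete at the target.
 * Neither call says anything about operations issued by other units. */
void put_with_flush(MPI_Win win, int target, const int *value)
{
  MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
  MPI_Put(value, 1, MPI_INT, target, /* target_disp */ 0, 1, MPI_INT, win);
  MPI_Win_flush_local(target, win);  /* 'value' may be reused now       */
  MPI_Win_flush(target, win);        /* data is visible at the target   */
  MPI_Win_unlock(target, win);
}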

@devreal
Member Author

devreal commented Mar 3, 2017

I thought about flushes as well. However, you have no guarantee that a flush hits all outstanding (remotely issued) RMA operations. A fence is the only viable option there. The question is just how we integrate it 😄

Oh, and as far as I know, dart_flush and the like are in development 😉

@fuchsto
Member

fuchsto commented Mar 3, 2017

Well, let's implement dart_fence and use it in the Containers?

We should not use MPI-lingo in DASH, though. I will post a concept proposal here.

@fuchsto
Member

fuchsto commented Mar 3, 2017

... actually this nicely fits to the point-to-point synchronization of memory spaces I'm currently specifying. So I will extend these specs by fencing and report back.

@devreal
Member Author

devreal commented Mar 3, 2017

Well, let's implement dart_fence and use it in the Containers?

Just preparing a PR for that. The question is rather how this is integrated in Containers (see the alternatives above).

@fuchsto
Member

fuchsto commented Mar 3, 2017

Let's create another (dependent) PR for the integration in containers.
From an uptight SOP perspective, the features are unrelated after all ;)

@rkowalewski

uhm... wait! We are mixing apples and oranges. Applying MPI fences in our synchronization model is actually undefined behavior and erroneous. First, MPI fences are a concept of active target synchronization, while we only use passive target. Second, progress and completion are two independent concepts, and MPI does not specify anything regarding progress; that is purely implementation-specific. So the problem mentioned above will probably only occur with specific MPI libraries, and I would guess that you used Open MPI.
So what I would suggest is a more detailed discussion before implementing anything that is semantically not valid.

@rkowalewski rkowalewski self-assigned this Mar 3, 2017
@fuchsto
Member

fuchsto commented Mar 3, 2017

Aaaah, of course! MPI_Win_fence logically corresponds to closing and re-opening the epoch, right?

@rkowalewski

Yes and it has different consistency guarantees compared to a flush.

@fuchsto
Member

fuchsto commented Mar 3, 2017

Yes, sorry, didn't pay attention here.
@HuanZhou2 contributed an actual solution, but it is based on an ancient forked codebase and has not been merged.
The fact that MPI does not specify progress guarantees was the motivation for her work in the first place.

#54

@rkowalewski

rkowalewski commented Mar 3, 2017

There are some workarounds. The only "portable" one I know of so far that does not need an additional progress thread is an MPI probe like the following:

int flag = 0;
do {
  /* polling Iprobe drives the MPI progress engine until a message arrives */
  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_DART_WORLD, &flag, MPI_STATUS_IGNORE);
} while (!flag);

It does nothing but trigger the progress engine and wait for whatever one-sided message arrives.

@devreal
Member Author

devreal commented Mar 3, 2017

@rkowalewski Shoot, I forgot that we are only working in passive mode. I guess I was too excited, so thanks for the correction. However, your solution is not a real one here because it is unknown how many RMA requests we have to wait for. Unless, of course, we use conventional messages during a custom fence to signal completion between all processes, which sounds like a bad idea.

@devreal
Member Author

devreal commented Mar 3, 2017

How about MPI_Ibarrier and MPI_Wait? Would that trigger the progress engine?

@rkowalewski

Actually, MPI_Iprobe is asynchronous and does not semantically enforce anything regarding one-sided communication. It is only a dirty hack that happens to work. It does not even have to match the number of one-sided messages; it only triggers the progress engine periodically, and as soon as one message arrives you can assume that all messages arrive.
I do not know about the Ibarrier/Wait construct, so we would have to test it. But I would assume no, because why should a non-blocking barrier trigger progress while a blocking one does not?

@fuchsto
Member

fuchsto commented Mar 3, 2017

@devreal Well, if I understand the standard correctly, there is no specification of when the progress engine is triggered, or even that a progress engine exists.
So workarounds like MPI_Iprobe are best practice in the sense that "they usually work for known implementations", but the standard doesn't guarantee it.

@devreal
Member Author

devreal commented Mar 3, 2017

Any experience in combining MPI_Ibarrier with MPI_Test and MPI_Win_flush in between?

@rkowalewski

Correct, and that is why I call it a workaround and a dirty hack.

@rkowalewski

In terms of the MPI standard, DART conforms to the semantics. It is an MPI problem in specific libraries, and while the implementations are getting better, it should be fixed in future versions.
I would suggest examining which libraries suffer from this issue and proposing specific workarounds for them. I assume this problem only happens with Open MPI. It is not worth implementing something general if it is an issue only in specific scenarios.

@devreal
Member Author

devreal commented Mar 3, 2017

Well, the problem is that MPI does not guarantee progress, but our programming model relies on progress. So it is the other way around: we need to ensure progress ourselves and cannot rely on MPI to provide it. Hence, we need to come up with a general solution. I'm experimenting with MPI_Ibarrier right now and will report back.

@rkowalewski

rkowalewski commented Mar 3, 2017

Yes, you are right, and this is the original purpose of issue #54. Integrating that will solve many problems. :)

@devreal
Member Author

devreal commented Mar 3, 2017

I'm not sure we should make that our default as it adds quite some complexity but we should discuss this in Garching.

The solution using MPI_Ibarrier seems to work with OpenMPI:

  MPI_Request req;
  MPI_Ibarrier(comm, &req);
  int flag = 0;
  MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
  while (!flag) {
    /* the flush keeps driving progress on the window until all units
       have reached the barrier */
    MPI_Win_flush_local_all(team_data->window);
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
  }

@rkowalewski

OK, sounds good. It is more or less the same solution as the Iprobe one. I assume it also works if you move the local flush out behind the while loop, doesn't it?
The only argument against this solution from a semantic perspective is that we are now mixing process synchronization (barrier) and memory consistency (flush) to achieve progress. So you cannot achieve any progress unless all units in a team participate. That is the advantage of the Iprobe.
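
For reference, the variant asked about would look roughly like this (untested; comm and team_data->window are taken from the snippet above), leaving open whether MPI_Test alone drives progress on the window:

  MPI_Request req;
  MPI_Ibarrier(comm, &req);
  int flag = 0;
  while (!flag) {
    /* relies on MPI_Test alone to drive progress on the window */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
  }
  MPI_Win_flush_local_all(team_data->window);  /* single flush after the barrier */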

@rkowalewski

But anyway it is a workaround. So let's keep it as is and discuss that in Garching. :)

@devreal
Member Author

devreal commented Mar 5, 2017

The only argument against this solution from a semantic perspective is that we are now mixing process synchronization (barrier) and memory consistency (flush) to achieve progress. So you cannot achieve any progress unless all units in a team participate. That is the advantage of the Iprobe.

No. The Iprobe solution and my solution are not comparable for another reason: the solution I proposed has the same semantics as MPI_Win_fence but for passive target mode (blocking until all locally and remotely issued RMA operations are guaranteed to have finished), which is exactly what the use case I presented requires. The Iprobe solution, on the other hand, does not provide this guarantee: there might be remotely issued RMA operations after the Iprobe succeeded and before all processes reach the barrier, so there is still potential for a deadlock. You just cannot rely on the timing between the put/get and the Iprobe, so the Ibarrier is necessary. However, it might not matter what is actually used to trigger progress on the window; the MPI_Test in the loop could be sufficient (but then, why doesn't the normal barrier trigger the progress engine?)

@devreal
Member Author

devreal commented Mar 5, 2017

I pushed a version of dart_fence and fixed documentation for the flush functions: https://github.com/dash-project/dash/compare/feat-dart-fence

There is another issue I realized while working on this: all RMA operations in one team are issued on the same dynamic window, to which the allocated segments are attached. The same is true for synchronization operations, i.e., dart_flush and the like. As a consequence, we cannot synchronize on individual allocations but only on whole teams. I'm not sure what the reasoning behind this decision is, but it does not comply with the semantics described in the documentation (synchronization on individual segments), and it increases the memory footprint of each allocated segment because of the book-keeping required for translating relative addresses (in the gptr) to absolute addresses (used with dynamic windows).

@fuerlinger @HuanZhou2 Do you remember why we use a dynamic window instead of individual windows for each allocation/registration?

@HuanZhou2

Sorry, I just saw the issue. I think Roger has already explained the progress issue to some extent; I'd like to add a little more. Normally, the polling-MPI_Test method is a typical way to progress an outstanding MPI operation if there is no separate progress thread (just like what @devreal has proposed). Regarding this method, you can refer to the paper "Implementation and performance analysis of non-blocking collective operations for MPI". What I applied for progress is an independent progress mechanism (instead of frequent checking) for the non-blocking one-sided communication operations.
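
To illustrate the "independent progress" idea (this is only a generic sketch, not the implementation from the forked codebase referenced in #54), a dedicated thread can keep the progress engine busy while the application thread computes; it assumes MPI was initialized with MPI_THREAD_MULTIPLE:

#include <mpi.h>
#include <pthread.h>
#include <sched.h>

/* Generic sketch of an independent progress thread: it repeatedly calls
 * MPI_Iprobe, which drives the progress engine so that incoming one-sided
 * operations are serviced even while the application thread computes.
 * Requires MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...). */
static volatile int progress_shutdown = 0;

static void *progress_thread(void *arg)
{
  MPI_Comm comm = *(MPI_Comm *)arg;
  while (!progress_shutdown) {
    int flag;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, MPI_STATUS_IGNORE);
    sched_yield();  /* be polite: do not burn a full core while idle */
  }
  return NULL;
}

The application would start this thread after initialization and set progress_shutdown before MPI_Finalize; the approach in #54 is more sophisticated, this only shows the principle.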

@HuanZhou2

@devreal, the reason I used the dynamic window instead of individual windows is to amortize the window-creation overhead as much as possible.
Imagine using normal windows: every memory allocation would require a heavyweight, collective window-creation operation. In contrast, with the dynamic window we only need to invoke MPI_Win_attach locally, which brings comparably low overhead.

@devreal
Member Author

devreal commented Mar 6, 2017

@HuanZhou2 Thanks for the explanation. However, with the shared-memory optimization this does not hold anymore because we create a new window on each allocation anyway, right?

@rkowalewski

While I did not participate in this design decision, it is clear that we cannot really model one-sided communication with active target synchronization, as it always requires collective fences. This model may be a good fit for applications with regular access patterns (e.g. stencils). However, DASH provides data structures which should enable irregular random access patterns. Yes, we provide algorithms as well, but in principle programmers can do whatever they want without using our algorithms. Active target would not allow that because of its "static" synchronization model.

The same is true for synchronization operations, i.e., dart_flush and the like. As a consequence, we cannot synchronize on individual allocations but only on whole teams. I'm not sure what the reasoning behind this decision is, but it does not comply with the semantics described in the documentation (synchronization on individual segments), and it increases the memory footprint of each allocated segment because of the book-keeping required for translating relative addresses (in the gptr) to absolute addresses (used with dynamic windows).

Window creation, communication and synchronization are decoupled in MPI-3 RMA:

  • The creation of a window is obviously collective. Allocating and attaching memory is independent and not collective in MPI RMA; however, it is collective in DART. This is more or less required by our design, since we want to be able to access any address in our global address space, so all units have to broadcast their registered addresses.
  • Communication and synchronization are decoupled as well: you can communicate with a specific unit and flush pending operations (whether locally or remotely) to that specific unit. This is only possible with passive target synchronization, not with active target. Our design does not exploit this possibility, and there should definitely be a mechanism for fine-granular point-to-point synchronization (as @fuchsto pointed out; see the sketch below). This may again be an interesting discussion for the meeting in Garching, because it would make dynamic DASH data structures more efficient as well.

In summary, active target synchronization is not a really good fit for our programming model, and progress for non-blocking operations cannot be guaranteed even with MPI fences. There may be a situation where units apply an MPI fence and wait for one latecomer unit that is occupied by local computation and does not call any MPI routine.
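
To make the decoupling in the second bullet concrete, a minimal passive-target sketch (the window and the neighbour ranks are placeholders, not DART code):

#include <mpi.h>

/* One lock_all epoch; communication and synchronization can then be
 * issued per target independently (window and ranks are placeholders). */
void update_neighbours(MPI_Win win, int left, int right, int value)
{
  MPI_Win_lock_all(0, win);

  MPI_Put(&value, 1, MPI_INT, left, 0, 1, MPI_INT, win);
  MPI_Win_flush(left, win);   /* completes only the operations to 'left'  */

  MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
  MPI_Win_flush(right, win);  /* completes only the operations to 'right' */

  MPI_Win_unlock_all(win);
}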

@HuanZhou2

@devreal, I think you may have misunderstood the mixed usage of the shared-memory window and the dynamic window. They actually span the same region of memory but serve different kinds of data transfers: the shared-memory window handles intra-node transfers and the dynamic window handles inter-node transfers.

@devreal
Member Author

devreal commented Mar 6, 2017

I think you may have misunderstood the mixed usage of the shared-memory window and the dynamic window. They actually span the same region of memory but serve different kinds of data transfers: the shared-memory window handles intra-node transfers and the dynamic window handles inter-node transfers.

I totally understand that. My point was that with the use of shared-memory windows, the argument of using dynamic windows to reduce the number of allocated windows is not valid anymore, because we allocate a window on every allocation anyway (using MPI_Win_allocate_shared instead of just MPI_Alloc_mem). So in fact, one optimization (using shared-memory windows to directly access memory on the same node) renders another optimization (using dynamic windows to reduce the number of allocated windows) useless.

@HuanZhou2

@devreal, below is the logic:
Assume there are two methods for allocating collective memory:

  1. MPI_Win_allocate_shared(..., &mem, &shmem_win);
     MPI_Win_create(mem, ..., &win); // this window needs to be created in any case for the inter-node data transfers

This is the scheme you are in favour of, if I understand it correctly.

  2. MPI_Win_allocate_shared(..., &mem, &shmem_win);
     MPI_Win_attach(d_win, mem, ...); // d_win already exists; we just attach the allocated memory to it, and d_win serves the inter-node data transfers in any case

Comparing these two methods, we can see that MPI_Win_attach is much lighter than MPI_Win_create.

I'd like to add that MPI_Win_allocate_shared creates the shared-memory window object, which only serves intra-node data transfers. For inter-node data transfers we cannot simply use shmem_win; we need another "remote-access" window, such as a normally created window or a dynamic window.
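
A compilable sketch of the two schemes in plain MPI (allocation sizes and names are illustrative only; the dynamic window in the second scheme would be created once per team with MPI_Win_create_dynamic):

#include <mpi.h>
#include <stdlib.h>

/* Scheme 1: a dedicated window per allocation (collective and heavy). */
MPI_Win allocate_with_own_window(MPI_Comm comm, MPI_Aint nbytes, void **mem)
{
  MPI_Win win;
  *mem = malloc((size_t)nbytes);
  MPI_Win_create(*mem, nbytes, 1, MPI_INFO_NULL, comm, &win);
  return win;
}

/* Scheme 2: attach each allocation to an existing dynamic window
 * (created once per team via MPI_Win_create_dynamic); the attach is
 * local and cheap, but base addresses still have to be exchanged so
 * other units can compute displacements into the dynamic window. */
void allocate_and_attach(MPI_Win dyn_win, MPI_Aint nbytes, void **mem)
{
  *mem = malloc((size_t)nbytes);
  MPI_Win_attach(dyn_win, *mem, nbytes);
}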

@rkowalewski

rkowalewski commented Mar 6, 2017

My point was that with the use of shared-memory windows, the argument of using dynamic windows to reduce the number of allocated windows is not valid anymore, because we allocate a window on every allocation anyway (using MPI_Win_allocate_shared instead of just MPI_Alloc_mem). So in fact, one optimization (using shared-memory windows to directly access memory on the same node) renders another optimization (using dynamic windows to reduce the number of allocated windows) useless.

Note, however, that the allocation of a shared-memory segment does not correspond to a team but to a sub-team which contains all units on a shared-memory node. And, as @HuanZhou2 explains, MPI requires different windows for shared memory and distributed memory, so in order to make the segment visible to the global team we have to attach the shared-memory segment to the global (dynamic) window. That does not mean that one optimization renders the other useless, since the communication becomes much more efficient.
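
Putting the two comments together, a hedged sketch of the combined scheme (all names are placeholders, not the DART implementation):

#include <mpi.h>

/* Combined scheme sketched from the discussion above: allocate the
 * segment as a shared-memory window on the node-local sub-team and
 * additionally attach the same memory to the team's dynamic window
 * for inter-node RMA.  All names are placeholders. */
void allocate_segment(MPI_Comm team_comm, MPI_Win dyn_win, MPI_Aint nbytes,
                      void **mem, MPI_Win *shmem_win, MPI_Comm *node_comm)
{
  /* Sub-communicator with all units of the team on this shared-memory node
   * (in practice created once and cached per team). */
  MPI_Comm_split_type(team_comm, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, node_comm);

  /* Collective only within the node; intra-node accesses use shmem_win. */
  MPI_Win_allocate_shared(nbytes, 1, MPI_INFO_NULL, *node_comm,
                          mem, shmem_win);

  /* Local and non-collective: make the memory reachable for inter-node
   * RMA through the team's dynamic window. */
  MPI_Win_attach(dyn_win, *mem, nbytes);
}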

@devreal
Member Author

devreal commented Mar 6, 2017

Ahh, of course. The collective MPI_Win_allocate_shared is done on the shared communicator, which only involves the processes on the same node. Makes sense then.

On another note: I wasn't able to reproduce the deadlock in a standalone test case. After poking around with my original test I realized that one process was stuck elsewhere (gdb showed it to be in MPI as well, but it was actually working on a different window). So it seems that MPI_Barrier does indeed cause progress (as confirmed by Rolf Rabenseifner in an offline discussion) and there is no need for a dedicated dart_fence. I'm marking this as invalid. Learned a lot, thanks y'all 👍

@devreal devreal closed this as completed Mar 6, 2017
@devreal devreal added the invalid label Mar 6, 2017
@fuchsto
Member

fuchsto commented Mar 6, 2017

@HuanZhou2 Thank you for engaging in this discussion, much appreciated!

So, long story short: we should put effort into integrating Huan's approach. There already is a dedicated issue for this task (#54), so we should continue further discussion there.

@HuanZhou2

Always welcome :). For sure, I will keep the task in mind and try to make progress together with you guys.
