Stream and Graph Based MPI Operations #5
Looking at the first set of slides, when MPI_Start_enqueue is called on a two-sided operation, when does the match occur, particularly with respect to other issued two-sided operations? When it's enqueued or when it's executed by the associated stream? E.g. if a sender calls:
Are the sends guaranteed to be matched in the order 0 then 1 (since start_enqueue happened before the isend), or could they be matched in the other order if the match doesn't happen until the stream actually executes the enqueued start? |
Matching would occur based on when the operation gets executed by the stream. I think of streams as being similar to threads. In this case, you have a thread |
Let me ask a slightly different question, then. Is MPI allowed to start matching when Start_enqueue is called, or does it have to wait until the stream triggers the progress engine to start matching? Is there any way the application could observe if MPI started the match early? Obviously MPI has to wait to move data in stream order, but it could be useful for implementations if the match could happen sooner. The main challenge in doing this would be that Start_enqueue is a local operation, and matching is a non-local operation, so you couldn't block Start_enqueue to do the match. |
Taking the thread analogy, the start_enqueue operation is simply putting a work descriptor into a queue, not performing an MPI operation in the traditional sense. Therefore, I would expect that MPI shouldn't start matching an enqueued operation until it's actually executed by whatever is taking work out of the queue and executing it. An exception to this would be partitioned operations, which match as soon as the persistent operation is initialized. |
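To make the work-queue reading above concrete, here is a minimal sketch in plain C. It is a toy model, not MPI: `toy_stream`, `toy_match_log`, `toy_isend`, `toy_start_enqueue`, and `toy_stream_execute` are hypothetical names invented for illustration, and the "stream" is drained synchronously by an explicit call rather than asynchronously by a device. It only shows that, if matching is tied to execution rather than to enqueueing, an operation enqueued first can reach the matching engine second.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are MPI or CUDA APIs. */
enum { TOY_MAX_OPS = 16 };

typedef struct {
    int pending[TOY_MAX_OPS];  /* tags of enqueued-but-not-executed sends */
    size_t head, tail;
} toy_stream;

typedef struct {
    int order[TOY_MAX_OPS];    /* tags in the order they reached matching */
    size_t count;
} toy_match_log;

/* An immediate send reaches the (modelled) matching engine right away. */
static void toy_isend(toy_match_log *log, int tag) {
    assert(log->count < TOY_MAX_OPS);
    log->order[log->count++] = tag;
}

/* Enqueueing only records a work descriptor; nothing is matched yet. */
static void toy_start_enqueue(toy_stream *s, int tag) {
    assert(s->tail < TOY_MAX_OPS);
    s->pending[s->tail++] = tag;
}

/* The stream drains its queue later; matching happens at this point. */
static void toy_stream_execute(toy_stream *s, toy_match_log *log) {
    while (s->head < s->tail)
        toy_isend(log, s->pending[s->head++]);
}
```

In this model, calling `toy_start_enqueue(&s, 0)`, then `toy_isend(&log, 1)`, then `toy_stream_execute(&s, &log)` leaves the log holding tag 1 followed by tag 0. A real stream runs asynchronously, so either order could occur in practice; the sketch shows only the interleaving in which the stream runs late.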
The reason I'm asking is that if MPI can optimistically match when Start_enqueue is called, it can potentially separate the non-local matching portion of the enqueued operation from on-stream one-sided data movement. This really comes down to how much ambiguity is in the standard. If the application cannot observe if MPI started matching when Start_enqueue is called, then MPI could match optimistically and enqueue the appropriate one-sided get or put on the stream, taking matching out of the stream's critical path. If the application can observe if MPI started the match early, however, then this API implicitly requires the stream to either perform matching itself or to synchronize with something else which performs the matching. EDIT: spelling corrections, removed incorrect thoughts on psend matching |
I think this comes back to the discussion on "logically concurrent" ambiguity in the standard. I made an analogy between enqueued operations and threads. If we strengthened that to a semantic (i.e. each queue has the MPI semantics of a thread), then it would be subject to the "logically concurrent" discussion that @Wee-Free-Scot has been leading. Should we land on the side that "logically concurrent" means that the application can't enforce a specific ordering (e.g. by synchronizing queued operations with CUDA events, either between streams or with the host CPU), then I think we could make optimizations like you described. To answer your last question -- with CUDA streams you can create an ordering across streams and with the host CPU using CUDA events. So, it would be possible for an application to observe that operations didn't match in the order that it attempted to create using such synchronization. I would expect the MPI communicator to still be the serialization point for matching when the same communicator is used across streams (ignoring relaxations possible with info keys). We could introduce new info assertions to optimize matching for queued operations. An interesting difference between queues and threads is that the MPI library can actually see queues because they're explicit in the API. |
Assume all above calls are issued on the same stream (the same communicator with a serial context); then mixing immediate operations between an "enqueue" operation and the next stream "synchronization" is undefined or illegal. The usage essentially breaks the "serial" context of a stream. An "immediate" operation is essentially an "enqueue" followed by an immediate "synchronization". This interpretation allows the "undefined" scenario. That is, it is possible, with a reasonable but likely counter-intuitive outcome. |
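The "immediate = enqueue + synchronization" reading can be sketched in plain C as follows. This is a toy model under that single assumption; `ser_stream`, `ser_enqueue`, `ser_synchronize`, and `ser_immediate` are hypothetical names, not part of MPI, MPICH, or any proposal text.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: a single serial stream context, not an MPI API. */
enum { SER_MAX_OPS = 16 };

typedef struct {
    int queue[SER_MAX_OPS];     /* tags waiting on the stream */
    size_t head, tail;
    int executed[SER_MAX_OPS];  /* tags in the order they were executed */
    size_t n_executed;
} ser_stream;

/* Append a work descriptor to the stream without executing it. */
static void ser_enqueue(ser_stream *s, int tag) {
    assert(s->tail < SER_MAX_OPS);
    s->queue[s->tail++] = tag;
}

/* Drain everything currently on the stream, in enqueue order. */
static void ser_synchronize(ser_stream *s) {
    while (s->head < s->tail)
        s->executed[s->n_executed++] = s->queue[s->head++];
}

/* Assumed reading: an immediate operation is an enqueue followed
   at once by a synchronization of the same stream. */
static void ser_immediate(ser_stream *s, int tag) {
    ser_enqueue(s, tag);
    ser_synchronize(s);
}
```

Under this reading, `ser_enqueue(&s, 0)` followed by `ser_immediate(&s, 1)` executes tag 0 before tag 1, because the immediate call flushes the earlier enqueued work first. The outcome is well defined in the model, but the immediate call silently waits on everything queued ahead of it, which may be the counter-intuitive part.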
Personally, I see this as sufficiently similar to forking a separate CPU thread to be able to reason about that analogous situation and draw conclusions that are valid for the enqueue situation. What I mean is -- calling
MPI_Send_init(&req[0]);
MPI_Request *ptrReq = &req[0];
pthread_create(&thread, NULL, &MPI_Start, &ptrReq);
MPI_Isend(&req[1]);
Comments:
|
Agreed! We started to play around with this new idea called "MPI Stream" -- pmodels/mpich#5908. @Wee-Free-Scot It certainly can use some of your early input. |
Clarification - the match order for persistent sends is defined by when Send_init is called, not by when Start (or Start_enqueue) is called, so perhaps there is no ambiguity in the matching order in the example I gave. This was a misunderstanding on my part.
|
This is only true for partitioned operations. Regular persistent point-to-point has always matched anew each time it is active (conceptually during the starting stage -- MPI_Start or MPI_Start_all -- although in practice the protocol messages might happen any time before the associated completion stage) whereas partitioned point-to-point matches once (conceptually during the initialisation stage -- MPI_Psend_init -- although in practice the protocol messages might happen any time before the first completion stage).
MPI-4.0 p107 lines 21-23 (comparison of partitioned and regular persistent point-to-point match order):
MPI-4.0 p107 lines 13-15 (partitioned point-to-point match order defined by initialisation procedure order):
MPI-4.0 p101 line 21 (regular persistent point-to-point must be started to permit matching):
MPI-4.0 p94 lines 24-27 (regular persistent point-to-point forms a half-channel):
Interestingly, I can find no slam-dunk quote from MPI-4.0 stating that regular persistent point-to-point matching order is determined by the starting procedure order. This is implied by the "half-channel" statement and by the "started with MPI_START" statement (quoted above) but there is no equivalent to the ordering statement made for nonblocking point-to-point (see below).
MPI-4.0 p74 lines 40-42:
Side-note: we should fix this omission -- we should add a new subsubsection "3.9.1 Semantics of Persistent Communications" and state explicitly the tribal knowledge of the semantic rules pertaining to these operations. |
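The contrast drawn in this comment between regular persistent and partitioned matching can be sketched as a small counting model in plain C. It is a toy under the conceptual description given above (regular persistent matches anew at each start; partitioned matches once at initialisation). `model_kind`, `model_op`, `model_init`, `model_start`, and `model_complete` are hypothetical names, and the model ignores when protocol messages actually travel.

```c
#include <assert.h>

/* Toy model only: these types and functions are not MPI APIs. */
typedef enum { MODEL_PERSISTENT, MODEL_PARTITIONED } model_kind;

typedef struct {
    model_kind kind;
    int active;       /* 1 between the starting and completion stages */
    int match_count;  /* how often matching has conceptually occurred */
} model_op;

/* Initialisation stage: only the partitioned kind matches here. */
static model_op model_init(model_kind kind) {
    model_op op = { kind, 0, 0 };
    if (kind == MODEL_PARTITIONED)
        op.match_count = 1;
    return op;
}

/* Starting stage: the regular persistent kind matches anew each time. */
static void model_start(model_op *op) {
    assert(!op->active);
    op->active = 1;
    if (op->kind == MODEL_PERSISTENT)
        op->match_count++;
}

/* Completion stage: the operation becomes inactive again. */
static void model_complete(model_op *op) {
    assert(op->active);
    op->active = 0;
}
```

After three start/complete cycles, the persistent operation in this model has matched three times and the partitioned one still only once, which is the difference the quoted passages are meant to establish.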
I have responded with some initial thoughts on the linked issue. Thanks for taking the time to write up your idea clearly. |
Thanks, Dan, this is the source of my confusion. I had initially thought matching was in Start order, read the partitioned spec which explicitly states that they match in order, wanted to make sure about regular persistent requests, read carefully about the state of MPI requests, and then saw the various quotes you provided. From a standards terminology perspective, my uncertainty stems from whether:
This quote from p.71 seems relevant to point 2:
But this could just be that I’m still not fully familiar with all of the relevant terminology and latest abstractions in the standard yet. If so, please excuse me while I get up to speed with the standard. It’s been many years since I waded into it in depth and a lot has changed. |
I think some confusion is arising from trying to determine whether MPI requests are (or exhibit) operation stages, which they are (do) not. The initialisation stage of an MPI operation creates an MPI request that represents that operation. The MPI operation is inactive (because the starting stage has not been done yet). We call the request an inactive request because it is a request that represents an inactive operation. The starting stage of an MPI operation changes the state of the operation from inactive to active. Any request that represents this operation is now called an active request because it is a request that represents an active operation. The completion stage of an MPI operation changes the state of the operation from active to inactive. Any request that represents this operation is now called an inactive request because it is a request that represents an inactive operation. The freeing stage of an MPI operation deallocates/destroys the request that represents the operation. This can be gleaned (we hope) from the state transition diagrams provided in the Terms chapter (see MPI-4.0 §2.4.1). There is no API to discover-without-permitting-change the stage of the operation represented by a request. Thus,
|
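The four stages described in the preceding comment can be written down as a small state machine in plain C. This is a simplified sketch, not an MPI API: `op_state`, `op_stage`, and `op_apply` are hypothetical names, and the model deliberately rejects every transition other than the four named above (real MPI is more permissive in places, for example in how it treats completion calls on inactive requests).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: not an MPI API. */
typedef enum {
    OP_NONE,      /* no request exists (before init or after free) */
    OP_INACTIVE,  /* request exists; operation not started */
    OP_ACTIVE     /* request exists; operation started, not completed */
} op_state;

typedef enum { STAGE_INIT, STAGE_START, STAGE_COMPLETE, STAGE_FREE } op_stage;

/* Apply one stage. Returns false and leaves *state unchanged when the
   transition is not one of the four modelled here. */
static bool op_apply(op_state *state, op_stage stage) {
    assert(state);
    switch (stage) {
    case STAGE_INIT:      /* creates an inactive request */
        if (*state != OP_NONE) return false;
        *state = OP_INACTIVE; return true;
    case STAGE_START:     /* inactive -> active */
        if (*state != OP_INACTIVE) return false;
        *state = OP_ACTIVE; return true;
    case STAGE_COMPLETE:  /* active -> inactive */
        if (*state != OP_ACTIVE) return false;
        *state = OP_INACTIVE; return true;
    case STAGE_FREE:      /* destroys the request */
        if (*state != OP_INACTIVE) return false;
        *state = OP_NONE; return true;
    }
    return false;
}
```

Note that the state lives with the operation; the request is only labelled "active" or "inactive" according to the operation it represents, which is the distinction the comment is drawing.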
The distinction between the state of MPI requests and the state of MPI operations was indeed my main source of confusion - I realized that right before your message came in, but thank you for the clarification. That said, I'm not sure the standard is clear on whether matching happens as part of initializing an operation (when all information needed for matching is available), or starting an operation (when the data buffers become available). Would not this formulation also capture what the standard is trying to do?
That is, the standard is ambiguous on whether MPI operations match when they are initialized (and all necessary information is available to match them) or when they are starting. This could be resolved, including the distinction between non-partitioned and partitioned communications, either by:
|
The standard is not as clear as I would like (okay, that means ambiguous, doesn't it) about whether the matching order for persistent point-to-point operations is determined by their initialisation procedure calls or their starting procedure calls. It is, however, common knowledge that the latter is the correct interpretation, and the former interpretation would cause a great deal of surprise to all users and implementors.
|
What
Support for enqueueing MPI operations into accelerator work queues (streams) and compute graphs.
Why
Integration of communication with the computation scheduling model for accelerators improves the programming model, can improve communication/computation overlap, and reduces overheads.
Implementations
Slides
Papers