UCP RMA: Fix put-based plugin #167
base: master
First batch of comments.
src/ucx_rma_plugin.c
Outdated
NCCL_PARAM(UCXAckDelay, "UCXPUT_ACK_DELAY", 1);
NCCL_PARAM(UCXAckSkip, "UCXPUT_ACK_SKIP", 0);
Why not:
- NCCL_PARAM(UCXAckDelay, "UCXPUT_ACK_DELAY", 1);
- NCCL_PARAM(UCXAckSkip, "UCXPUT_ACK_SKIP", 0);
+ NCCL_PARAM(UCXAckDelay, "UCX_PUT_ACK_DELAY", 1);
+ NCCL_PARAM(UCXAckSkip, "UCX_PUT_ACK_SKIP", 0);
fixed
char inflight; /* Chunk still being sent */
char reqs; /* Count request alive */
int sizes[NCCL_UCP_MAX_RECV];
unsigned short id; /* Id of the origin RTR again */
Why do we need two ids?
In general, to check whether the data has fully arrived (although in this case the struct occupies only one cache line).
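The double-id check mentioned in this reply can be sketched as follows. This is only an illustration, not the plugin's code: `demo_rtr_t`, `DEMO_MAX_RECV`, `demo_rtr_fill` and `demo_rtr_arrived` are hypothetical names, and the sketch assumes the transport makes the descriptor's bytes visible in address order, which this thread does not state.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_RECV 8 /* hypothetical stand-in for NCCL_UCP_MAX_RECV */

/* Hypothetical descriptor: the same id is stored at the start and at the end. */
typedef struct {
    unsigned short id_start;             /* id at the beginning of the struct */
    int            sizes[DEMO_MAX_RECV]; /* payload between the two ids */
    unsigned short id;                   /* same id again, at the end */
} demo_rtr_t;

/* Sender side: fill the descriptor before it is put to the peer.
 * Assumption: id 0 is reserved to mean "empty slot". */
static void demo_rtr_fill(demo_rtr_t *rtr, unsigned short id)
{
    assert(id != 0);
    memset(rtr->sizes, 0, sizeof(rtr->sizes));
    rtr->id_start = id;
    rtr->id       = id;
}

/* Receiver side: the descriptor counts as complete only when both
 * copies of the expected id are visible. */
static int demo_rtr_arrived(const volatile demo_rtr_t *rtr,
                            unsigned short expected)
{
    return (rtr->id_start == expected) && (rtr->id == expected);
}
```

A receiver would poll `demo_rtr_arrived()` on the slot and only read `sizes` once it returns non-zero. When the whole struct fits in one cache line, as noted above, the trailing id is mostly a safeguard.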
}
static ncclResult_t nccl_ucx_rma_connect(int dev, void *listen_handle,
                                         void **send_comm,
                                         ncclNetDeviceHandle_t **sendDevComm) {
Why not ncclNetDeviceHandle_v7_t?
To keep the main function on the latest struct definition (this also matches the implementation in ib_plugin.c). In any case, net_device.h already has:
typedef ncclNetDeviceHandle_v7_t ncclNetDeviceHandle_v8_t;
typedef ncclNetDeviceHandle_v8_t ncclNetDeviceHandle_t;
src/ucx_rma_plugin.c
Outdated
w->address = malloc(attr.address_length);
if (w->address == NULL) {
    ucp_worker_release_address(w->ucp_worker, attr.address);
    return ncclSystemError;
The worker also needs to be destroyed on this error path.
fixed
UCXCHECK(ucp_worker_query(w->ucp_worker, &attr));

if (attr.thread_mode != UCS_THREAD_MODE_MULTI) {
    INFO(NCCL_NET, "Thread mode multi is not supported");
Should this be an error?
fixed
src/ucx_rma_plugin.c
Outdated
req->comm = comm;
req->type = NCCL_UCP_TYPE_ISEND;
req->rtr_id = rtr->id_start;
req->inflight = (UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS);
- req->inflight = (UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS);
+ req->inflight = (UCS_PTR_IS_PTR(status_ptr));
fixed
if (UCS_PTR_IS_ERR(req->st)) {
    WARN("NET/UCX_RMA: am_send_nb failed");
    return ncclInternalError;
assert(tag != INT_MAX);
why?
We use this value (INT_MAX) to mark entries as consumed later, so it cannot be a valid tag.
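A minimal sketch of that sentinel idea, for readers following the thread. The names `demo_entry_t`, `DEMO_TAG_CONSUMED`, `demo_post` and `demo_consume` are hypothetical and the plugin's actual bookkeeping may differ; the point is only that reserving INT_MAX makes the assert necessary.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_SLOTS        4       /* hypothetical number of entries */
#define DEMO_TAG_CONSUMED INT_MAX /* reserved value: entry already matched */

/* Hypothetical posted-entry record. */
typedef struct {
    int tag;
    int size;
} demo_entry_t;

/* Post an entry. The reserved value must never be used as a real tag,
 * otherwise a live entry would look consumed. */
static void demo_post(demo_entry_t *entry, int tag, int size)
{
    assert(tag != DEMO_TAG_CONSUMED);
    entry->tag  = tag;
    entry->size = size;
}

/* Match the first live entry with this tag and mark it consumed.
 * Returns the entry index, or -1 when nothing matches. */
static int demo_consume(demo_entry_t *entries, int count, int tag)
{
    assert(tag != DEMO_TAG_CONSUMED);
    for (int i = 0; i < count; i++) {
        if (entries[i].tag == tag) {
            entries[i].tag = DEMO_TAG_CONSUMED;
            return i;
        }
    }
    return -1;
}
```

Because consumed entries carry the reserved value, a second lookup with the same tag skips them and finds the next live entry.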
comm->rem_fifo.tail++;
out:
if ((*request == NULL) && (comm->total == 0)) {
    ucp_worker_progress(comm->worker->ucp_worker);
Why is this needed?
Depending on the transport, we may need to progress the worker in order to receive the first remote puts.
(Could there also be timeouts to service?)
src/ucx_rma_plugin.c
Outdated
assert((UCS_PTR_STATUS(status_ptr) == UCS_INPROGRESS) ||
       (UCS_PTR_STATUS(status_ptr) == UCS_OK));
This is already checked two lines above.
removed
src/ucx_rma_plugin.c
Outdated
status = nccl_ucp_shared_put(comm, atp, sizeof(*atp), remote,
                             &req->inflight);
req->inflight += (status == UCS_INPROGRESS);
rtr->avail -= (status == UCS_INPROGRESS) || (status == UCS_OK);
- rtr->avail -= (status == UCS_INPROGRESS) || (status == UCS_OK);
+ rtr->avail -= !UCS_STATUS_IS_ERR(status);
fixed
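For reference, the two forms of the `avail` update in the suggestion above can be compared with a small standalone sketch. It does not include the UCX headers: `demo_status_t` and `DEMO_STATUS_IS_ERR` are hypothetical stand-ins, and the sketch assumes that error codes are negative and that OK and in-progress are the only non-negative codes.

```c
#include <assert.h>

/* Hypothetical stand-ins for the status codes (assumption: errors < 0). */
typedef enum {
    DEMO_OK         = 0,
    DEMO_INPROGRESS = 1,
    DEMO_ERR        = -2
} demo_status_t;

#define DEMO_STATUS_IS_ERR(status) ((status) < 0)

/* Original form: consume one slot when the put completed or is in flight. */
static int demo_avail_explicit(int avail, demo_status_t status)
{
    avail -= (status == DEMO_INPROGRESS) || (status == DEMO_OK);
    return avail;
}

/* Suggested form: consume one slot unless the put failed. */
static int demo_avail_not_err(int avail, demo_status_t status)
{
    avail -= !DEMO_STATUS_IS_ERR(status);
    return avail;
}
```

Under those assumptions both functions return the same value for every status; the second simply reads as "no error" instead of enumerating the success codes.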
Fix UCP RMA plugin.