Skip to content

Commit

Permalink
rdma: simplify init / delay posting rx buffers
Browse files Browse the repository at this point in the history
Since NCCL generally uses a different thread for initializing the
plugin and creating communicators, the quick hack of leaving a mostly
configured endpoint from init just resulted in us leaking resources;
that endpoint was never actually going to be used.  The whole reason
for the refcount hack in init was a bug in EFA on P5en at launch where
destroying and rapidly creating a new QP when the old QP had rx buffers
attached could cause an error.

Rather than keep the whole endpoint around and leaking resources,
just delay the parts of the operation that were causing races until
after init, when the first communicator is created.  Now, endpoint
creation during init() doesn't post buffers and is immediately
destroyed, avoiding the whole leak.  And because either listen
or connect will be called before a process does any communication,
this doesn't impact correctness.

Signed-off-by: Brian Barrett <[email protected]>
  • Loading branch information
bwbarrett committed Jan 28, 2025
1 parent 2f90964 commit 57f424b
Show file tree
Hide file tree
Showing 2 changed files with 12 additions and 14 deletions.
7 changes: 0 additions & 7 deletions src/nccl_ofi_net.c
Original file line number Diff line number Diff line change
Expand Up @@ -310,14 +310,7 @@ int nccl_net_ofi_create_plugin(nccl_net_ofi_plugin_t **plugin_p)
(properties.regIsGlobal == 0) ? "false" : "true");
NCCL_OFI_INFO(NCCL_NET | NCCL_INIT, "Support for DMA-BUF registrations: %s",
(properties.dmabuf_support == 0) ? "false" : "true");
/* Cause release to not actually free the resources, to speed
* up initialization, since the very same resources will be
* recreated by NCCL soon after initialization to do real
* communication.
*/
base_ep->ref_cnt++;
ret = base_ep->release_ep(base_ep);
base_ep->ref_cnt--;
if (ret != 0) {
goto exit;
}
Expand Down
19 changes: 12 additions & 7 deletions src/nccl_ofi_rdma.c
Original file line number Diff line number Diff line change
Expand Up @@ -5200,6 +5200,12 @@ static int listen(nccl_net_ofi_ep_t *base_ep,

int dev_id = device->base.dev_id;

ret = post_rx_buffs(ep);
if (ret != 0) {
NCCL_OFI_WARN("Error posting rx buffers: %d", ret);
return ret;
}

/* Build handle */
memset(handle, 0, sizeof(nccl_net_ofi_conn_handle_t));
assert(sizeof(handle->ep_name) == sizeof(first_control_rail->local_ep_name));
Expand Down Expand Up @@ -6721,6 +6727,12 @@ static int connect(nccl_net_ofi_ep_t *base_ep,
return -EINVAL;
}

ret = post_rx_buffs(ep);
if (ret != 0) {
NCCL_OFI_WARN("Error posting rx buffers: %d", ret);
return ret;
}

/*
* Take appropriate actions based on connection stage of communicator.
*
Expand Down Expand Up @@ -7252,13 +7264,6 @@ static int nccl_net_ofi_rdma_domain_create_endpoint(nccl_net_ofi_domain_t *base_
goto error;
}

/* Post all rx buffers */
ret = post_rx_buffs(ep);
if (ret != 0) {
NCCL_OFI_WARN("Posting of rx buffers failed!");
goto error;
}

NCCL_OFI_TRACE(NCCL_NET, "RDMA endpoint %p for dev #%d is created",
ep,
device->base.dev_id);
Expand Down

0 comments on commit 57f424b

Please sign in to comment.