Commit
rdma: simplify init / delay posting rx buffers
Since NCCL generally uses a different thread for initializing the plugin and creating communicators, the quick hack of leaving a mostly configured endpoint around from init just resulted in us leaking resources; that endpoint was never actually going to be used.

The whole reason for the refcount hack in init was a bug in EFA on P5en at launch, where destroying a QP that had rx buffers attached and then rapidly creating a new QP could cause an error. Rather than keep the whole endpoint around and leak resources, just delay the part of the operation that was causing the race (posting rx buffers) until after init, when the first communicator is created.

Now, the endpoint created during init() doesn't post buffers and is immediately destroyed, avoiding the whole leak. And because either listen or connect will be called before a process does any communication, this doesn't impact correctness.

Signed-off-by: Brian Barrett <[email protected]>
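As an illustration of the approach described above, here is a minimal sketch in C of deferring rx buffer posting from endpoint creation to the first listen or connect call. It is not the plugin's actual code: every name here (`struct endpoint`, `endpoint_create`, `endpoint_post_rx_once`, `plugin_init`, `RX_POOL_SIZE`, and so on) is hypothetical, and buffer posting is simulated with a counter rather than real libfabric calls. It also assumes single-threaded use; a real implementation would presumably need a lock around the lazy posting step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an RDMA endpoint; not the plugin's real type. */
struct endpoint {
    bool rx_posted;   /* have receive buffers been posted yet? */
    int  rx_buffers;  /* number of buffers currently posted (simulated) */
};

enum { RX_POOL_SIZE = 4 };  /* assumed pool size, for illustration only */

/* Counts endpoints that exist, so a leak would show up as non-zero. */
static int live_endpoints = 0;

/* Creation configures the endpoint but deliberately posts no rx buffers. */
struct endpoint *endpoint_create(void)
{
    struct endpoint *ep = calloc(1, sizeof(*ep));
    if (ep != NULL)
        live_endpoints++;
    return ep;
}

void endpoint_destroy(struct endpoint *ep)
{
    if (ep == NULL)
        return;
    live_endpoints--;
    free(ep);
}

/* Post rx buffers the first time the endpoint is actually used.
 * Not thread-safe as written; real code would need synchronization. */
static int endpoint_post_rx_once(struct endpoint *ep)
{
    if (ep->rx_posted)
        return 0;
    ep->rx_buffers = RX_POOL_SIZE;  /* stands in for posting real buffers */
    ep->rx_posted = true;
    return 0;
}

/* init(): create an endpoint to validate the configuration, then destroy
 * it immediately. No buffers were posted, so nothing is left attached. */
int plugin_init(void)
{
    struct endpoint *ep = endpoint_create();
    if (ep == NULL)
        return -1;
    assert(!ep->rx_posted);
    endpoint_destroy(ep);
    return 0;
}

/* Either listen or connect runs before any communication, so both
 * trigger the deferred posting. */
int endpoint_listen(struct endpoint *ep)  { return endpoint_post_rx_once(ep); }
int endpoint_connect(struct endpoint *ep) { return endpoint_post_rx_once(ep); }
```

In this sketch, calling `plugin_init()` leaves `live_endpoints` at zero, and an endpoint created afterwards has no buffers posted until `endpoint_listen()` or `endpoint_connect()` is called on it; a second call is a no-op.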