-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch from fine grained locking throughout the code base to device and domain level locking #743
base: master
Are you sure you want to change the base?
Conversation
The sendrecv code includes a number of places that were referencing the device object solely to get the max_tag value. Simplify things and reduce cache usage by including a copy of max_tag in the endpoint structure. Signed-off-by: Brian Barrett <[email protected]>
Switch from fine grained locking and forcing Libfabric to support FI_THREAD_SAFE to a large domain-level lock held across all non-trivial Libfabric calls. This enabled us to lower support to FI_THREAD_DOMAIN in a later patch. Signed-off-by: Brian Barrett <[email protected]>
Switch from fine grained locking and forcing Libfabric to support FI_THREAD_SAFE to a large domain-level lock held across all non-trivial Libfabric calls. This enabled us to lower support to FI_THREAD_DOMAIN in a later patch. Signed-off-by: Brian Barrett <[email protected]>
Now that the transports have a big lock around shared accesses in either the domain or device, there is no need for the fine grained locking we had been doing. Remove all the locks from the various utility interfaces. Signed-off-by: Brian Barrett <[email protected]>
Lower the required threading limits to FI_THREAD_DOMAIN, with control progress set to FI_PROGRESS_CONTROL_UNIFIED. On Libfabric providers that use the utility completion queues, this will result in no locks for both the send/recv and cq polling calls inside Libfabric, since we already have domain-level exclusivity in the transports. Signed-off-by: Brian Barrett <[email protected]>
Remove the functionality to create an endpoint per thread. Originally, this was used to create a full domain/endpoint/cq set of objects per thread that created a communicator, but that got messy as the rdma transport created multiple endpoints per thread based on the comm in use. So we ended up in a place where we were creating a domain per thread (sometimes) and an endpoint per thread (always), which was messy but worked. With the switch from requesting FI_THREAD_SAFE to FI_THREAD_DOMAIN and the concurrent switch to domain-level locking for plugin operations, there really isn't much advantage to the endpoint per thread model, so this patch removes all that logic. Signed-off-by: Brian Barrett <[email protected]>
34ddea2
to
aea255c
Compare
The approach looks reasonable to me. When this PR is taken out of draft mode I may come up with some minor comments. |
LGTM -- but all I'm really looking for here is that the big fat lock is taken at every entrypoint and held for the duration. I'm mostly excited that this finally kills the per-req mutex we had -- way overkill considering it was only there for the sake of protecting two counters that just as easily could have been atomics. Would really like to see this get in ASAP, ideally before any holiday downtime. Any specific reason you still have it as WIP? Can rebase most of my branches atop this and test with it if it's just a matter of gaining confidence with testing. |
I halfheartedly worry that we're losing something here with the overall approach -- it's more natural for structures that expect to be shared across threads to take responsibility for their own consistency, and by doing so it's immediately clear at a glance whether something expects/tolerates usage across threads. I wouldn't ever block on it but it's worth mentioning. |
@@ -98,7 +98,6 @@ static size_t eager_max_size = 0; | |||
/* List of comms undergoing deferred cleanup */ | |||
static nccl_ofi_deque_t *s_comm_cleanup_list = NULL; | |||
static nccl_ofi_deque_t *r_comm_cleanup_list = NULL; | |||
static pthread_mutex_t comm_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing but I think you'll still need this one? The cleanup list is at process level, not a domain level.
Two threads could simultaneously close two different communicators, and they'll both try to add their communicators to these cleanup lists.
Switch from finegrained locking in all the various data structures to device and domain level locks. This means that we have slightly higher contention on the locks if we're in a multi-accelerator per process region and not using domain-per-thread (which we do on trainium). In return, the locking code is way easier to understand and we (literally) remove thousands of locks (possible we have way too many request structures). This probably isn't the right end state, but is a major step forward and a good point to checkpoint.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.