part: multithreading deadlocks fixes and safety checks #12935
This PR fixes multiple deadlocks and issues encountered with partitioned communications.

The first deadlocks occur when one thread is in `opal_progress` and others are working on the partitioned request:

- `MPI_Pready` could change the `req->flags` of a partition while the progress thread is testing it, leading to an edge case where `req->done_count` would be greater than the number of partitions.
- `MPI_Pready` could be overwritten by the progress thread.

Both were fixed by adding the array `req->part_ready`, where `MPI_Pready` marks the partitions that are ready to be sent. This prevents the progress engine from touching the state of a partition as long as it isn't ready. Since no atomic operations were added, this should have little to no impact on performance.
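
A minimal sketch of the idea, with hypothetical, simplified names and types (not the actual `mca_part_persist` code): `MPI_Pready` only writes its own partition's slot in `part_ready`, and the progress loop skips any partition whose slot is still zero, so it never reads or rewrites that partition's internal state too early.

```c
/* Hypothetical sketch of the part_ready handshake; field and function names
 * are illustrative, not the real mca_part_persist structures. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    size_t   partitions;   /* number of partitions in the request              */
    int32_t *part_ready;   /* written by MPI_Pready, read by the progress loop */
    int32_t *flags;        /* internal per-partition transfer state            */
    size_t   done_count;   /* partitions fully handled so far                  */
} part_request_t;

/* MPI_Pready side: the application thread only marks its partition ready;
 * it no longer touches flags or done_count directly. */
void mark_partition_ready(part_request_t *req, size_t p)
{
    req->part_ready[p] = 1;
}

/* Progress-engine side: partitions that are not marked ready are skipped,
 * so their state is never inspected or overwritten while MPI_Pready may
 * still be working on them.  No atomics, mirroring the PR's note. */
void progress_partitions(part_request_t *req)
{
    for (size_t p = 0; p < req->partitions; p++) {
        if (!req->part_ready[p] || req->flags[p]) {
            continue;              /* not ready yet, or already handled */
        }
        /* ... start/advance the transfer for partition p ... */
        req->flags[p] = 1;
        req->done_count++;         /* bounded by req->partitions */
    }
}
```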

I fixed another deadlock that I rarely encountered at the initialization of the part module: two `ompi_comm_idup` calls need to be done, both are started in the progress engine, and they prevent partitioned requests from progressing until they are done. Sometimes the second `ompi_comm_idup` would be marked as completed on one rank but not on the other, leading to a deadlock. This was fixed by doing one `ompi_comm_idup` at a time, with the side effect of slowing down the initialization (the first request must be done before starting the next `ompi_comm_idup`).
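
The ordering constraint is easiest to see against the public API (a sketch only; internally the part module drives `ompi_comm_idup` from its progress function rather than blocking in `MPI_Wait`): the second duplication is not started until the first one has completed.

```c
/* Sketch of the serialized duplication, expressed with the public MPI calls
 * for clarity; the part module does the equivalent internally with
 * ompi_comm_idup and its progress-driven state machine. */
#include <mpi.h>

void dup_two_comms_serialized(MPI_Comm comm, MPI_Comm *dup_a, MPI_Comm *dup_b)
{
    MPI_Request req;

    /* Finish the first duplication completely ... */
    MPI_Comm_idup(comm, dup_a, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* ... before the second one is even started.  With both in flight at
     * once, one rank could consider the second duplication complete while
     * another rank did not, and the ranks ended up waiting on each other. */
    MPI_Comm_idup(comm, dup_b, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```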

A very rare segfault caused by calling `mca_part_persist_free_req` while `ompi_part_persist.lock` was unlocked was also fixed.
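
The underlying rule, shown as a minimal self-contained sketch with hypothetical names and types (a pthread mutex stands in for `ompi_part_persist.lock`; this is not the actual `mca_part_persist` code): the request is only torn down while the module lock is held, so the progress function cannot be walking it at the same time.

```c
/* Hypothetical sketch of the locking rule behind this fix. */
#include <pthread.h>
#include <stdlib.h>

typedef struct persist_req {
    struct persist_req *next;
    /* ... per-request partition state ... */
} persist_req_t;

typedef struct {
    pthread_mutex_t lock;       /* plays the role of ompi_part_persist.lock */
    persist_req_t  *requests;   /* list walked by the progress function     */
} persist_module_t;

/* Unlink and free a request; the caller must hold mod->lock. */
static void free_req_locked(persist_module_t *mod, persist_req_t *req)
{
    for (persist_req_t **p = &mod->requests; *p != NULL; p = &(*p)->next) {
        if (*p == req) {
            *p = req->next;
            break;
        }
    }
    free(req);
}

/* Take the lock before freeing, instead of freeing while the lock may be
 * released and the progress function may still be looking at the request. */
void persist_request_free(persist_module_t *mod, persist_req_t *req)
{
    pthread_mutex_lock(&mod->lock);
    free_req_locked(mod, req);
    pthread_mutex_unlock(&mod->lock);
}
```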

I added many error checks to `mca_part_persist_progress`, ensuring no new deadlock can occur when an internal function fails.

For testing, I used a two-way ring exchange with a fixed number of partitions distributed among multiple OpenMP threads (a stripped-down sketch follows below). This was my original use case, in which I encountered all those issues, since all threads need to call `MPI_Pready` and `MPI_Parrived`. Using up to 128 cores with varying numbers of processes and threads showed no new deadlocks or communication issues.
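
The sketch below shows the general shape of such a test, not the exact benchmark used here: partition counts, buffer sizes, and the loop structure are illustrative, and it assumes an MPI 4.x library with partitioned communication and `MPI_THREAD_MULTIPLE` support.

```c
/* Two-way ring exchange with partitioned requests; partitions are spread
 * over OpenMP threads.  Compile with an MPI 4.x library and -fopenmp. */
#include <mpi.h>
#include <stdlib.h>

#define PARTITIONS 64      /* partitions per message (illustrative) */
#define PART_COUNT 1024    /* elements per partition (illustrative) */
#define ITERATIONS 100

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    double *sbuf[2], *rbuf[2];
    MPI_Request sreq[2], rreq[2];
    for (int d = 0; d < 2; d++) {
        sbuf[d] = malloc(sizeof(double) * PARTITIONS * PART_COUNT);
        rbuf[d] = malloc(sizeof(double) * PARTITIONS * PART_COUNT);
    }

    /* One partitioned send/recv pair per ring direction. */
    MPI_Psend_init(sbuf[0], PARTITIONS, PART_COUNT, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &sreq[0]);
    MPI_Psend_init(sbuf[1], PARTITIONS, PART_COUNT, MPI_DOUBLE, left, 1,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &sreq[1]);
    MPI_Precv_init(rbuf[0], PARTITIONS, PART_COUNT, MPI_DOUBLE, left, 0,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &rreq[0]);
    MPI_Precv_init(rbuf[1], PARTITIONS, PART_COUNT, MPI_DOUBLE, right, 1,
                   MPI_COMM_WORLD, MPI_INFO_NULL, &rreq[1]);

    for (int it = 0; it < ITERATIONS; it++) {
        MPI_Startall(2, sreq);
        MPI_Startall(2, rreq);

        /* Partitions are distributed among the OpenMP threads: every thread
         * calls MPI_Pready on its own partitions and polls MPI_Parrived. */
        #pragma omp parallel for schedule(static)
        for (int p = 0; p < PARTITIONS; p++) {
            for (int d = 0; d < 2; d++) {
                for (int i = 0; i < PART_COUNT; i++) {
                    sbuf[d][(size_t)p * PART_COUNT + i] = rank + it;
                }
                MPI_Pready(p, sreq[d]);

                int flag = 0;
                MPI_Parrived(rreq[d], p, &flag);   /* non-blocking arrival check */
                if (flag) {
                    /* partition p of direction d could be consumed here */
                }
            }
        }

        MPI_Waitall(2, sreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(2, rreq, MPI_STATUSES_IGNORE);
    }

    for (int d = 0; d < 2; d++) {
        MPI_Request_free(&sreq[d]);
        MPI_Request_free(&rreq[d]);
        free(sbuf[d]);
        free(rbuf[d]);
    }
    MPI_Finalize();
    return 0;
}
```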