part: multithreading deadlocks fixes and safety checks #12935

Open · wants to merge 6 commits into main

Conversation

@Keluaa commented on Nov 26, 2024

This PR fixes multiple deadlocks and other issues encountered with partitioned communication.

The first deadlocks occur when one thread is in opal_progress while others are working on the same partitioned request:

  • MPI_Pready could change the req->flags of a partition while the progress thread was testing it, leading to an edge case where req->done_count could become greater than the number of partitions
  • The changes made by MPI_Pready could be overwritten by the progress thread

Both were fixed by adding the req->part_ready array, in which MPI_Pready marks partitions as ready to be sent.
This prevents the progress engine from touching the state of a partition until it is ready.
Since no atomic operations were added, this should have little to no impact on performance.
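
A minimal sketch of that split (all names except part_ready, flags and done_count are illustrative, not the actual mca_part_persist code): MPI_Pready only ever writes the per-partition ready markers, while only the progress thread writes the partition flags and the done count, so no field has two writers.

```c
/* Illustrative sketch only; not the real module structures. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t  partitions;
    bool   *part_ready;   /* written only by user threads (MPI_Pready) */
    bool   *flags;        /* written only by the progress thread       */
    size_t  done_count;   /* owned by the progress thread              */
} part_req_sketch_t;

/* Called from any user thread. */
static void pready_sketch(part_req_sketch_t *req, size_t partition)
{
    req->part_ready[partition] = true;   /* never touches req->flags */
}

/* Called from the progress thread only. */
static void progress_sketch(part_req_sketch_t *req)
{
    for (size_t i = 0; i < req->partitions; i++) {
        if (!req->part_ready[i] || req->flags[i]) {
            continue;   /* not ready yet, or already handled */
        }
        /* ... start the transfer of partition i ... */
        req->flags[i] = true;
        req->done_count++;   /* can no longer exceed req->partitions */
    }
}
```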

I fixed another deadlock that I rarely encountered at the initialization of the part module: two ompi_comm_idup calls are needed, both are started from the progress engine, and partitioned requests cannot progress until they are done.
Sometimes the second ompi_comm_idup would be marked as completed on one rank but not on the other, leading to a deadlock.
This was fixed by doing one ompi_comm_idup at a time, with the side effect of slowing down the initialization (the first request must complete before the next ompi_comm_idup is started).
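
At the MPI level, the resulting ordering looks like the sketch below: the first duplication completes before the second one is started. (The module uses the internal ompi_comm_idup and its own progress loop rather than MPI_Comm_idup/MPI_Wait; this user-level analogue only illustrates the ordering.)

```c
#include <mpi.h>

/* Illustrative only: serialize the two non-blocking duplications. */
static void dup_two_comms_one_at_a_time(MPI_Comm base,
                                        MPI_Comm *first, MPI_Comm *second)
{
    MPI_Request req;

    MPI_Comm_idup(base, first, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the first dup must complete...  */

    MPI_Comm_idup(base, second, &req);   /* ...before the second is started */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```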

I also fixed a very rare segfault caused by calling mca_part_persist_free_req without holding ompi_part_persist.lock.
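
The calling discipline is roughly the following (OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK are the existing Open MPI macros; the surrounding lines are only a sketch of the usage, not the actual call site):

```c
/* Sketch: the request is only freed while ompi_part_persist.lock is held. */
OPAL_THREAD_LOCK(&ompi_part_persist.lock);
mca_part_persist_free_req(req);
OPAL_THREAD_UNLOCK(&ompi_part_persist.lock);
```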

I added many error checks to mca_part_persist_progress, ensuring no new deadlock can occur when an internal function fails.
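
The checks follow the pattern sketched below (all names here are hypothetical; only the pattern is meant): when an internal call fails, the error is recorded on the request and the request is allowed to complete, so a waiter returns with an error instead of hanging forever.

```c
/* Hypothetical names; illustrates the error-check pattern only. */
#include <stdbool.h>

typedef struct {
    bool complete;
    int  error;        /* 0 on success */
} req_sketch_t;

/* May fail; stands in for the internal calls made by the progress loop. */
static int start_internal_transfer(req_sketch_t *req, int partition);

static void progress_with_checks(req_sketch_t *req, int partitions)
{
    for (int p = 0; p < partitions; p++) {
        int rc = start_internal_transfer(req, p);
        if (rc != 0) {
            req->error    = rc;     /* surface the failure to the waiter */
            req->complete = true;   /* so waiting on the request returns */
            return;
        }
    }
}
```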

For testing, I used a two-way ring exchange with a fixed number of partitions distributed among multiple OpenMP threads. This is the original use case in which I encountered all of these issues, since every thread needs to call MPI_Pready and MPI_Parrived.
Running on up to 128 cores with varying numbers of processes and threads showed no new deadlocks or communication issues.
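
Roughly, the test loop has the following shape (sizes, datatypes and ring arithmetic are illustrative, and only one of the two exchange directions is shown). MPI must be initialized with MPI_THREAD_MULTIPLE since MPI_Pready and MPI_Parrived are called from all OpenMP threads.

```c
#include <mpi.h>
#include <omp.h>

#define PARTITIONS 64
#define COUNT      1024   /* elements per partition (illustrative) */

/* One direction of the ring exchange; the real test also exchanges the
 * other way. Buffers hold PARTITIONS * COUNT doubles. */
void ring_step(MPI_Comm comm, double *sendbuf, double *recvbuf)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    MPI_Request sreq, rreq;
    MPI_Psend_init(sendbuf, PARTITIONS, COUNT, MPI_DOUBLE, right, 0,
                   comm, MPI_INFO_NULL, &sreq);
    MPI_Precv_init(recvbuf, PARTITIONS, COUNT, MPI_DOUBLE, left, 0,
                   comm, MPI_INFO_NULL, &rreq);
    MPI_Start(&sreq);
    MPI_Start(&rreq);

    #pragma omp parallel for
    for (int p = 0; p < PARTITIONS; p++) {
        /* ... fill sendbuf[p * COUNT .. (p + 1) * COUNT - 1] ... */
        MPI_Pready(p, sreq);                  /* called by many threads */

        int arrived = 0;
        while (!arrived) {
            MPI_Parrived(rreq, p, &arrived);  /* also called concurrently */
        }
    }

    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    MPI_Request_free(&sreq);
    MPI_Request_free(&rreq);
}
```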

Partition flags weren't thread safe, as they could be modified from `MPI_Pready` by one thread while the progress engine was running in another.
The new `part_ready` array prevents the progress engine from using flags which may be modified in `MPI_Pready`, and vice versa.
This fix only adds one array and doesn't add any atomic operations: performance shouldn't be impacted.
The `done_count` value was moved outside `mca_part_persist_request_t` to fully prevent the edge case where it could go above the number of partitions in the request.

Signed-off-by: Keluaa <[email protected]>
Deadlocks were observed during initialization when multiple `ompi_comm_idup` calls were in flight at once and multiple threads could call `opal_progress`.
This may be due to a request ordering mismatch or some other error: the precise cause was not identified.
Doing the `ompi_comm_idup` calls one at a time avoids the issue.

Signed-off-by: Keluaa <[email protected]>
The previous usage of `MPI_UNDEFINED` would cause greater issues downstream otherwise.
It also removes some compiler warnings with GCC 11.

Signed-off-by: Keluaa <[email protected]>