Testsuite hangs in collective operations when using MPICH #56
@devreal I suppose the MPICH version doesn't support shared windows then? Does it work when building with the CMake flag
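For reference, here is a minimal probe (not taken from the DASH sources) for checking whether an MPI library supports shared-memory windows at all; the node-level communicator split and the tiny window size are illustrative assumptions:

```c
/* Minimal probe: does this MPI library support shared-memory windows?
 * Illustrative sketch, not taken from the DASH sources. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Restrict the window to ranks on the same node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* Return error codes instead of aborting, so the result can be printed. */
    MPI_Comm_set_errhandler(node_comm, MPI_ERRORS_RETURN);

    int     *baseptr = NULL;
    MPI_Win  win;
    int ret = MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                      node_comm, &baseptr, &win);
    printf("MPI_Win_allocate_shared: %s\n",
           ret == MPI_SUCCESS ? "supported" : "failed");

    if (ret == MPI_SUCCESS)
        MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```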
Apparently the fixes in #58 resolved this. https://circleci.com/gh/dash-project/dash/211#usage-queue/containers/1 @devreal Please test again with the current state of development and let me know if you can still reproduce this.
I ran the tests with MPICH v3.2 (CI uses 3.1.6) and it succeeds when running on a single node. However, MPICH reports a faulty MPI_Bcast when running on multiple nodes, which I am investigating right now.
In the meantime, I will repair the "suspicious" TODO code segment in the locality runtime module.
Broadcasting host topology from leaders to local team, addresses #56
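A minimal sketch of that idea, assuming a per-node leader is selected by splitting the communicator with MPI_COMM_TYPE_SHARED; the buffer name host_info and its contents are illustrative and not DASH/DART identifiers:

```c
/* Sketch: broadcast host/topology data from the per-node leader to the
 * other ranks on the same node. Buffer name and contents are illustrative. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group ranks by node; rank 0 of each node communicator acts as leader. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Leader fills the buffer (here just the host name as a placeholder). */
    char host_info[MPI_MAX_PROCESSOR_NAME] = {0};
    if (node_rank == 0) {
        int len;
        MPI_Get_processor_name(host_info, &len);
    }

    /* Every rank of the local team must enter the broadcast; a rank that
     * skips it leaves the collective incomplete and the team hangs. */
    MPI_Bcast(host_info, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```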
I am running the testsuite on our Linux cluster (Laki), this time using MPICH v3.4.1. There seems to be a race condition that leads one process to wait in MPI_Win_free while others are waiting in an allgather. DDT shows the following stack trace:
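For illustration, here is a deliberately hanging sketch (not the DASH code) of the pattern the stack trace suggests: MPI_Win_free is collective over the window's communicator, so a rank that reaches it early blocks until all others do, while those others block in MPI_Allgather waiting for that rank.

```c
/* Deliberately hanging sketch (not DASH code): MPI_Win_free is collective
 * over the window's communicator, so a rank that reaches it early blocks
 * until all others do, while they in turn block in MPI_Allgather. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int    *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    int *gathered = malloc(size * sizeof(int));

    if (rank == 0) {
        /* Rank 0 skips the allgather and frees the window right away ... */
        MPI_Win_free(&win);
    } else {
        /* ... while the other ranks wait for rank 0 in the allgather:
         * with more than one rank this program never terminates. */
        MPI_Allgather(&rank, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Win_free(&win);
    }

    free(gathered);
    MPI_Finalize();
    return 0;
}
```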
I'm not sure whether the following message is related to this issue, so I am posting it here for completeness:
When filtering out this test, the remaining tests run through until I find another test that hangs. A random example is this:
The compiler used was GCC 4.9.1; PAPI was used in version 5.4.2.0.