Tests fail if running on multiple nodes #61
@devreal Also, does the test fail if you run it separately via the runtime parameter |
While implementing the Docker openmpi2 container, I discovered the same bug. It is easily reproducible. For the container, see the branch |
Okay, I'm marking this bug report as bug detail then. |
Broadcast to the local team has been implemented in #90. |
Still seeing the MPI_Bcast fail on the latest development branch using MPICH 3.2:
Stacktrace:
After debugging this a bit, it appears that unit 2 passes root=2 to dart_bcast on a node-local communicator. This is wrong because the root should be 0 on that communicator (unit numbering starts at 0 on every communicator), so a translation to team-local unit IDs has to be done. I will commit a fix soon. |
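For reference, here is a minimal sketch of the kind of root translation described above, written against plain MPI rather than DART's actual internals (the function name and parameters are illustrative assumptions, not DART API):

```c
/*
 * Hypothetical sketch: translate a team-level root rank to the
 * corresponding rank in a node-local communicator before broadcasting.
 * bcast_on_local_comm and its parameters are placeholders, not DART API.
 */
#include <mpi.h>

int bcast_on_local_comm(void *buf, int count, MPI_Datatype dtype,
                        int team_root, MPI_Comm team_comm,
                        MPI_Comm local_comm)
{
  MPI_Group team_group, local_group;
  int       local_root;

  MPI_Comm_group(team_comm,  &team_group);
  MPI_Comm_group(local_comm, &local_group);

  /* Map the root's rank in the team communicator to its rank in the
   * node-local communicator; MPI_UNDEFINED means the root is not on
   * this node. */
  MPI_Group_translate_ranks(team_group, 1, &team_root,
                            local_group, &local_root);

  MPI_Group_free(&team_group);
  MPI_Group_free(&local_group);

  if (local_root == MPI_UNDEFINED) {
    return -1;  /* root is not part of this node-local communicator */
  }
  return MPI_Bcast(buf, count, dtype, local_root, local_comm);
}
```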
@devreal Thank you, the ID translation is required, of course. I am wondering why it worked on SuperMUC. |
I thought it did but I re-checked and it turns out that it crashes with a different error :( This time I end up with a SIGSEGV. On unit 0 I see the following output:
Stack trace:
|
I made some progress, I think, but I am not done yet; gotta leave for today though. I committed some patches. Right now there seems to be some invalid copying going on; Valgrind says:
|
@devreal Thank you for crunching this; I have an idea where the invalid copy is coming from. Let me check my implementations in dart/base first, there might be conceptual issues. |
This still fails with a segfault when running on more than one node (it works on just one node); see the very end:
|
Unfortunately, our cluster is down for maintenance until Thursday. I will try as soon as I get a chance to get on the cluster again. |
Re-ran just now (MPICH v3.2). I see the following errors:
The same error also occurred on
For the last one, Valgrind reports (running on two nodes, ppn=4):
Attaching the full Valgrind log and the full test run log. |
@devreal Awesome, thank you! So it's just a few individual test cases that are affected by the same bug in the new topology module. That helps a lot! How did you configure and run the Valgrind test? It looks way better than with OpenMPI. |
The current development branch fails on the following test:
The test fails since
This increments the team counter in DART and thus the test |
@devreal Good catch! The test case is a rather recent addition by @fmoessbauer |
I will have a look at my test. Is it sufficient to just skip the test if it is executed on multiple nodes? |
@fmoessbauer Is this issue resolved by your PR #214?
Yes, the PR disables the unit test for runs on more than one node. |
Can we close this issue then? |
I am running the test suite with 4 processes on 2 nodes using MPICH v3.2 and see the following error in the Shared test:
The 16 bytes are a DART gptr that is sent to everyone in the team. I tried to find out where the 280 bytes are coming from, and it seems that they stem from
dart__base__host_topology__init
where data is sent from one leader unit to all others. Notice the following two log lines, where different leader units operating on the same team eventually end up calling a single broadcast. Looking at the code, I see the following comment, which is highly suspicious:
Maybe someone familiar with this code can comment on whether this might be causing the failing runs on multiple nodes?
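To make the suspected failure mode above concrete, here is a minimal, self-contained MPI program (not DART code; the buffer sizes are just the numbers reported above) that deliberately mismatches two broadcasts on the same communicator, which is the kind of situation that produces truncation errors like the one seen in the Shared test:

```c
/*
 * Illustration only: if units disagree about which collective comes
 * next on the same communicator, a 16-byte broadcast on some ranks is
 * matched against a 280-byte broadcast on others. Depending on the MPI
 * implementation this shows up as a message-truncation error or a hang.
 * This program contains that mismatch on purpose.
 */
#include <mpi.h>

int main(int argc, char **argv)
{
  char small[16]  = {0};  /* stands in for the 16-byte gptr broadcast  */
  char large[280] = {0};  /* stands in for the 280-byte topology data  */
  int  rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    /* Rank 0 believes the next collective is the topology exchange. */
    MPI_Bcast(large, 280, MPI_BYTE, 0, MPI_COMM_WORLD);
  } else {
    /* All other ranks expect the smaller gptr broadcast first. */
    MPI_Bcast(small, 16, MPI_BYTE, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
```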