MPI_Comm_split crashes with Portals4 #7217

Open
ptaffet opened this issue Dec 3, 2019 · 1 comment
ptaffet commented Dec 3, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

I'm using Open MPI v4.0.1 (essentially b8a8ae9 with a few unrelated changes).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I cloned from Git and built from source.
Portals4 was built from a git clone.
Open MPI was configured with: ../configure --prefix=`pwd`/../_install --with-portals4=/home/pt2/portals4/_install
Portals4 was configured with the UDP transport: --disable-transport-ib --enable-zero-mrs --enable-transport-udp --enable-reliable-udp --disable-transport-shmem
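For completeness, a sketch of the build sequence (the configure flags and the _build/_install directory layout are as above and as seen in the stack trace below; the bootstrap and make/install steps are the standard autotools flow and are shown only for illustration):

# Portals4, built from a git clone with the UDP transport
cd portals4
./autogen.sh                      # or the repo's equivalent bootstrap step
mkdir _build && cd _build
../configure --prefix=/home/pt2/portals4/_install \
    --disable-transport-ib --enable-zero-mrs \
    --enable-transport-udp --enable-reliable-udp \
    --disable-transport-shmem
make && make install

# Open MPI v4.0.1, built from a git clone against that Portals4 install
cd openmpi-4.0.1
./autogen.pl
mkdir _build && cd _build
../configure --prefix=`pwd`/../_install \
    --with-portals4=/home/pt2/portals4/_install
make && make install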

Please describe the system on which you are running

Operating system/version: Debian 8
Computer hardware: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Network type: Ethernet


Details of the problem

Calling MPI_Comm_split causes an immediate crash with the assertion:

 ../../../src/ib/ptl_ct.c:567: ct_check: Assertion `buf->type == BUF_TRIGGERED' failed.
reduce: ../../../src/ib/ptl_ct.c:567: ct_check: Assertion `buf->type == BUF_TRIGGERED' failed.
[bold-node013:14014] *** Process received signal ***
[bold-node013:14014] Signal: Aborted (6)
[bold-node013:14014] Signal code:  (-6)
[bold-node013:14014] [ 1] /home/pt2/openmpi-4.0.1/_build/../_install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fbc4ba705ac]
[bold-node013:14014] [ 2] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(+0x3105d)[0x7fbc4c80205d]
[bold-node013:14014] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fb7c5f82266]
[bold-node013:14014] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fb7c5f82312]
[bold-node013:14014] [ 5] /home/pt2/portals4/_build/../_install/lib/libportals.so.4(+0x84ab)[0x7fb7b9f854ab]
[bold-node013:14014] [ 6] /home/pt2/portals4/_build/../_install/lib/libportals.so.4(PtlCTFree+0xd7)[0x7fb7b9f84cbb]
[bold-node013:14014] [ 7] /home/pt2/openmpi-4.0.1/_build/../_install/lib/openmpi/mca_coll_portals4.so(ompi_coll_portals4_iallreduce_intra_fini+0x15b)[0x7fb7b239c9db]
[bold-node013:14014] [ 8] /home/pt2/openmpi-4.0.1/_build/../_install/lib/openmpi/mca_coll_portals4.so(+0x40c5)[0x7fb7b239d0c5]
[bold-node013:14014] [ 9] /home/pt2/openmpi-4.0.1/_build/../_install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7fb7c57bb5ac]
[bold-node013:14014] [10] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(+0x3105d)[0x7fb7c654d05d]
[bold-node013:14014] [11] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(ompi_comm_nextcid+0x29)[0x7fb7c654ebc9]
[bold-node013:14014] [12] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(ompi_comm_split+0x3ea)[0x7fb7c654aaca]
[bold-node013:14014] [13] /home/pt2/openmpi-4.0.1/_install/lib/libmpi.so.40(MPI_Comm_split+0xa8)[0x7fb7c65854d8]
[bold-node013:14014] [14] ./reduce[0x40089c]
[bold-node013:14014] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fb7c5f75b45]
[bold-node013:14014] [16] ./reduce[0x400769]
[bold-node013:14014] *** End of error message ***

Since the problem appears to be with MPI_Iallreduce, I tried running this sample program:

#include <mpi.h>
int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank is used by the MPI_Comm_split case */
#if IALLREDUCE
        MPI_Request rq;
        int send = rank, recv; /* initialize the send buffer */
        MPI_Iallreduce(&send, &recv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &rq);
        MPI_Wait(&rq, MPI_STATUS_IGNORE);
#elif ALLREDUCE
        int send = rank, recv;
        MPI_Allreduce(&send, &recv, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
#else
        MPI_Comm out;
        MPI_Comm_split(MPI_COMM_WORLD, rank/2, rank, &out);
#endif
        MPI_Finalize();
}
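Each case is selected with a preprocessor define at compile time; for example (the source file name reduce.c, the process count, and the exact mpirun invocation are just illustrative):

# default (#else) branch: MPI_Comm_split
mpicc reduce.c -o reduce
mpirun -np 4 ./reduce

# MPI_Iallreduce branch
mpicc -DIALLREDUCE=1 reduce.c -o reduce
mpirun -np 4 ./reduce

# MPI_Allreduce branch
mpicc -DALLREDUCE=1 reduce.c -o reduce
mpirun -np 4 ./reduce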

MPI_Allreduce works properly, but MPI_Iallreduce and MPI_Comm_split fail.
MPI_Iallreduce crashes with a stack trace similar to the one above.

tkordenbrock (Member) commented

@ptaffet Thanks for the bug report. This looks like a bug in the Portals4 reference implementation. I have opened an issue (sandialabs/portals4#82) over there to track it. Once fixed there, I'll confirm the fix here so this issue can be closed.
