Freeze or error using function MPI_Comm_connect #13076

Open
mixen56 opened this issue Feb 4, 2025 · 0 comments

Background information

What version of Open MPI are you using?

mpirun (Open MPI) 5.0.6

Describe how Open MPI was installed

Source release tarball.

./configure --prefix=/opt/openmpi-5.0.6 --with-pmix=internal

Please describe the system on which you are running

  • Operating system/version: Debian 12.5
  • Computer hardware: x86_64 Intel(R) Core(TM) i3-13100

Details of the problem

I have a test that works fine with Open MPI 4.x.x but does not work with Open MPI 5.0.6 (the latest version at the time of writing).
The test exercises the MPI_Comm_spawn, MPI_Comm_connect, and MPI_Comm_disconnect functions. It was copied from the MPICH repository: https://raw.githubusercontent.com/pmodels/mpich/refs/heads/main/test/mpi/spawn/disconnect_reconnect.c. Compiling it requires the test/mpi directory from the MPICH source tree.

Compile:

/opt/openmpi-5.0.6/bin/mpicc src/spawn/disconnect_reconnect.c -o disconnect_reconnect -I src/include -I /opt/openmpi-5.0.6/include -L /opt/openmpi-5.0.6/lib src/util/mtest.c

Run:

MPITEST_VERBOSE=1 /opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect    # with verbose
/opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect                      # no verbose

Output:

Either the run freezes, or it aborts with the following error:
[0] accepting connection
[0] connecting to port (loop 1)
[1] connecting to port (loop 1)
[2] connecting to port (loop 1)
[mongoose:00000] *** An error occurred in MPI_Comm_accept
[mongoose:00000] *** reported by process [1767047169,0]
[mongoose:00000] *** on communicator MPI_COMM_WORLD
[mongoose:00000] *** MPI_ERR_UNKNOWN: unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
[mongoose:00000] *** An error occurred in Socket closed
[mongoose:00000] *** reported by process [1767047170,2]
[mongoose:00000] *** on a NULL communicator
[mongoose:00000] *** Unknown error
[mongoose:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mongoose:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 1 with PID 0 on node mongoose calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
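
Since the failure shows up either as a hang or as an abort, it can help to wrap the run in a time limit so that both cases are reported the same way in an automated test. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available (as it is on Debian); `classify_run` is a hypothetical helper name and the 60-second limit in the usage line is an arbitrary assumption, not something taken from the test suite.

```shell
# classify_run LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: prints "ok", "freeze", or "error (<status>)" on stdout.
# The command's own output is redirected to stderr so that stdout carries
# only the classification.
classify_run() {
    limit="$1"
    shift
    # timeout sends TERM after $limit seconds, then KILL 5 seconds later
    # if the command is still alive.
    timeout -k 5 "$limit" "$@" 1>&2
    status=$?
    case "$status" in
        0)
            echo "ok"
            ;;
        124|137)
            # 124: the time limit expired; 137: the command had to be KILLed.
            # (137 can also mean an unrelated SIGKILL; this sketch ignores that.)
            echo "freeze"
            ;;
        *)
            echo "error ($status)"
            ;;
    esac
}

# Usage with the run command from this report (limit of 60 s is an assumption):
# classify_run 60 /opt/openmpi-5.0.6/bin/mpirun --allow-run-as-root -np 1 ./disconnect_reconnect
```

With this wrapper, a frozen run is reported as `freeze` instead of blocking indefinitely, and an abort is reported as `error` together with the exit status returned by mpirun.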

Questions

  1. Is it expected that this test fails with Open MPI 5? Isn't this a violation of the MPI standard? Would it be better to go back to ompi-server?
  2. What is the best way to make this test work? I found "Error using MPI_Comm_connect/MPI_Comm_accept" #6916 and "Comm_connect/accept fails" openpmix/prrte#398, but that solution goes beyond the scope of MPI. It would also require revising the test (prte has to be run as an additional process).