rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000 #10087

Open
jinz2014 opened this issue Aug 25, 2024 · 14 comments

@jinz2014

Describe the issue

[1724610589.249079] [cousteau:2779987:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000
[1724610589.249092] [cousteau:2779986:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fd7af610000/8000
[cousteau:2779987:0:2779987] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2779986:0:2779986] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed

Steps to Reproduce

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

Setup and versions

  • GPU: AMD MI100
  • ROCm: 6.0.2
jinz2014 added the Bug label Aug 25, 2024
@edgargabriel
Contributor

@jinz2014 this is most likely a system setup / permission issue on your side, since UCX 1.15 has been used extensively with numerous applications on MI100.

Can you please check the following things:

@jinz2014
Author

The answers are yes to both questions.
I didn't paste the output completely. The program starts to produce error messages after an initially successful execution:

Verified allreduce for size 0 (19.865 us per iteration)
Verified allreduce for size 32 (52.7884 us per iteration)
Verified allreduce for size 256 (94.3108 us per iteration)
Verified allreduce for size 1024 (73.2143 us per iteration)
Verified allreduce for size 4096 (88.3691 us per iteration)
[1724605863.595828] [cousteau:2757379:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f8d0fa10000/8000
[1724605863.595828] [cousteau:2757380:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7effaba18000/8000
[cousteau:2757380:0:2757380] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2757379:0:2757379] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
==== backtrace (tid:2757380) ====
0 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f01d67bbd84]
1 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xc2) [0x7f01d67b8dc2]
2 /home/user/ompi_for_gpu/ucx/lib/libucs.so.0(ucs_fatal_error_format+0x11a) [0x7f01d67b8eea]
3 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_progress_rma_put_zcopy+0x1b8) [0x7f01d68a8a08]
4 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_rndv_atp_handler+0x217) [0x7f01d68a9ac7]
5 /home/user/ompi_for_gpu/ucx/lib/libuct.so.0(+0x1c6ad) [0x7f01cd7916ad]
6 /home/user/ompi_for_gpu/ucx/lib/libucp.so.0(ucp_worker_progress+0x3a) [0x7f01d6859e3a]
7 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(mca_pml_ucx_send+0x1bf) [0x7f01d8bd21df]
8 /home/user/ompi_for_gpu/ompi/lib/libmpi.so.40(MPI_Send+0x183) [0x7f01d8a59b63]
9 ./main() [0x206a9f]
10 ./main() [0x205b6a]
11 ./main() [0x2060cd]
12 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f01d698ed90]
13 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f01d698ee40]
14 ./main() [0x205835]

@edgargabriel
Contributor

Could you please provide the full command line that you used? I see that the put_zcopy protocol is being used, which is not the default with 1.15; it should be the get_zcopy protocol.

@jinz2014
Author

Sorry, I'm not familiar with the two protocols.

"make run" shows the full command:

$HOME/ompi_for_gpu/ompi/bin/mpirun -n 2 ./main

Thank you for the instructions.

@edgargabriel
Contributor

So just for a test, could you change the command line to the following:

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

to see whether it makes a difference?

@jinz2014
Author

Ok.

$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.879444] [cousteau:3183448:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1724695907.897366] [cousteau:3183447:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
Verified allreduce for size 0 (20.0202 us per iteration)
Verified allreduce for size 32 (52.4041 us per iteration)
Verified allreduce for size 256 (91.5858 us per iteration)
Verified allreduce for size 1024 (67.5217 us per iteration)
Verified allreduce for size 4096 (79.6616 us per iteration)
[1724695938.071148] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000
[1724695938.071145] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc618000/8000
[1724695938.071299] [cousteau:3183448:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7faefc620000/8000
[1724695938.071304] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f820000/8000
[1724695938.071585] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f818000/8000
[1724695938.071597] [cousteau:3183447:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fa55f810000/8000

@edgargabriel
Contributor

Hm. OK, I will see whether I can reproduce the issue locally. Are there instructions on how to compile the test code in the GitHub repo?

edgargabriel self-assigned this Aug 26, 2024
@jinz2014
Author

export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR

export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install

export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install

export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH

The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip

make run

will build and run the program.

The HIP example was migrated from the CUDA example below. I didn't observe errors when running the CUDA code, so I am not sure where the issue in the HIP example lies.
https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-cuda

Thanks

@edgargabriel
Contributor

OK, but just to clarify: compiling the example is simply make run? (I am compiling UCX and Open MPI on a daily basis, that is not the challenge :-) )

@jinz2014
Author

make run
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c main.cu -o main.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c collectives.cu -o collectives.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 -c timer.cu -o timer.o
hipcc -std=c++14 -I$HOME/ompi_for_gpu/ompi/include -DOMPI_SKIP_MPICXX= -Wall -O3 main.o collectives.o timer.o -o main -L$HOME/ompi_for_gpu/ompi/lib -lmpi -DOMPI_SKIP_MPICXX=
$HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main

The original CUDA code is https://github.com/baidu-research/baidu-allreduce

@edgargabriel
Contributor

edgargabriel commented Aug 26, 2024

I can confirm that I can reproduce the issue. In my case it is an MI250X system with ROCm 6.2 and UCX 1.16 (my default development platform at the moment), but the same error occurs. I will put it on my list of items to work on, but it might be toward the end of the week before I get to it.

@jinz2014
Author

Okay.

@edgargabriel
Contributor

edgargabriel commented Sep 6, 2024

I think I know what the issue is, but I do not know yet whether it's something that we are doing wrong in the ROCm components of UCX or whether it's a bug in the ROCm runtime layer.

However, I have a quick workaround for your code (since a proper fix might take a while):

If you allocate the output buffer outside of the RingAllreduce test and pass it in as an argument to RingAllreduce, you avoid the hipMalloc() + hipFree() of the buffer on every iteration and do it just once per message size. For example, allocate it right before the for(size_t iter = 0; iter < iters; iter++) loop and perform a hipMemset(output, 0, size * sizeof(float)) in the loop body before calling RingAllreduce. With this modification, the test passes for me.

Let me emphasize, however, that your code is correct and should work.
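
For illustration, a minimal sketch of that workaround, assuming RingAllreduce is adjusted to take the pre-allocated output buffer as a parameter (the upstream version allocates the buffer internally, so its signature would need to change; the variable names here are illustrative and data, size, and iters come from the existing test loop):

// Hypothetical sketch of the workaround, not the actual benchmark code.
float* output = nullptr;
hipMalloc(&output, size * sizeof(float));        // allocate once per message size

for (size_t iter = 0; iter < iters; iter++) {
    hipMemset(output, 0, size * sizeof(float));  // reset instead of hipMalloc/hipFree each iteration
    RingAllreduce(data, size, output);           // reuse the same device buffer
}

hipFree(output);                                 // free once, after the loop

Presumably this helps because the device pointer stays the same across iterations, so the IPC handle creation that fails in rocm_ipc_md.c does not have to be repeated for a freshly allocated buffer every time.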

@jinz2014
Author

jinz2014 commented Sep 6, 2024

I added another example
https://github.com/zjin-lcf/HeCBench/blob/master/src/pingpong-hip/main.cu
Does running the example cause similar errors?

Thank you for the workaround.
