Labels
MPI, communication, bug, PR talk
Description
What happened?
MPI error when resplitting a (quite) large array.
Code snippet triggering the error
import heat as ht
X = ht.random.randn(6505535, 363, 7, 7, split=1, dtype=ht.float32)
print("data created")
X.resplit_(0)
print("done")
run with the following SLURM batch script:
#!/bin/bash
#SBATCH --clusters=MYCLUSTER
#SBATCH --partition=CPUPARTITION
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=40
#SBATCH --time=00:10:00
#SBATCH --account=MYACCOUNT
source ~/modules_heat.sh
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun python test.py
Interestingly, with 8 MPI processes per node and 20 cores per task, the problem no longer occurred.
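For context, a back-of-the-envelope estimate of the data volume involved (not part of the original report; rank counts of 32 and 64 are assumed from the two job configurations above, i.e. 8 nodes with 4 or 8 tasks per node):

# Rough size estimate for X = ht.random.randn(6505535, 363, 7, 7, dtype=ht.float32).
# Purely illustrative arithmetic; rank counts assumed from the job scripts above.
n_elements = 6505535 * 363 * 7 * 7            # ~1.16e11 float32 elements
total_gb = n_elements * 4 / 1e9               # ~463 GB overall
for n_ranks in (32, 64):                      # 4 resp. 8 tasks per node on 8 nodes
    per_rank = n_elements // n_ranks
    print(f"{n_ranks} ranks: ~{total_gb / n_ranks:.1f} GB and "
          f"{per_rank:,} elements per rank (INT32_MAX = {2**31 - 1:,})")

Note that only in the failing 32-rank configuration does the per-rank element count (~3.6e9) exceed the 32-bit integer range commonly used for MPI counts; whether this is actually related to the truncation error is speculation.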
Error message or erroneous outcome
mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
raised in:
  File "src/mpi4py/MPI.src/Comm.pyx", line 1080, in mpi4py.MPI.Comm.Alltoallw
    comm.Alltoallw(
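For readers unfamiliar with this error class: MPI_ERR_TRUNCATE is raised when a matching receive buffer is smaller than the incoming message. A minimal, self-contained mpi4py sketch showing the failure mode in isolation (unrelated to Heat's internals, purely illustrative) would be:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Run with 2 ranks: rank 0 sends 10 floats, rank 1 posts a buffer for only 5,
# so the receive fails with mpi4py.MPI.Exception: MPI_ERR_TRUNCATE.
if rank == 0:
    comm.Send([np.zeros(10, dtype=np.float32), MPI.FLOAT], dest=1, tag=0)
elif rank == 1:
    too_small = np.empty(5, dtype=np.float32)
    comm.Recv([too_small, MPI.FLOAT], source=0, tag=0)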
Version
main
Python version
Python 3.10.10 [GCC 12.2.0]
PyTorch version
2.6.0+cu126
MPI version
Open MPI 4.1.5, mpi4py 4.0.1