
[Bug]: Issue with resplit_ for a (quite) large array #1932

@mrfh92


What happened?

An MPI error is raised when resplitting a (quite) large array with `resplit_`.

Code snippet triggering the error

import heat as ht

X = ht.random.randn(6505535, 363, 7, 7, split=1, dtype=ht.float32)

print("data created")

X.resplit_(0)

print("done")

run with the following SLURM batch script:

#!/bin/bash

#SBATCH --clusters=MYCLUSTER
#SBATCH --partition=CPUPARTITION
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=40
#SBATCH --time=00:10:00
#SBATCH --account=MYACCOUNT

source ~/modules_heat.sh

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun python test.py

Interestingly, with 8 MPI processes per node and 20 cores per task, the problem no longer occurred.
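To put the two configurations side by side, here is a rough sizing sketch. It is plain arithmetic on the shape from the snippet, not a diagnosis. It assumes near-even chunking (the largest chunk along the split axis has `ceil(n / nprocs)` entries) and that the node count stayed at 8 in the second run, i.e. 32 versus 64 MPI processes. The helper name `max_local_elements` is made up for this illustration.

```python
import math

SHAPE = (6505535, 363, 7, 7)   # array shape from the snippet above
ITEMSIZE = 4                   # bytes per float32 element
INT32_MAX = 2**31 - 1          # largest value of a signed 32-bit C int


def max_local_elements(nprocs, split):
    """Upper bound on the elements one process holds when SHAPE is split along `split`.

    Assumption: near-even chunking, so the largest chunk along the split
    axis has ceil(SHAPE[split] / nprocs) entries.
    """
    chunk = math.ceil(SHAPE[split] / nprocs)
    return chunk * math.prod(SHAPE) // SHAPE[split]


# 8 nodes x 4 tasks = 32 processes; 8 nodes x 8 tasks = 64 processes
# (assumes the node count was unchanged in the second run)
for nprocs in (32, 64):
    for split in (1, 0):  # split before and after resplit_(0)
        n = max_local_elements(nprocs, split)
        print(
            f"nprocs={nprocs} split={split}: {n} elements, "
            f"{n * ITEMSIZE / 1e9:.1f} GB, exceeds int32: {n > INT32_MAX}"
        )
```

Under these assumptions, the largest local chunk is about 3.8 billion elements with 32 processes and about 1.9 billion with 64, so only the 32-process run exceeds the signed 32-bit range for both `split=1` and `split=0`. This matches the observation above, but it is only a correlation and does not by itself show where the truncation originates.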

Error message or erroneous outcome

`mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated` in `  File "src/mpi4py/MPI.src/Comm.pyx", line 1080, in mpi4py.MPI.Comm.Alltoallw
    comm.Alltoallw(`

Version

main

Python version

Python 3.10.10 [GCC 12.2.0]

PyTorch version

2.6.0+cu126

MPI version

Open MPI 4.1.5, mpi4py 4.0.1

Metadata

    Labels

    MPI (Anything related to MPI communication), PR talk, bug (Something isn't working), communication
