
[Question] Running gpt-neox on AMD-based LUMI HPC centre. #1310

Open · iPRET opened this issue Oct 23, 2024 · 0 comments
Labels: bug (Something isn't working)

iPRET commented Oct 23, 2024

Hi, I'm trying to run gpt-neox on the LUMI HPC system.
But sadly I'm getting errors that look like this:

GPU core dump failed
Memory access fault by GPU node-9 (Agent handle: 0x7d5f990) on address 0x14a1cfe01000. Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x7d5b060) on address 0x14c2c7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-11 (Agent handle: 0x810fd10) on address 0x152be7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-8 (Agent handle: 0x7d5c290) on address 0x15098be01000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x7d581a0) on address 0x153d9fe01000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0x7d5c100) on address 0x153e07e01000. Reason: Unknown.

I think the error occurs during the training step.

Mainly I have two questions:

  1. Can you give a pointer to a GitHub repo (if it's public) that managed to launch gpt-neox on LUMI?
  2. Is the following the correct process for launching on LUMI (LUMI uses Slurm and requires Singularity containers)?
  • Modify the DeepSpeed multinode runner to launch the train.py/eval.py/generate.py script inside a Singularity container.
  • Set "launcher": "slurm" and "deepspeed_slurm": true in the configuration yaml file.
  • Run sbatch on a script that contains deepy.py train.py config.yml.

Previously I had some success launching Megatron-DeepSpeed training on LUMI.
But in Megatron-DeepSpeed the Slurm task launching was under the user's control.
So I suspect I may be launching gpt-neox incorrectly.

My current approach to launching gpt-neox is:
I have a conda environment activated on the LUMI login node with these packages:

accelerate          0.31.0
annotated-types     0.7.0
apex                1.3.0
bitsandbytes        0.43.2.dev0
certifi             2022.12.7
charset-normalizer  2.1.1
contourpy           1.3.0
cupy                13.0.0b1
cycler              0.12.1
deepspeed           0.14.0
einops              0.8.0
exceptiongroup      1.2.2
fastrlock           0.8.2
filelock            3.16.0
flash_attn          2.6.3
fonttools           4.53.1
fsspec              2024.9.0
hjson               3.1.0
huggingface-hub     0.25.0
idna                3.4
iniconfig           2.0.0
Jinja2              3.1.4
kiwisolver          1.4.7
lion-pytorch        0.1.4
MarkupSafe          2.1.5
matplotlib          3.8.4
megatron-core       0.2.0
mpi4py              3.1.6
mpmath              1.3.0
networkx            3.3
ninja               1.11.1.1
numpy               1.26.4
packaging           24.1
pandas              2.2.3
pillow              10.4.0
pip                 24.2
pluggy              1.5.0
protobuf            5.27.1
psutil              6.0.0
py-cpuinfo          9.0.0
pybind11            2.13.1
pydantic            2.9.1
pydantic_core       2.23.3
pynvml              11.5.3
pyparsing           3.1.4
pytest              8.2.2
python-dateutil     2.9.0.post0
pytorch-triton-rocm 2.3.0+rocm6.2.0.1540b42334
pytz                2024.2
PyYAML              6.0.1
regex               2024.9.11
requests            2.28.1
safetensors         0.4.5
scipy               1.13.1
seaborn             0.13.2
sentencepiece       0.2.0
setuptools          72.1.0
six                 1.16.0
sympy               1.12.1
tokenizers          0.19.1
tomli               2.0.1
torch               2.3.0+rocm6.2.0
torchdata           0.7.1
torchtext           0.18.0+cpu
torchvision         0.18.0+rocm6.2.0
tqdm                4.64.1
transformers        4.41.2
typing_extensions   4.12.2
tzdata              2024.1
urllib3             1.26.13
wheel               0.43.0

I then run sbatch on this script:

#!/bin/bash

#SBATCH --account project_465001281
#SBATCH --partition dev-g
#SBATCH --exclusive=user
#SBATCH --nodes=1
#SBATCH --gpus-per-node=mi250:8
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --mem=0
#SBATCH --time=00:30:00
#SBATCH --hint=nomultithread
#SBATCH --exclude=nid005138,nid006369,nid005796,nid007382

export MEMORY_OPT_ALLREDUCE_SIZE=125000000

export CUDA_DEVICE_MAX_CONNECTIONS=1

export CC=gcc-12
export CXX=g++-12

set -euo pipefail

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=9999
export WORLD_SIZE=$SLURM_NTASKS

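# hsn0-hsn3 are the four Slingshot high-speed network interfaces on each LUMI node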
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
#export OMP_NUM_THREADS=1
#export NCCL_NET_GDR_LEVEL=PHB

module purge
module use /appl/local/training/modules/AI-20240529/
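# singularity-userfilesystems sets up bind mounts so the LUMI user filesystems (project/scratch) are visible inside Singularity containers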
module load singularity-userfilesystems

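# Build the deepy.py command; with "launcher": "slurm" and "deepspeed_slurm": true in meg_conf.yml,
# deepy.py hands the actual multi-process launch off to DeepSpeed's Slurm runner (srun)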
CMD="/project/project_465001281/IP/gpt-neox/deepy.py \
  /project/project_465001281/IP/gpt-neox/train.py
  /project/project_465001281/IP/gpt-neox/launch_scripts/meg_conf.yml \
  /project/project_465001281/IP/gpt-neox/launch_scripts/ds_conf.yml
  "

$CMD

I also modified DeepSpeed's SlurmRunner in DeepSpeed/deepspeed/launcher/multinode_runner.py to run train.py inside a Singularity container with the same packages listed above.
I set "launcher": "slurm" and "deepspeed_slurm": true in meg_conf.yml.
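
The runner change is conceptually the following (a simplified, self-contained sketch rather than the actual diff against DeepSpeed 0.14.0; the image path, srun flags and script arguments are placeholders):

# Idea behind my SlurmRunner modification: splice "singularity exec <image>"
# between the srun prefix and the Python payload, so every rank starts inside
# the container instead of the host environment.

def wrap_in_singularity(srun_cmd, python_cmd, image="/path/to/container.sif"):
    """Return the launch command with its payload wrapped in 'singularity exec <image>'."""
    return srun_cmd + ["singularity", "exec", image] + python_cmd

if __name__ == "__main__":
    # Example values only; in the real runner DeepSpeed builds these lists itself.
    srun_cmd = ["srun", "-n", "8"]
    python_cmd = ["python", "-u", "train.py"]
    print(" ".join(wrap_in_singularity(srun_cmd, python_cmd)))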

I've attached meg_conf.yml, ds_conf.yml and the full output.

Any help would be appreciated.

Thanks!
Ingus
output.txt
meg_conf.yml.txt
ds_conf.yml.txt

iPRET added the bug label Oct 23, 2024