Hi, I'm trying to run gpt-neox on LUMI HPC, but I'm sadly getting errors that look like this:
GPU core dump failed
Memory access fault by GPU node-9 (Agent handle: 0x7d5f990) on address 0x14a1cfe01000. Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x7d5b060) on address 0x14c2c7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-11 (Agent handle: 0x810fd10) on address 0x152be7e01000. Reason: Unknown.
GPU core dump failed
Memory access fault by GPU node-8 (Agent handle: 0x7d5c290) on address 0x15098be01000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x7d581a0) on address 0x153d9fe01000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0x7d5c100) on address 0x153e07e01000. Reason: Unknown.
I think the error is occurring during the training step.
Mainly I have two questions:
Can you point me to a GitHub repo (if it's public) that has managed to launch gpt-neox on LUMI?
Is this the correct process for launching on LUMI? (LUMI uses Slurm and requires Singularity containers.)
1. Modify the DeepSpeed multinode runner to launch the train.py/eval.py/generate.py script inside a Singularity container.
2. Set "launcher": "slurm" and "deepspeed_slurm": true in the configuration YAML file.
3. Run sbatch on a script that executes deepy.py train.py config.yml.
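Concretely, the job script in the last step would look roughly like this (the partition, environment name, and paths are placeholders, not my actual values):

```
#!/bin/bash
#SBATCH --job-name=neox-train
#SBATCH --partition=standard-g      # placeholder: LUMI GPU partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

# Conda environment providing gpt-neox's dependencies (placeholder name).
source activate neox-env

# With "launcher": "slurm" and "deepspeed_slurm": true in the config,
# deepy.py is expected to dispatch srun itself; the modified SlurmRunner
# then wraps the per-node processes in "singularity exec".
python ./deepy.py train.py meg_conf.yml ds_conf.yml
```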
Previously I had some success launching Megatron-DeepSpeed training on LUMI, but there the Slurm task launching was under the user's control. I suspect I may be launching gpt-neox incorrectly.
My current approach to launching gpt-neox is:
1. I have a conda environment activated on the LUMI login node with these packages:
2. I perform an sbatch on this script:
3. I modified DeepSpeed's SlurmRunner in DeepSpeed/deepspeed/launcher/multinode_runner.py to run train.py in a Singularity container with the same packages as listed above.
4. I set "launcher": "slurm" and "deepspeed_slurm": true in meg_conf.yml.
I've attached meg_conf.yml, ds_conf.yml and the full output.
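The shape of the change I made to SlurmRunner is roughly the following (a simplified sketch of the command-wrapping logic, not the exact diff; the container path is a placeholder):

```python
# Sketch of how I wrap the command DeepSpeed's SlurmRunner builds so that
# it runs inside a Singularity container. The container path is a placeholder.
CONTAINER = "/project/containers/neox.sif"

def wrap_in_singularity(cmd, container=CONTAINER):
    """Prefix a launcher command list so it runs inside a Singularity container."""
    return ["singularity", "exec", container] + list(cmd)

# Example: the training command DeepSpeed would otherwise srun directly.
train_cmd = ["python", "-u", "train.py", "--deepspeed_config", "ds_conf.yml"]
print(wrap_in_singularity(train_cmd))
```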
Any help would be appreciated.
Thanks!
Ingus
output.txt
meg_conf.yml.txt
ds_conf.yml.txt