Significant Difference between torchrun launch and accelerate launch #2262

Open · 2 of 4 tasks
SinclairCoder opened this issue Oct 21, 2024 · 0 comments
Labels
❓ question Seeking clarification or more information

Comments

@SinclairCoder

System Info

  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
  • Python version: 3.10.0
  • PyTorch version: 2.5.0+cu121
  • CUDA device(s): 8 × NVIDIA H100 80GB HBM3
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • Datasets version: 3.0.1
  • HF Hub version: 0.26.0
  • TRL version: 0.12.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

accelerate launch

accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml  examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test
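
The contents of multi_gpu.yaml are not shown above; they matter because accelerate launch applies every setting in that file (distributed type, number of processes, mixed precision, ...) while a bare torchrun launch applies none of them. A quick way to see what is being injected, assuming the paths above are correct:

# Show the settings accelerate launch will apply (values are repo-dependent;
# TRL's multi_gpu.yaml typically sets distributed_type, num_processes, and
# mixed_precision, but the exact contents are an assumption here).
cat examples/accelerate_configs/multi_gpu.yaml

# Print the Accelerate version and the default config currently in effect.
accelerate env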

torchrun launch

torchrun --nproc_per_node=8  examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test
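
For a like-for-like comparison, the mixed-precision setting that accelerate launch injects can be exported before the torchrun launch; Accelerate reads it from the ACCELERATE_MIXED_PRECISION environment variable. A minimal sketch, assuming multi_gpu.yaml sets bf16 (substitute the real value from the file):

# Replicate the launcher-injected mixed-precision setting under torchrun.
# 'bf16' is an assumption about what multi_gpu.yaml contains.
export ACCELERATE_MIXED_PRECISION=bf16
torchrun --nproc_per_node=8 examples/scripts/sft.py ...  # same arguments as above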

Expected behavior

I noticed a significant difference in the training logs between a torchrun launch and an accelerate launch.

accelerate:

{'loss': 17095.9656, 'grad_norm': nan, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.997287040707416e-06, 'epoch': 0.34}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.984013249524775e-06, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.959739037032001e-06, 'epoch': 0.45}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.92457190127261e-06, 'epoch': 0.51}

torchrun:

{'loss': 17.4955, 'grad_norm': 39.65221405029297, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 17.304, 'grad_norm': 12.62601375579834, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 16.8077, 'grad_norm': 8.29493236541748, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 16.5808, 'grad_norm': 5.289371490478516, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 16.3465, 'grad_norm': 4.720265865325928, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}

With accelerate launch, the loss explodes on the first logging step and then collapses to 0.0 with a nan grad norm, while the torchrun run trains normally. I would like to know what's wrong with the accelerate launch. Thanks!
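
One way to pin down the discrepancy is to dump the ACCELERATE_* environment variables each launcher injects into its worker processes; accelerate launch exports its config this way, while torchrun exports none of them. A small sketch (probe.py is a hypothetical helper, not part of the repo):

# Hypothetical probe: print the launcher-injected Accelerate settings.
cat > probe.py <<'EOF'
import os
# accelerate launch exports its config as ACCELERATE_* variables;
# under torchrun this list should come back empty.
print(sorted((k, v) for k, v in os.environ.items() if k.startswith("ACCELERATE_")))
EOF
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml probe.py
torchrun --nproc_per_node=8 probe.py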

@qgallouedec added the ❓ question (Seeking clarification or more information) label on Oct 22, 2024