Significant Difference between torchrun launch and accelerate launch #2262

Open · 2 of 4 tasks
SinclairCoder opened this issue Oct 21, 2024 · 0 comments
Labels
❓ question Seeking clarification or more information

Comments

@SinclairCoder

System Info

  • Platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
  • Python version: 3.10.0
  • PyTorch version: 2.5.0+cu121
  • CUDA device(s): 8 × NVIDIA H100 80GB HBM3
  • Transformers version: 4.46.0.dev0
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • Datasets version: 3.0.1
  • HF Hub version: 0.26.0
  • TRL version: 0.12.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: not installed
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: not installed
  • PEFT version: not installed

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

accelerate launch

accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml  examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test
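
The contents of multi_gpu.yaml are not shown above; they matter because accelerate launch applies every setting in that file (distributed type, number of processes, mixed precision, ...) while a bare torchrun launch applies none of them. A quick way to see what is being injected, assuming the paths above are correct:

# Show the settings accelerate launch will apply (values are repo-dependent;
# TRL's multi_gpu.yaml typically sets distributed_type, num_processes, and
# mixed_precision, but the exact contents are an assumption here).
cat examples/accelerate_configs/multi_gpu.yaml

# Print the Accelerate version and the default config currently in effect.
accelerate env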

torchrun launch

torchrun --nproc_per_node=8  examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test
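
For a like-for-like comparison, the mixed-precision setting that accelerate launch injects can be exported before the torchrun launch; Accelerate reads it from the ACCELERATE_MIXED_PRECISION environment variable. A minimal sketch, assuming multi_gpu.yaml sets bf16 (substitute the real value from the file):

# Replicate the launcher-injected mixed-precision setting under torchrun.
# 'bf16' is an assumption about what multi_gpu.yaml contains.
export ACCELERATE_MIXED_PRECISION=bf16
torchrun --nproc_per_node=8 examples/scripts/sft.py ...  # same arguments as above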

Expected behavior

I noticed a significant difference in the training logs between a torchrun launch and an accelerate launch.

accelerate:

{'loss': 17095.9656, 'grad_norm': nan, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.997287040707416e-06, 'epoch': 0.34}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.984013249524775e-06, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.959739037032001e-06, 'epoch': 0.45}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.92457190127261e-06, 'epoch': 0.51}

torchrun:

{'loss': 17.4955, 'grad_norm': 39.65221405029297, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 17.304, 'grad_norm': 12.62601375579834, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 16.8077, 'grad_norm': 8.29493236541748, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 16.5808, 'grad_norm': 5.289371490478516, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 16.3465, 'grad_norm': 4.720265865325928, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}

With accelerate launch, the loss explodes on the first logging step and then collapses to 0.0 with a nan grad norm, while the torchrun run trains normally. I would like to know what's wrong with the accelerate launch. Thanks!
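
One way to pin down the discrepancy is to dump the ACCELERATE_* environment variables each launcher injects into its worker processes; accelerate launch exports its config this way, while torchrun exports none of them. A small sketch (probe.py is a hypothetical helper, not part of the repo):

# Hypothetical probe: print the launcher-injected Accelerate settings.
cat > probe.py <<'EOF'
import os
# accelerate launch exports its config as ACCELERATE_* variables;
# under torchrun this list should come back empty.
print(sorted((k, v) for k, v in os.environ.items() if k.startswith("ACCELERATE_")))
EOF
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml probe.py
torchrun --nproc_per_node=8 probe.py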

@qgallouedec added the ❓ question (Seeking clarification or more information) label on Oct 22, 2024