System Info
Information
Tasks
examples folder

Reproduction
accelerate launch:

accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test

torchrun launch:

torchrun --nproc_per_node=8 examples/scripts/sft.py \
    --model_name_or_path ${BASE_MODEL_PATH} \
    --dataset_name test \
    --max_seq_length 2048 \
    --dataset_num_proc 8 \
    --torch_dtype auto \
    --output_dir ${CKPT_DIR}/${BASE_NAME}_${DATA_NAME}/${LR}_${BS} \
    --overwrite_output_dir True \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --logging_steps 10 \
    --save_strategy steps \
    --save_steps 100 \
    --report_to wandb \
    --run_name test
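One difference worth calling out between the two launch paths (an assumption on my side, since the contents of multi_gpu.yaml are not shown above): accelerate launch applies every setting in the config file, including any mixed_precision choice, whereas torchrun runs the script with only its own defaults. A config along these lines is typical:

```yaml
# Hypothetical contents for illustration only; the actual
# examples/accelerate_configs/multi_gpu.yaml in the repo may differ.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16  # fp16 here is a common source of nan grad norms
num_machines: 1
num_processes: 8
```

The configuration actually in effect on the machine can be printed with `accelerate env`.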
Expected behavior
I noticed a significant difference in the training logs between torchrun launch and accelerate launch.
accelerate:
{'loss': 17095.9656, 'grad_norm': nan, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.997287040707416e-06, 'epoch': 0.34}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.984013249524775e-06, 'epoch': 0.4}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.959739037032001e-06, 'epoch': 0.45}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.92457190127261e-06, 'epoch': 0.51}
torchrun:
{'loss': 17.4955, 'grad_norm': 39.65221405029297, 'learning_rate': 9.433962264150944e-07, 'epoch': 0.06}
{'loss': 17.304, 'grad_norm': 12.62601375579834, 'learning_rate': 1.8867924528301889e-06, 'epoch': 0.11}
{'loss': 16.8077, 'grad_norm': 8.29493236541748, 'learning_rate': 2.830188679245283e-06, 'epoch': 0.17}
{'loss': 16.5808, 'grad_norm': 5.289371490478516, 'learning_rate': 3.7735849056603777e-06, 'epoch': 0.23}
{'loss': 16.3465, 'grad_norm': 4.720265865325928, 'learning_rate': 4.716981132075472e-06, 'epoch': 0.28}
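For what it's worth, the point where the two runs diverge can be spotted mechanically. A small sketch (my own helper, nothing from TRL or accelerate) that scans logged entries for a nan grad_norm or a loss collapsed to zero:

```python
import math

def first_bad_step(log_lines):
    """Return the index of the first log entry whose grad_norm is nan
    or whose loss collapsed to 0.0, else None."""
    for i, line in enumerate(log_lines):
        # Trainer logs print dict reprs; 'nan' is not a Python literal,
        # so supply it explicitly when evaluating the line.
        entry = eval(line, {"__builtins__": {}, "nan": float("nan")})
        if math.isnan(entry.get("grad_norm", 0.0)) or entry.get("loss") == 0.0:
            return i
    return None

# Abbreviated copies of the two logs above.
accelerate_logs = [
    "{'loss': 17095.9656, 'grad_norm': nan, 'learning_rate': 9.43e-07, 'epoch': 0.06}",
    "{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.89e-06, 'epoch': 0.11}",
]
torchrun_logs = [
    "{'loss': 17.4955, 'grad_norm': 39.652, 'learning_rate': 9.43e-07, 'epoch': 0.06}",
    "{'loss': 17.304, 'grad_norm': 12.626, 'learning_rate': 1.89e-06, 'epoch': 0.11}",
]

print(first_bad_step(accelerate_logs))  # 0: broken from the very first logged step
print(first_bad_step(torchrun_logs))    # None: healthy run
```

So the accelerate run is already numerically broken at the first logged step, which points at launch-time configuration rather than a mid-training instability.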
I would like to know what's wrong with accelerate launch. Thanks!