How to resume fine-tuning from the last checkpoint #353
I ran into a similar situation: after adding resume_from_checkpoint and running, I saw a printed line reading
finetune_clm_lora.py[627:653] |
How do I resume training from the last checkpoint? My last checkpoint is checkpoint-585, but even after adding the --resume_from_checkpoint ${output_model}/checkpoint-585 argument, training still starts from scratch (it starts from scratch with or without the flag) 😵.
Here is my script:
```shell
deepspeed --include localhost:0,1,2,3 finetune_clm_lora.py \
    --model_name_or_path /HOME/ \
    --train_files /HOME/ \
    --validation_files /HOME/ \
    --output_dir /HOME/ \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --do_train \
    --do_eval \
    --use_fast_tokenizer false \
    --output_dir ${output_model} \
    --evaluation_strategy steps \
    --max_eval_samples 800 \
    --learning_rate 2.0e-4 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 5 \
    --warmup_steps 0 \
    --load_in_bits 4 \
    --lora_r 8 \
    --lora_alpha 32 \
    --target_modules q_proj,k_proj,v_proj,o_proj,down_proj,gate_proj,up_proj \
    --logging_dir ${output_model}/logs \
    --logging_strategy steps \
    --logging_steps 500 \
    --save_strategy epoch \
    --preprocessing_num_workers 16 \
    --save_steps 500 \
    --eval_steps 500 \
    --save_total_limit 2000 \
    --seed 42 \
    --disable_tqdm false \
    --ddp_find_unused_parameters false \
    --block_size 2048 \
    --report_to tensorboard \
    --overwrite_output_dir \
    --deepspeed ds_config_zero2.json \
    --ignore_data_skip true \
    --fp16 \
    --gradient_checkpointing \
    --fp16_full_eval \
    --ddp_timeout 18000000 \
    --resume_from_checkpoint ${output_model}/checkpoint-585
```
Here is the log after adding resume_from_checkpoint:
```text
[INFO|trainer.py:1969] 2024-07-22 09:33:14,086 >> ***** Running training *****
[INFO|trainer.py:1970] 2024-07-22 09:33:14,086 >> Num examples = 100,000
[INFO|trainer.py:1971] 2024-07-22 09:33:14,086 >> Num Epochs = 3
[INFO|trainer.py:1972] 2024-07-22 09:33:14,086 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1975] 2024-07-22 09:33:14,086 >> Total train batch size (w. parallel, distributed & accumulation) = 512
[INFO|trainer.py:1976] 2024-07-22 09:33:14,086 >> Gradient Accumulation steps = 8
[INFO|trainer.py:1977] 2024-07-22 09:33:14,086 >> Total optimization steps = 585
[INFO|trainer.py:1978] 2024-07-22 09:33:14,090 >> Number of trainable parameters = 20,971,520
  0%|          | 0/585 [00:00<?, ?it/s]
```
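For background on the symptom above: the Hugging Face Trainer only resumes when the parsed `resume_from_checkpoint` value (or an automatically detected last checkpoint) is actually forwarded into `trainer.train(resume_from_checkpoint=...)`; if the script calls `trainer.train()` with no argument, the CLI flag is silently ignored and training starts from step 0 either way. Below is a minimal stdlib sketch of the checkpoint-discovery step, using a hypothetical helper `find_last_checkpoint` that mirrors what `transformers.trainer_utils.get_last_checkpoint` does (scan `output_dir` for `checkpoint-<step>` folders and return the one with the highest step):

```python
import os
import re

def find_last_checkpoint(output_dir: str):
    """Return the path of the checkpoint-<step> subfolder with the
    highest step number, or None if no checkpoint folder exists."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # max() compares the (step, name) tuples by step number first
    return os.path.join(output_dir, max(candidates)[1])
```

Two things may be worth checking in `finetune_clm_lora.py`: whether it passes the resolved checkpoint path into `trainer.train(resume_from_checkpoint=...)` at all, and whether `--overwrite_output_dir` is interfering (in the reference transformers example scripts, that flag skips the automatic last-checkpoint detection, though an explicitly supplied `--resume_from_checkpoint` path should still be honored if the script forwards it).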