Reminder
System Info
llamafactory version: 0.9.0

Reproduction
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft3.yaml
llama3_lora_sft3.yaml:
### model
model_name_or_path: meta-llama/Meta-Llama-3-70B-Instruct
resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800
### method
stage: sft
do_train: true
finetuning_type: full
flash_attn: fa2
### dataset
dataset: sft_training_data_2w
template: llama3
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 4
### output
output_dir: saves/taught/sft/3-70b-2w-1
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: false
max_length: 8192
### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_steps: 100
bf16: true
ddp_timeout: 180000000
seed: 2
### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
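For context on the resume_from_checkpoint field above: it is the standard transformers Trainer resume mechanism, which restores optimizer and scheduler state and the step counters in addition to the model weights. Below is a minimal sketch of the equivalent plain-Trainer call; model, train_dataset, and the TrainingArguments are placeholders, not LLaMA-Factory's internal code.

```python
from transformers import Trainer, TrainingArguments

# Placeholder objects: a real run would build the model and dataset first.
args = TrainingArguments(output_dir="saves/taught/sft/3-70b-2w-1")
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Resuming reloads the model weights plus optimizer/scheduler state and the
# RNG/step counters saved under the checkpoint folder, then continues training.
trainer.train(resume_from_checkpoint="/sft/3-70b-2w-1/checkpoint-800")
```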
FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

The process was then OOM-killed.
Expected behavior
I expect training to resume from the checkpoint. The only change on top of the original configuration is adding resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800.
With that line, the run keeps failing with OOM and training never starts. Without resume_from_checkpoint: /sft/3-70b-2w-1/checkpoint-800, training from scratch does not hit the OOM error (although it may still OOM after running for a while). The situation is essentially the same as https://github.com/hiyouga/LLaMA-Factory/issues/5771.
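A minimal diagnostic sketch, assuming the checkpoint directory is readable on the training node: listing the file sizes under checkpoint-800 shows how much data the resume path has to load on top of the model itself (for full-parameter fine-tuning, the saved optimizer state is typically several times larger than the bf16 weights).

```python
import os

# List every file under the checkpoint directory with its size, to estimate
# how much data has to be read (and held in memory) when resuming.
def checkpoint_sizes(ckpt_dir: str) -> None:
    total = 0
    for root, _dirs, files in os.walk(ckpt_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            print(f"{size / 2**30:8.2f} GiB  {path}")
    print(f"{total / 2**30:8.2f} GiB  total")

checkpoint_sizes("/sft/3-70b-2w-1/checkpoint-800")
```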
Others
No response