Reminder
System Info
Hi author,
I am using a two-stage SFT -> DPO training pipeline, with LoRA in each stage. The commands are as follows:
sft:
FORCE_TORCHRUN=1 NNODES=$WORLD_SIZE NODE_RANK=$RANK MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT TORCH_USE_CUDA_DSA=1 CUDA_LAUNCH_BLOCKING=1 WANDB_MODE=disabled llamafactory-cli train \
  --model_name_or_path Qwen2-VL-7B-Instruct \
  --stage sft \
  --do_train \
  --finetuning_type lora \
  --deepspeed examples/deepspeed/ds_z3_config.json \
  --dataset $dataset \
  --template qwen2_vl \
  --cutoff_len 1000000 \
  --max_samples 100000000 \
  --preprocessing_num_workers 128 \
  --output_dir $output_dir \
  --logging_steps 10 \
  --save_steps 50 \
  --plot_loss \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --learning_rate 1e-4 \
  --num_train_epochs 3.0 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.1 \
  --bf16 \
  --ddp_timeout 180000000 \
  --val_size 0.05 \
  --per_device_eval_batch_size 1 \
  --eval_strategy steps \
  --eval_steps 10000 \
  --video_maxlen 768 \
  --overwrite_output_dir \
  --overwrite_cache True
dpo:
WANDB_MODE=disabled llamafactory-cli train \
  --model_name_or_path ./LLaMA-Factory/saves/qwen2_vl-7b/mix_sft_72b_bs64/1e-4/full_model \
  --stage dpo \
  --do_train true \
  --finetuning_type lora \
  --lora_target all \
  --pref_beta 0.1 \
  --pref_loss sigmoid \
  --deepspeed examples/deepspeed/ds_z3_config.json \
  --dataset $dataset \
  --template qwen2_vl \
  --cutoff_len 10000000 \
  --max_samples 100000 \
  --preprocessing_num_workers 32 \
  --output_dir $output_dir \
  --logging_steps 10 \
  --save_steps 20 \
  --plot_loss \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --learning_rate 5e-6 \
  --num_train_epochs 10 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.1 \
  --bf16 \
  --ddp_timeout 180000000 \
  --val_size 0.0001 \
  --per_device_eval_batch_size 1 \
  --eval_strategy steps \
  --eval_steps 500 \
  --video_maxlen 128 \
  --overwrite_output_dir \
  --overwrite_cache True
Here, ./LLaMA-Factory/saves/qwen2_vl-7b/mix_sft_72b_bs64/1e-4/full_model holds the model with the SFT LoRA already merged in.
However, when I printed the model before and after DPO training, I found that the two adapter_config.json files are completely identical, and the network structure passed into the trainer in workflow.py is exactly the same, as follows:
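For reference, the merge step that produces such a full_model is typically done with `llamafactory-cli export` driven by a YAML config. The sketch below is an assumption about your setup: the adapter path and export dir are placeholders, not paths from this issue.

```
# merge_sft_lora.yaml (placeholder paths)
model_name_or_path: Qwen2-VL-7B-Instruct
adapter_name_or_path: saves/qwen2_vl-7b/sft_lora   # SFT LoRA output_dir (placeholder)
template: qwen2_vl
finetuning_type: lora
export_dir: saves/qwen2_vl-7b/full_model
```

Then run `llamafactory-cli export merge_sft_lora.yaml` to write the merged full-weight model to `export_dir`.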
PeftModelForCausalLM(
(base_model): LoraModel(
(model): Qwen2VLForConditionalGeneration(
(visual): Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionSdpaAttention(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=3584, bias=True)
)
)
)
(model): Qwen2VLModel(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLSdpaAttention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=3584, bias=True)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=3584, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=512, bias=True)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=512, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=512, bias=True)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=512, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=3584, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=3584, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=18944, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=18944, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(up_proj): lora.Linear(
(base_layer): Linear(in_features=3584, out_features=18944, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=3584, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=18944, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(down_proj): lora.Linear(
(base_layer): Linear(in_features=18944, out_features=3584, bias=False)
(lora_dropout): ModuleDict(
(default): Identity()
)
(lora_A): ModuleDict(
(default): Linear(in_features=18944, out_features=8, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=8, out_features=3584, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((0,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
)
)
Right now it looks as if the DPO LoRA is merely initialized from the SFT LoRA, rather than a new LoRA being stacked and merged on top of it. If I want DPO training to produce a new LoRA that gets merged after training, what is the correct way to do this?
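For context on what "merging" a second adapter would mean mathematically: since your DPO run starts from the SFT-merged weights, its adapter is a fresh low-rank update on top of them, and merging it just folds that update into the base matrix (this is what PEFT's `merge_and_unload()` does). A minimal standalone sketch of that arithmetic, with toy dimensions, not LLaMA-Factory code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 8, 16          # hidden size, LoRA rank, LoRA alpha (toy values)
scaling = alpha / r

W = rng.normal(size=(d, d))      # base weight (e.g. after the SFT merge)
A = rng.normal(size=(r, d))      # lora_A of the DPO adapter
B = rng.normal(size=(d, r))      # lora_B of the DPO adapter

x = rng.normal(size=(d,))

# Forward pass with the adapter attached: W x + scaling * B (A x)
y_adapter = W @ x + scaling * (B @ (A @ x))

# Merging folds the update into the weight: W' = W + scaling * B A
W_merged = W + scaling * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)
```

So the usual two-stage recipe is: merge the SFT LoRA into the base (as you did), train DPO with a fresh LoRA, then run the export/merge step a second time with `model_name_or_path` pointing at the SFT-merged model and `adapter_name_or_path` at the DPO output_dir.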
Reproduction
Expected behavior
No response
Others
No response