grad_norm explodes during stage-two training (multi-ability cold-start stage) #40

@SymbolZH

# Make sure you are in directory ./deepanalyze/ms-swift/
swift sft \
    --model "${MODEL_SINGLE_ABILITY_PATH}" \
    --train_type "lora" \
    --lora_rank 32 \
    --lora_alpha 64 \
    --dataset \
        "${DATA_DIR}/interation/data_pipeline_3601.json#10" \
        "${DATA_DIR}/interation/data_preparation_3311.json#10" \
        "${DATA_DIR}/interation/data_cleaning_1616.json#10" \
        "${DATA_DIR}/interation/data_analysis_3936.json#10" \
        "${DATA_DIR}/interation/data_insight_1062.json#10" \
        "${DATA_DIR}/interation/research_database_818.json#10" \
        "${DATA_DIR}/interation/research_xlsx_848.json#10" \
        "${DATA_DIR}/interation/research_other_3505.json#10" \
        "${DATA_DIR}/interation/research_data_preparation_488.json#10" \
        "${DATA_DIR}/interation/research_data_analysis_1339.json#10" \
        "${DATA_DIR}/interation/research_data_insight_1351.json#10" \
        "${DATA_DIR}/interation/research_report_generation_4327.json#10" \
    --torch_dtype "bfloat16" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 32 \
    --packing true \
    --eval_steps 1 \
    --save_steps 5 \
    --logging_steps 1 \
    --max_length 32768 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --save_total_limit 1 \
    --response_prefix "" \
    --save_only_model false \
    --output_dir "${MODEL_MULTI_ABILITY_PATH}" \
    --deepspeed "zero3_offload" \
    --use_liger_kernel true \
    --attn_impl "flash_attn" \
    --model_type "deepseek_r1_distill"

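As the title says: did the author run into anything similar during stage-two training? grad_norm is extremely large as soon as training starts.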

{'loss': 0.85817486, 'grad_norm': 9764998217728.0, 'learning_rate': 1e-05, 'memory(GiB)': 18.41, 'train_speed(iter/s)': 0.014675, 'epoch': 0.17, 'global_step/max_steps': '1/18', 'percentage': '5.56%', 'elapsed_time': '58s', 'remaining_time': '16m 41s'}
Train:   6%|██▍                                        | 1/18 [00:5
Train:  11%|██▋                     | 2/18 [01:40<13:03, 48.96s/it]
{'loss': 0.80747825, 'grad_norm': 1.0, 'learning_rate': 9.91e-06, 'memory(GiB)': 20.31, 'train_speed(iter/s)': 0.01816, 'epoch': 0.33, 'global_step/max_steps': '2/18', 'percentage': '11.11%', 'elapsed_time': '1m 40s', 'remaining_time': '13m 27s'}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 9.66e-06, 'memory(GiB)': 20.31, 'train_speed(iter/s)': 0.019781, 'epoch': 0.5, 'global_step/max_steps': '3/18', 'percentage': '16.67%', 'elapsed_time': '2m 22s', 'remaining_time': '11m 52s'}
{'loss': 0.0, 'grad_norm': 1.0, 'learning_rate': 9.25e-06, 'memory(GiB)': 20.31, 'train_speed(iter/s)': 0.020808, 'epoch': 0.67, 'global_step/max_steps': '4/18', 'percentage': '22.22%', 'elapsed_time': '3m 3s', 'remaining_time': '10m 40s'}
Train:  22%|█████▎                  | 4/18 [03:03<10:10, 43.59s/it]
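
To make the pattern in the log above easier to spot in longer runs, here is a minimal sketch that pulls the `{...}` records out of raw training output and flags steps that look suspicious. It is not part of ms-swift; the function names and the `GRAD_NORM_LIMIT` threshold are hypothetical, and it assumes each record is printed as a Python dict literal with `loss`, `grad_norm`, and `global_step/max_steps` keys, as in the output above.

```python
import ast
import math
import re

# Hypothetical threshold: what counts as "too large" depends on the run.
GRAD_NORM_LIMIT = 1e3


def parse_log_records(text):
    """Yield every dict literal with a 'loss' key found in raw log text."""
    for match in re.finditer(r"\{[^{}]*\}", text):
        # Bare nan/inf are not valid literals, so quote them before parsing.
        literal = re.sub(r"-?\b(?:nan|inf)\b", r"'\g<0>'", match.group(0))
        try:
            record = ast.literal_eval(literal)
        except (ValueError, SyntaxError):
            continue  # not a parseable record, e.g. progress-bar debris
        if isinstance(record, dict) and "loss" in record:
            yield record


def flag_suspicious(records, limit=GRAD_NORM_LIMIT):
    """Return (step, reasons) pairs for records that look unhealthy."""
    flagged = []
    for record in records:
        step = record.get("global_step/max_steps", "?")
        loss = float(record["loss"])
        reasons = []
        if "grad_norm" in record:
            grad_norm = float(record["grad_norm"])
            if not math.isfinite(grad_norm) or grad_norm > limit:
                reasons.append(f"grad_norm={grad_norm}")
        if not math.isfinite(loss) or loss == 0.0:
            reasons.append(f"loss={loss}")
        if reasons:
            flagged.append((step, reasons))
    return flagged


if __name__ == "__main__":
    # Two records abbreviated from the log in this issue.
    sample = (
        "{'loss': 0.85817486, 'grad_norm': 9764998217728.0, "
        "'global_step/max_steps': '1/18'}\n"
        "{'loss': 0.0, 'grad_norm': 1.0, 'global_step/max_steps': '3/18'}\n"
    )
    for step, reasons in flag_suspicious(parse_log_records(sample)):
        print(step, reasons)
```

On the log above, this would flag step 1/18 for its grad_norm and steps 3/18 and 4/18 for a loss of exactly 0.0. The check only reports symptoms and does not identify the cause.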
