33b Model on 4xA100 (40GB) OOM #666
Asked by AlexanderZhk in Q&A
I'm trying to LoRA fine-tune a 33b model on 4xA100 (40GB) and getting OOM errors. I'm using fp16. From my understanding, this hardware should be enough for the task; am I missing something? Training config:
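As a rough sanity check on the memory budget (a back-of-envelope estimate, counting weights only), fp16 weights for a 33b model alone take about 66 GB, already more than a single 40 GB card can hold unsharded, and gradients, optimizer states, and activations come on top of that:

```python
# Back-of-envelope weight memory for a 33B-parameter model.
# Weights only: activations, gradients, optimizer states, and framework
# overhead all come on top of this.
params = 33e9

fp16_gb = params * 2 / 1e9    # ~66 GB total: more than one 40 GB A100
int4_gb = params * 0.5 / 1e9  # ~16.5 GB total: fits comfortably on one GPU

print(f"fp16 weights: ~{fp16_gb:.0f} GB, int4 weights: ~{int4_gb:.1f} GB")
```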
Answered by psinger (Apr 9, 2024):
Did you try int4 with LoRA and without deepspeed?
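For anyone landing here, a minimal sketch of what int4 + LoRA (QLoRA-style) looks like with the Hugging Face transformers/peft/bitsandbytes stack, standing in for whatever the training framework does internally; the checkpoint name and LoRA hyperparameters below are placeholder assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "huggyllama/llama-30b"  # placeholder 33b-class checkpoint

# Quantize the frozen base weights to 4-bit; compute still runs in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # shard across the 4 GPUs without deepspeed
)
model = prepare_model_for_kbit_training(model)

# Only the small LoRA adapter matrices are trained
lora_config = LoraConfig(
    r=16,                                 # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder; depends on architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```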
From the follow-up replies:
I have never seen any real performance degradation from doing LoRA in int4. The final weights will be merged back, and you can put the model into production in any precision.
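To illustrate the merging step: with peft, the LoRA deltas can be folded back into the base weights via `merge_and_unload()`, after which the merged model can be saved at whatever precision you deploy with. A minimal sketch, assuming a peft-trained adapter (the checkpoint name and paths are hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model at deployment precision (fp16 here)
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-30b",  # hypothetical base checkpoint
    torch_dtype=torch.float16,
)

# Attach the trained adapter and fold its deltas into the base weights
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # hypothetical path
merged = model.merge_and_unload()

merged.save_pretrained("merged-33b-fp16")  # hypothetical output dir
```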
Deepspeed has issues with generation inference, so I would recommend switching the metric to Perplexity, which does raw logit evaluation. This should speed up validation significantly.
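For reference, perplexity in this sense is just the exponential of the mean token-level cross-entropy computed from the raw logits, with no text generation involved. A minimal sketch, not the project's actual implementation; the `-100` label mask follows the common Hugging Face convention and is an assumption here:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Perplexity from raw logits: exp of the mean token-level cross-entropy.

    logits: (batch, seq_len, vocab_size), labels: (batch, seq_len)
    """
    # Shift so that each position predicts the next token
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked/padding positions
    )
    return torch.exp(loss)
```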