When I use PEFT to fine-tune Llama 2, GPU memory keeps growing #2141
Comments
There can be a few reasons for that. For instance, hidden states can grow quite fast with sequence length, so if you have batches with very uneven length, it's not surprising to see memory fluctuate. We can't rule out some form of memory leak but that would normally manifest as a slow and steady growth in memory.
This is also hard to diagnose. How do you run the model? I assume DeepSpeed or FSDP; how are they configured? One thing you could try to help with diagnosing is to remove PEFT from the equation, i.e. do full fine-tuning. Perhaps you can fit that into memory by reducing the batch size, and then check if the same issues occur.
GPU memory grows by about 4-6 GB at a time.
I just run python demo.py. I've tried using torchrun and accelerate, but they conflict with device_map='auto'.
As mentioned, it's really hard to say what the reason is, but I don't think it's a memory leak. Can you maybe check if your training data (after all processing steps) has very unequal sequence lengths?
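A quick way to check is to tokenize the processed data and look at the length spread. A minimal sketch, assuming a Hugging Face dataset; the dataset and checkpoint names below are placeholders, not taken from this issue:

```python
# Minimal sketch: inspect the sequence-length spread of a processed dataset.
# The dataset and checkpoint names are placeholders, not from this issue.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
dataset = load_dataset("imdb", split="train[:1000]")  # stand-in for the real data

lengths = [len(tokenizer(ex["text"])["input_ids"]) for ex in dataset]
print(f"min={min(lengths)}  max={max(lengths)}  "
      f"mean={np.mean(lengths):.0f}  p95={np.percentile(lengths, 95):.0f}")
# A big gap between p95 and max means occasional very long batches,
# which can cause exactly this kind of sudden memory growth.
```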
Okay, so what this means is that you're not using model parallel training (which would require DeepSpeed or FSDP, which can be used via accelerate). Instead, by default, device_map='auto' splits the model's layers across the available GPUs and runs them one after the other (naive model parallelism), which is also why only one GPU is busy at any given moment.
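For reference, this is roughly what that loading path looks like. A minimal sketch; the 7B checkpoint is an assumption, since the issue does not say which model size was used:

```python
# Minimal sketch: device_map="auto" shards layers across GPUs at load time.
# The checkpoint name is an assumption; the issue does not state the model size.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate assigns each layer to a device
)
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ...}
# Layers run sequentially across devices, so only one GPU is active at a time.
# Launchers like torchrun spawn one process per GPU, which clashes with this
# single-process layer placement.
```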
I commented out device_map='auto' and used accelerate launch --config_file "deepspeed_config.yaml" demo.py, then got this error:
my deepspeed_config.yaml:
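(For orientation, a minimal sketch of what an accelerate DeepSpeed config of this shape typically contains; every value below is an illustrative assumption, not the actual settings from this issue.)

```yaml
# Illustrative sketch only -- not the actual file from this issue.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4  # must match the value used in demo.py
  offload_optimizer_device: none
mixed_precision: fp16
num_processes: 2
```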
I found the reason: the gradient_accumulation_steps values in deepspeed_config.yaml and demo.py were not the same.
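In other words, the accumulation setting on the training side has to agree with the one in the accelerate/DeepSpeed config. A minimal sketch of the training side; the argument values are illustrative:

```python
# Minimal sketch: keep gradient_accumulation_steps in sync with the
# accelerate/DeepSpeed config. All values below are illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama2-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # must equal the value in deepspeed_config.yaml
    num_train_epochs=1,
    fp16=True,
)
```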
Okay, so if you set the same values, does that resolve the issue and the model trains successfully?
System Info
torch 2.4.1
transformers 4.46.0.dev0
trl 0.11.2
peft 0.13.1
GPU V100
CUDA 12.4
nvidia driver 550.54.15
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder
Reproduction
Expected behavior
In the middle of fine-tuning Llama, the above error is reported, and I found that after every few steps the GPU memory grows by a large amount until it reaches the maximum GPU memory. There is another phenomenon: during fine-tuning, only one GPU is at 100% utilization while the others are at 0%. Moreover, the GPU that is at 100% is not fixed and changes each run.
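One simple way to pin down when the jumps happen is to log the CUDA allocator state every step. A minimal sketch; where exactly to call it inside the training loop is left open:

```python
# Minimal sketch: log CUDA allocator stats per step to see when memory jumps.
# Call this from the training loop, e.g. once per optimizer step.
import torch

def log_gpu_memory(step: int) -> None:
    for device in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(device) / 2**30
        reserved = torch.cuda.memory_reserved(device) / 2**30
        print(f"step {step} | cuda:{device} "
              f"allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")
```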