Training stalls when using multiple GPUs #33
Comments
I looked through the history of issues and here are 3 others with the same bug:
In prior tests I was relying on CUDA 11.1. Since the environment.yml lists CUDA 11.3, I thought it would be worth testing with the required CUDA library version. It took some tinkering, but I was able to get CUDA 11.3 working with the latest version of this repo. I'm still seeing the same stalling behavior, though. So it stalls when executing
I tried a few more tests where I set the environment variable to a specific GPU so that the StyleGAN training would only execute on that GPU. I can confirm that both of my GPUs are set up correctly for use in Python. Training runs smoothly on GPU0. Training runs smoothly on GPU1. Training with both GPUs stalls as described earlier.
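The comment above does not name the environment variable. A common way to do this kind of per-GPU isolation is `CUDA_VISIBLE_DEVICES`, so that is what this minimal sketch assumes. The helper names `single_gpu_env` and `train_command` are hypothetical; the training flags are copied from the command in the original post.

```python
import os


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    Assumption: CUDA_VISIBLE_DEVICES is the variable used for isolation.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def train_command(gpus):
    """Build the training command from the original post with a given --gpus value."""
    return [
        "python", "train.py",
        "--outdir=results", "--cfg=stylegan2", "--metrics=None",
        "--data=escher-512.zip", "--kimg=5000", "--gamma=10",
        f"--gpus={gpus}", "--batch=32", "--batch-gpu=8",
        "--resume=stylegan2-ffhq-512x512.pkl",
    ]


if __name__ == "__main__":
    for gpu_index in (0, 1):
        env = single_gpu_env(gpu_index)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}",
              " ".join(train_command(gpus=1)))
        # To actually launch the run, one could call:
        # subprocess.run(train_command(gpus=1), env=env, check=True)
```

If each single-GPU run trains normally, as reported here, the hardware and driver setup for each card is unlikely to be the cause, which points toward the multi-process launch path instead.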
I was finally able to get the training to execute successfully on 2 GPUs after following the directions found over on issue 218. It's a bit of a hack, but it works. FYI, I'm running Windows 10. Would it be possible to implement a more permanent fix for this bug?
That is indeed a bit of a hack. I haven't encountered errors when training with multiple GPUs (RTX 6000 and A40s), so perhaps there's something else I'm missing. I'll try to figure it out, but if you can share more on your environment and such, that'd be helpful to narrow it down.
I saw a comment from a contributor on the StyleGAN3 codebase mentioning that they don't typically run multi-GPU setups on Windows, presumably using Linux instead. So I'm not sure how heavily it's been tested on Windows. The other issues linked above also mention using Windows, so that seems telling. Below is some info about my environment setup and hardware. Let me know if you need any other details.
Software Environment
Hardware
I have been struggling to utilize 2 GPUs when training. After executing the command below, everything loads as usual, and then it stalls when reaching the training step. But when I execute the same command with --gpus=1, it runs perfectly.
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
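For reference, the batch flags in this command can be checked with a little arithmetic. The sketch below assumes the usual StyleGAN2-ADA/StyleGAN3 convention, which this thread does not confirm for this repo: `--batch` is the total batch size across all GPUs, `--batch-gpu` caps how many samples one GPU processes at a time, and the remainder is handled by gradient accumulation. The function name `accumulation_rounds` is hypothetical.

```python
def accumulation_rounds(batch, gpus, batch_gpu):
    """Number of accumulation rounds per GPU for one optimizer step.

    Assumption: the total batch is split evenly across GPUs, and each GPU
    processes its share in chunks of batch_gpu samples.
    """
    if batch % (gpus * batch_gpu) != 0:
        raise ValueError("batch must be a multiple of gpus * batch_gpu")
    per_gpu = batch // gpus
    return per_gpu // batch_gpu


# Values from the command in the post: --batch=32 --batch-gpu=8
print(accumulation_rounds(batch=32, gpus=2, batch_gpu=8))  # 2
print(accumulation_rounds(batch=32, gpus=1, batch_gpu=8))  # 4
```

Under these assumptions the flag combination is valid for both the 1-GPU and 2-GPU runs, so the stall is unlikely to come from an inconsistent batch configuration.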
I'm not running out of VRAM (2x Quadro RTX 5000, 16GB each) or RAM (32GB). Here is a screenshot where you can see both GPUs have 0% load for an extended time:
I believe that both GPUs are set up correctly and StyleGAN2 should be able to use them both. Here is a screenshot after having run:
nvidia-smi
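Since the screenshots are not reproduced here, the same observation (both GPUs sitting at 0% load) could be captured as text. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; it uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`. The helper names and the sample values in the usage note are hypothetical, not taken from the poster's machine.

```python
import subprocess

# Query GPU index, utilization (%), and used memory (MiB) as plain CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse lines such as '0, 0, 1523' into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({
            "index": int(index),
            "util_pct": int(util),
            "mem_used_mib": int(mem),
        })
    return stats


def idle_gpus(stats, threshold_pct=1):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [s["index"] for s in stats if s["util_pct"] < threshold_pct]


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

For example, `idle_gpus(parse_gpu_stats("0, 0, 1523\n1, 0, 1498\n"))` reports both GPUs as idle. Calling `read_gpu_stats()` every few seconds while training is stalled would show whether the load ever rises above zero.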
I did some googling to see if anyone else has had a similar issue, and interestingly this recent issue over on the original repository seems to describe my problem precisely. Yet when I tried the suggested fix, I still experienced the same problem as before: it stalls upon reaching the training step.
Am I missing some detail or is this a bug? Thanks!