Training stalls when using multiple GPU's #33

Open
nuclearsugar opened this issue Apr 4, 2023 · 6 comments

Comments

@nuclearsugar

I have been struggling to use 2 GPUs when training. After executing the command below, everything loads as usual, and then it stalls when reaching the training step. But when I execute the same command with --gpus=1, it runs perfectly.
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
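For reference, here is a minimal sketch of how the batch flags in the command above relate to each other. It assumes (this is not confirmed anywhere in this thread) that train.py requires --batch to be a multiple of --gpus times --batch-gpu and accumulates gradients over the remaining rounds. The function name is hypothetical and is not part of the repo.

```python
def accumulation_rounds(batch: int, gpus: int, batch_gpu: int) -> int:
    """Return how many accumulation rounds each GPU would run per training step.

    Assumption: the total batch is split evenly across GPUs, and each GPU
    processes batch_gpu samples per round.
    """
    per_round = gpus * batch_gpu
    if batch % per_round != 0:
        raise ValueError("--batch must be a multiple of --gpus times --batch-gpu")
    return batch // per_round


if __name__ == "__main__":
    # The stalling configuration from the command above.
    print(accumulation_rounds(batch=32, gpus=2, batch_gpu=8))
    # The single-GPU configuration that runs fine.
    print(accumulation_rounds(batch=32, gpus=1, batch_gpu=8))
```

Under that assumption both configurations divide evenly, which suggests the batch flags themselves are probably not what causes the stall.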

I'm not running out of VRAM (2x Quadro RTX 5000 16GB) or RAM (32GB). Here is a screenshot where you can see that both GPUs sit at 0% load for an extended time:
[Screenshot: 2023-04-04 16_04_10-Greenshot]

I believe both GPUs are set up correctly and that StyleGAN2 should be able to use them both. Here is a screenshot of the output of nvidia-smi:
[Screenshot: 2023-04-04 16_07_56-Window]
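Since the screenshots are hard to share as text, a small script along these lines could capture the same information. This is only a sketch: it relies on the nvidia-smi query options --query-gpu and --format, the helper names are hypothetical, and it assumes every queried field comes back as a plain number (some GPUs report "[N/A]" for certain fields, which this parser does not handle).

```python
import shutil
import subprocess

# nvidia-smi query for one CSV line per GPU: index, utilization %, memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV output of QUERY into a list of per-GPU dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_percent": int(util), "mem_used_mib": int(mem)}
        )
    return stats


def all_idle(stats, threshold_percent=1):
    """Return True if every GPU reports utilization below the threshold."""
    return all(gpu["util_percent"] < threshold_percent for gpu in stats)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        stats = parse_gpu_stats(output)
        for gpu in stats:
            print(gpu)
        print("all GPUs idle:", all_idle(stats))
```

Running it a few times while training appears stuck would show whether both GPUs stay idle, which is what the first screenshot was meant to illustrate.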

I did some googling to see if anyone else has had a similar issue, and interestingly this recent issue over on the original repository seems to describe my problem precisely. Yet when I tried the suggested fix, I still experienced the same problem as before: it stalls upon reaching the training step.

Am I missing some detail or is this a bug? Thanks!

@nuclearsugar
Author

nuclearsugar commented Apr 8, 2023

In prior tests I was relying on CUDA 11.1.

Seeing as the environment.yml lists CUDA 11.3, I thought it would be worth testing with the required CUDA library version. It took some tinkering, but I was able to get CUDA 11.3 working with the latest version of this repo. I'm still seeing the same stalling behavior, though: it stalls with --gpus=2, but --gpus=1 runs smoothly.

@nuclearsugar
Author

I ran a few more tests in which I set the CUDA_VISIBLE_DEVICES environment variable so that StyleGAN training would only execute on a specific GPU. I can confirm that both of my GPUs are set up correctly for use in Python.

Training runs smoothly on GPU0.
--- set CUDA_VISIBLE_DEVICES=0
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training runs smoothly on GPU1.
--- set CUDA_VISIBLE_DEVICES=1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training stalls as described previously.
--- set CUDA_VISIBLE_DEVICES=0,1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
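The three cases above can also be launched from Python, which avoids relying on the shell's `set` command and keeps the variable scoped to one child process. This is a sketch only: the helper name build_run is hypothetical, and the arguments are copied from the commands above.

```python
import os
import subprocess
import sys

# Arguments shared by all three test cases, copied from the commands above.
BASE_ARGS = [
    "train.py", "--outdir=results", "--cfg=stylegan2", "--metrics=None",
    "--data=escher-512.zip", "--kimg=5000", "--gamma=10",
    "--batch=32", "--batch-gpu=8", "--resume=stylegan2-ffhq-512x512.pkl",
]


def build_run(visible_devices):
    """Return (argv, env) for one case, e.g. visible_devices="0,1".

    --gpus is set to the number of device IDs listed in visible_devices.
    """
    num_gpus = len(visible_devices.split(","))
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    argv = [sys.executable, *BASE_ARGS, f"--gpus={num_gpus}"]
    return argv, env


if __name__ == "__main__":
    for devices in ("0", "1", "0,1"):
        argv, env = build_run(devices)
        # Print each case; to actually launch one, call subprocess.run(argv, env=env).
        print(f"CUDA_VISIBLE_DEVICES={devices}:", " ".join(argv[1:]))
```

Passing the environment through `env=` means the setting cannot leak from one test case into the next.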

@nuclearsugar
Author

I was finally able to get training to run successfully on 2 GPUs after following the directions found over on issue 218. It's a bit of a hack, but it works. FYI, I'm running Windows 10.

Would it be possible to implement a more permanent fix for this bug?

@PDillis
Owner

PDillis commented May 17, 2023

That is indeed a bit of a hack. I haven't encountered errors when training with multiple GPUs (RTX 6000 and A40s), so perhaps there's something else I'm missing. I'll try to figure it out, but if you can share more on your environment and such, that'd be helpful to narrow it down.

@nuclearsugar
Author

I saw a comment from a contributor on the StyleGAN3 codebase mentioning that they don't typically run multi-GPU setups on Windows; presumably they use Linux instead. So I'm not sure how heavily multi-GPU training has been tested on Windows. The other issues linked above also mention Windows, so that seems telling.

Below is some info about my environment setup and hardware. Let me know if you need any other details.

Software Environment

  • Windows 10 (21H2)
  • Visual Studio 2019
  • CUDA Toolkit 11.3
  • Instance running within Miniconda3-py39
  • Using the exact same dependencies as listed within environment.yml

Hardware

  • CPU: AMD Ryzen 5950X
  • GPUs: (x2) Nvidia Quadro RTX 5000 16GB
  • RAM: 32GB
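In case it helps with narrowing things down, the same details can be gathered programmatically. The sketch below is an assumption about what would be useful to report, not something requested in this thread; collect_environment is a hypothetical helper. It also records whether PyTorch's distributed package and its NCCL backend are available, since to my knowledge NCCL is generally not shipped in PyTorch's Windows builds, which may or may not be related to the stall.

```python
import platform
import sys


def collect_environment():
    """Collect OS, Python, and (if installed) PyTorch/CUDA details into a dict."""
    info = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        info["torch"] = "not installed"
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built against
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    info["distributed_available"] = dist.is_available()
    if dist.is_available():
        # The backend query only exists when the distributed package is built in.
        info["nccl_available"] = dist.is_nccl_available()
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed output into the issue gives a text version of the environment that is easier to compare against a working Linux setup.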
