Training stalls when using multiple GPUs #33

nuclearsugar · 2023-04-04T21:37:02Z

I have been struggling to use 2 GPUs when training. After I execute the command below, everything loads as usual, and then it stalls on reaching the training step. When I run the same command with --gpus=1 instead, it runs perfectly.
python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
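For reference, a minimal sketch of how the three batch-related flags in that command relate to each other. It assumes the total --batch is split evenly across GPUs and then into chunks of --batch-gpu samples, so that --batch must be a multiple of --gpus times --batch-gpu. The function name accumulation_rounds is hypothetical and is not part of train.py.

```python
def accumulation_rounds(batch: int, gpus: int, batch_gpu: int) -> int:
    """Return how many accumulation rounds each GPU would run per training step.

    Assumption (not read from train.py): the total batch is divided evenly
    across GPUs, and each GPU processes its share in chunks of batch_gpu.
    """
    if gpus < 1 or batch_gpu < 1:
        raise ValueError("gpus and batch_gpu must be positive")
    if batch % (gpus * batch_gpu) != 0:
        raise ValueError("batch must be a multiple of gpus * batch_gpu")
    return batch // (gpus * batch_gpu)


# The failing configuration: --batch=32 --gpus=2 --batch-gpu=8
print(accumulation_rounds(32, 2, 8))  # 2
# The working configuration: --batch=32 --gpus=1 --batch-gpu=8
print(accumulation_rounds(32, 1, 8))  # 4
```

Under that assumption both configurations are valid flag combinations, so the flag values themselves are unlikely to be what causes the stall.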

I'm not running out of VRAM (2x Quadro RTX 5000 16GB) or RAM (32GB). Here is a screenshot where you can see both GPUs at 0% load for an extended time:

I believe that both GPUs are set up correctly and StyleGAN2 should be able to use them both. Here is a screenshot after running:
nvidia-smi

I did some searching to see if anyone else has had a similar issue, and a recent issue on the original repository seems to describe my problem precisely. Yet when I tried the suggested fix, I still saw the same problem: training stalls on reaching the training step.

Am I missing some detail or is this a bug? Thanks!


nuclearsugar · 2023-04-08T00:01:15Z

I looked through the history of issues and here are 3 others with the same bug:

nuclearsugar · 2023-04-08T03:32:15Z

In prior tests I was relying on CUDA 11.1.

Since environment.yml lists CUDA 11.3, I thought it would be worth testing with the required CUDA version. It took some tinkering, but I was able to get CUDA 11.3 working with the latest version of this repo. I'm still seeing the same stalling behavior: it stalls with --gpus=2, but --gpus=1 runs smoothly.

nuclearsugar · 2023-04-08T05:39:55Z

I ran a few more tests where I set the CUDA_VISIBLE_DEVICES environment variable so that StyleGAN training would only execute on a specific GPU. From these I can confirm that both of my GPUs are set up correctly for use in Python.

Training runs smoothly on GPU0.
--- set CUDA_VISIBLE_DEVICES=0
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training runs smoothly on GPU1.
--- set CUDA_VISIBLE_DEVICES=1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=1 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl

Training stalls as described previously when both GPUs are visible.
--- set CUDA_VISIBLE_DEVICES=0,1
--- python train.py --outdir=results --cfg=stylegan2 --metrics=None --data=escher-512.zip --kimg=5000 --gamma=10 --gpus=2 --batch=32 --batch-gpu=8 --resume=stylegan2-ffhq-512x512.pkl
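The three runs above can also be scripted so each one gets its own CUDA_VISIBLE_DEVICES value without changing the shell. This is only a sketch: the helper name build_run is hypothetical, and the arguments are copied from the commands above. It only prints the commands; pass the returned pair to subprocess.run(cmd, env=env) to actually launch a run.

```python
import os
import sys

# Arguments shared by all three runs, copied from the commands above.
BASE_ARGS = [
    "train.py", "--outdir=results", "--cfg=stylegan2", "--metrics=None",
    "--data=escher-512.zip", "--kimg=5000", "--gamma=10",
    "--batch=32", "--batch-gpu=8", "--resume=stylegan2-ffhq-512x512.pkl",
]


def build_run(visible_devices: str):
    """Return (command, environment) for one run limited to the given GPUs.

    visible_devices is a comma-separated list such as "0" or "0,1";
    --gpus is set to the number of devices listed.
    """
    num_gpus = len(visible_devices.split(","))
    cmd = [sys.executable, *BASE_ARGS, f"--gpus={num_gpus}"]
    # Copy the current environment and override only CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    return cmd, env


if __name__ == "__main__":
    for devices in ("0", "1", "0,1"):
        cmd, env = build_run(devices)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']}", " ".join(cmd[1:]))
```

Setting the variable per process rather than with "set" keeps one test from leaking its device selection into the next.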

nuclearsugar · 2023-04-14T00:49:01Z

I was finally able to get training to run successfully on 2 GPUs after following the directions in issue 218. It's a bit of a hack, but it works. FYI, I'm running Windows 10.

Would it be possible to implement a more permanent fix for this bug?

PDillis · 2023-05-17T22:28:37Z

That is indeed a bit of a hack. I haven't encountered errors when training with multiple GPUs (RTX 6000 and A40s), so perhaps there's something else I'm missing. I'll try to figure it out, but if you can share more on your environment and such, that'd be helpful to narrow it down.

nuclearsugar · 2023-05-18T00:01:19Z

I saw a comment from a contributor on the StyleGAN3 codebase mentioning that they don't typically run multi-GPU setups on Windows, presumably using Linux instead. So I'm not sure how heavily it has been tested on Windows. The other issues linked above also mention Windows, which seems telling.

Below is some info about my environment setup and hardware. Let me know if you need any other details.

Software Environment

Windows 10 (21H2)
Visual Studio 2019
CUDA Toolkit 11.3
Instance running within Miniconda3-py39
Using the exact same dependencies as listed within environment.yml

Hardware

CPU: AMD Ryzen 5950X
GPUs: 2x Nvidia Quadro RTX 5000 16GB
RAM: 32GB
