DataParallel is used by auto_model with single GPU #2447
Comments
@H4dr1en I haven't yet checked the code, but I agree that it seems useless to wrap the model with DataParallel here.

EDIT: see ignite/ignite/distributed/auto.py, line 228 at 6d83dd7: it takes all available GPUs and wraps the model with DataParallel on all of them.
Yes, I think this is the simplest way to pick the GPU to use without updating the ignite code and API. I think this is the expected behaviour if you have N GPUs, start the script without any restriction on how many GPUs to use, and use idist.auto_model.
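For illustration, a minimal sketch of that workaround (not code from the thread; the variable must be set before CUDA is initialized, so setting it at launch time, e.g. CUDA_VISIBLE_DEVICES=0 python main.py, is the safest option):

```python
import os

# Make only the first GPU visible to this process. This must happen before
# torch initializes CUDA, so set it before any CUDA call (or at launch time).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # prints 1, even on a 4-GPU machine
```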
DataParallel seems to drastically slow down the model training for a single GPU. I tried with the cifar10 example from the pytorch-ignite/examples repository on this fork, running the script on a g4dn.12xlarge AWS instance (4x T4 GPUs), and compared the default training time reported by ignite against a run with CUDA_VISIBLE_DEVICES set to "0".

Setting CUDA_VISIBLE_DEVICES is the only change I made, and the default run is much slower. If this truly comes from DataParallel being inefficient for a single GPU, I see no reason why it should be applied automatically; I'd expect idist.auto_model not to wrap the model in this case.
Thanks for the details @H4dr1en! First, PyTorch recommends using DistributedDataParallel rather than DataParallel for multi-GPU training. If I understand correctly your code, your infrastructure and the way you launched it, then without any restriction on visible devices idist.auto_model sees 4 GPUs and wraps the model with DataParallel on all of them. When you specify CUDA_VISIBLE_DEVICES="0", only a single GPU is visible, so the model is not wrapped at all.
What do you think?
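A quick way to confirm which path auto_model took (a minimal illustrative sketch, not code from the thread; it assumes torch and ignite are installed on the multi-GPU machine):

```python
import torch.nn as nn
import ignite.distributed as idist

model = idist.auto_model(nn.Linear(10, 2))

# With several visible GPUs and no distributed backend, auto_model is expected
# to return an nn.DataParallel wrapper; with CUDA_VISIBLE_DEVICES="0" (or on a
# single-GPU machine) it returns the plain module moved to the device.
print(type(model))
print(isinstance(model, nn.DataParallel))
```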
Thanks for clarifying @vfdev-5! Indeed I can observe that I am training on 4 GPUs with DataParallel by default. I understand the logic of idist.auto_model; my confusion comes from the fact that when I set backend=None and nproc_per_node=1, I still end up training on all available GPUs.
Yeah, I was also confused about how you are using nproc_per_node here.
I'd expect training on 4 GPUs in DDP mode (not DP) to be faster than 1 GPU. In your case, I suppose the slowdown comes from DataParallel itself.
The subprocesses have args.local_rank defined, so they just start the training (skip L435)
Yes, on the same page. What I want to raise in this issue is the fact that DataParallel is applied automatically even when the intent is to train on a single GPU.
Yes, it makes perfect sense, but how can ignite know that you have several GPUs available yet want to use only one? In addition, there can be (old) cases where we would like to use DP: one process using multiple GPUs.
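For reference, a minimal sketch of that "old" single-process, multi-GPU pattern in plain PyTorch (illustrative only, not code from the thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# One process drives all visible GPUs: DataParallel replicates the module on
# each device and splits every input batch among the replicas.
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
out = model(torch.randn(8, 10, device=device))  # batch is scattered across GPUs
```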
Can we maybe check the world_size?
Is it justified to keep this use case, now that DDP is out and is faster than DP? I don't fully understand the different use cases, so I might be wrong, in which case I understand that we should not change this behaviour.
Yes, we are using world_size to set up DDP. If world_size is defined and > 1, then there is a distributed processing group and there is no point in using DP. Here is the code: ignite/ignite/distributed/auto.py, lines 201 to 230 at 6d83dd7.

If there is no distributed processing group but we have more than one GPU available, we can use DP. To enable a distributed processing group, the user can specify the backend in idist.Parallel (see the sketch below).

In our case we leave the decision to the user. By launching a single process without specifying a backend, the user implicitly accepts DP over all visible GPUs; restricting CUDA_VISIBLE_DEVICES (or launching with a distributed backend) avoids it.
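A rough sketch of that decision logic (a paraphrase of what is described above, not ignite's actual implementation in ignite/distributed/auto.py):

```python
import torch
import torch.nn as nn
import ignite.distributed as idist

def sketch_auto_model(model: nn.Module) -> nn.Module:
    # Paraphrase of the branching described in this comment.
    model = model.to(idist.device())
    if idist.get_world_size() > 1:
        # A distributed processing group exists (e.g. the script was launched
        # through idist.Parallel(backend="nccl", ...)): wrap with DDP.
        return nn.parallel.DistributedDataParallel(model)
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        # No process group, but several visible GPUs: wrap with DataParallel.
        return nn.DataParallel(model)
    # Single visible GPU (or CPU): return the module unwrapped.
    return model
```

So the DDP branch is only reachable when a backend has been set up (for example via idist.Parallel(backend="nccl")); with backend=None and no CUDA_VISIBLE_DEVICES restriction, a multi-GPU machine lands in the DataParallel branch.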
🐛 Bug description
I am not sure whether it is a bug or a feature:
DataParallel is being applied/patched by idist.auto_model in the context of a single GPU (backend=None, nproc_per_node=1). What is the reason behind this choice? Does it bring any speed improvements?

The only way to prevent it is to set os.environ["CUDA_VISIBLE_DEVICES"] = "0" for single-GPU contexts.

Environment
How installed (conda, pip, source): pip