
Extremely slow training speed when using multiple GPU #23

Open
DekuLiuTesla opened this issue Apr 10, 2024 · 7 comments

Comments

@DekuLiuTesla

Hi, thanks for open-sourcing such great work!

The multi-GPU parallel training you introduced is indeed useful, but I found it extremely slow. Take the rubble dataset from Mill19 as an example: I trained a 3DGS model for 60,000 iterations. The first half of training, with densification, took 0.8 hours on a single A100, but the second half took 10.1 hours on 4 A100s. Here is my script; large_scale.yaml only sets optimization parameters.

[screenshot: training script]

I'm wondering if you have run into similar problems or have any ideas about solutions. Thanks!

@yzslab (Owner) commented Jun 23, 2024

I seldom use DDP; I have tested it on several simple scenes but did not run into the problem you mentioned, so I had no ideas about it when you opened this issue.
I happened to see Lightning-AI/pytorch-lightning#17212 (comment) just now. The slow speed may be related to the find_unused_parameters option being enabled.
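
As a rough illustration (a minimal sketch using plain PyTorch Lightning, not this repository's code), the option lives on the DDP strategy and can be turned off when every parameter is guaranteed to receive a gradient on every step:

# Minimal sketch, plain PyTorch Lightning (not this repository's code).
# find_unused_parameters=True makes DDP traverse the autograd graph every step
# to look for parameters that received no gradient; disabling it removes that
# overhead, but is only safe if all parameters really do get gradients each step.
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

trainer = pl.Trainer(
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=False),
)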

@insomniaaac

I'm seeing the same issue.
A single small-scene dataset trains in about 30 minutes without DDP on 1 RTX 4090, but takes ~3.5 hours with DDP on 8 RTX 4090s.

@yzslab (Owner) commented Sep 16, 2024

> I'm seeing the same issue. A single small-scene dataset trains in about 30 minutes without DDP on 1 RTX 4090, but takes ~3.5 hours with DDP on 8 RTX 4090s.

Try "2.16. New Multiple GPU training strategy".

@insomniaaac

> Try "2.16. New Multiple GPU training strategy".

Yeah, the situation I mentioned above is already with the new strategy.

@yzslab (Owner) commented Sep 16, 2024

> Yeah, the situation I mentioned above is already with the new strategy.

Would you mind providing the config file content and command you used to run the training?

@insomniaaac

# common configuration
cache_all_images: true
max_steps: 50000
save_iterations:
- 7000
- 30000
- 50000

# dataset configuration
data:
  val_max_num_images_to_cache: -1
  test_max_num_images_to_cache: -1
  parser: internal.dataparsers.colmap_dataparser.Colmap

# trainer configuration
trainer:
  strategy:
    class_path: internal.mp_strategy.MPStrategy
  devices: -1

# model configuration
model:
  # renderer configuration
  renderer: 
    class_path: internal.renderers.gsplat_distributed_renderer.GSplatDistributedRenderer
  
  # metric configuration
  metric:
    class_path: internal.metrics.mcmc_metrics.MCMCMetrics
  
  # density controller configuration
  density: 
    class_path: internal.density_controllers.distributed_mcmc_density_controller.DistributedMCMCDensityController
    init_args:
      cap_max: 300000
      densify_until_iter: 45000

I modified mcmc_density_controller so that it can work in distributed training.
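
The gist of the change (a rough sketch, not the actual patch) is that cap_max has to be treated as a global budget: each rank holds only a shard of the Gaussians, so the controller needs to all-reduce the per-rank counts before deciding how many new Gaussians a rank may add.

# Hedged sketch only, not the actual modification to mcmc_density_controller.
# cap_max is interpreted as a global cap: per-rank Gaussian counts are summed
# across all processes before computing the remaining local budget.
import torch
import torch.distributed as dist

def remaining_local_budget(local_count: int, cap_max: int, device="cuda") -> int:
    """How many new Gaussians this rank may still add under a global cap."""
    total = torch.tensor([local_count], dtype=torch.long, device=device)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)   # global Gaussian count across ranks
    remaining = max(cap_max - int(total.item()), 0)
    return remaining // dist.get_world_size()      # split the budget evenly (assumption)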

@yzslab (Owner) commented Sep 16, 2024

@insomniaaac

I do not have any ideas about your case either.
I tried both the 8-GPU and single-GPU modes with the Family scene just now. The 8-GPU mode is slower, but not by as much as you mentioned; it takes about 1.5x the time.
[screenshot: training time comparison]
The GPUs are spread across 4 nodes, and all of the nodes are connected via InfiniBand.

Maybe you can enable the profiler with the option --trainer.profiler simple to see which part consumes most of the time, or try Grendel-GS.
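
For example (a sketch based on the config format shown above; equivalent to passing --trainer.profiler simple on the command line), the profiler can also be enabled in the YAML config:

# trainer configuration (sketch): the simple profiler prints a per-hook
# timing summary at the end of training
trainer:
  profiler: simple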
