
Extremely slow training speed when using multiple GPU #23

Open
DekuLiuTesla opened this issue Apr 10, 2024 · 7 comments

Comments

@DekuLiuTesla

Hi, thanks for open-sourcing such great work!

The multi-GPU parallel training you introduced is indeed useful, but I found it extremely slow. Take the rubble dataset from Mill19 as an example: I trained a 3DGS model for 60,000 iterations. The first half of training, with densification, took 0.8 hours on a single A100, but the second half took 10.1 hours on 4 A100s. Here is my script; large_scale.yaml only sets optimization parameters.

[screenshot: training script]

I'm wondering if you have run into similar problems or have any ideas about solutions. Thanks!

@yzslab (Owner) commented Jun 23, 2024

I seldom use DDP; I have tested it on several simple scenes but did not run into the problem you mentioned, so I had no ideas about it when you opened this issue.
I happened to see Lightning-AI/pytorch-lightning#17212 (comment) just now. The slow speed may be related to the find_unused_parameters option being enabled.
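
As a rough illustration (a minimal sketch using plain PyTorch Lightning, not this repository's code), the option lives on the DDP strategy and can be turned off when every parameter is guaranteed to receive a gradient on every step:

# Minimal sketch, plain PyTorch Lightning (not this repository's code).
# find_unused_parameters=True makes DDP traverse the autograd graph every step
# to look for parameters that received no gradient; disabling it removes that
# overhead, but is only safe if all parameters really do get gradients each step.
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

trainer = pl.Trainer(
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=False),
)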

@insomniaaac

I'm seeing the same issue.
A single small-scene dataset trains in about 30 minutes without DDP on 1 RTX 4090, but takes ~3.5 hours with DDP on 8 RTX 4090s.

@yzslab (Owner) commented Sep 16, 2024

> I'm seeing the same issue. A single small-scene dataset trains in about 30 minutes without DDP on 1 RTX 4090, but takes ~3.5 hours with DDP on 8 RTX 4090s.

Try "2.16. New Multiple GPU training strategy".

@insomniaaac

> Try "2.16. New Multiple GPU training strategy".

Yeah, the situation I mentioned above is already with the new strategy.

@yzslab (Owner) commented Sep 16, 2024

> Yeah, the situation I mentioned above is already with the new strategy.

Would you mind providing the config file content and command you used to run the training?

@insomniaaac

# common configuration
cache_all_images: true
max_steps: 50000
save_iterations:
- 7000
- 30000
- 50000

# dataset configuration
data:
  val_max_num_images_to_cache: -1
  test_max_num_images_to_cache: -1
  parser: internal.dataparsers.colmap_dataparser.Colmap

# trainer configuration
trainer:
  strategy:
    class_path: internal.mp_strategy.MPStrategy
  devices: -1

# model configuration
model:
  # renderer configuration
  renderer: 
    class_path: internal.renderers.gsplat_distributed_renderer.GSplatDistributedRenderer
  
  # metric configuration
  metric:
    class_path: internal.metrics.mcmc_metrics.MCMCMetrics
  
  # density controller configuration
  density: 
    class_path: internal.density_controllers.distributed_mcmc_density_controller.DistributedMCMCDensityController
    init_args:
      cap_max: 300000
      densify_until_iter: 45000

I modified mcmc_density_controller so that it can work in distributed training.
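
The gist of the change (a rough sketch, not the actual patch) is that cap_max has to be treated as a global budget: each rank holds only a shard of the Gaussians, so the controller needs to all-reduce the per-rank counts before deciding how many new Gaussians a rank may add.

# Hedged sketch only, not the actual modification to mcmc_density_controller.
# cap_max is interpreted as a global cap: per-rank Gaussian counts are summed
# across all processes before computing the remaining local budget.
import torch
import torch.distributed as dist

def remaining_local_budget(local_count: int, cap_max: int, device="cuda") -> int:
    """How many new Gaussians this rank may still add under a global cap."""
    total = torch.tensor([local_count], dtype=torch.long, device=device)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)   # global Gaussian count across ranks
    remaining = max(cap_max - int(total.item()), 0)
    return remaining // dist.get_world_size()      # split the budget evenly (assumption)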

@yzslab (Owner) commented Sep 16, 2024

@insomniaaac

I do not have any ideas about your case either.
I tried both the 8-GPU and single-GPU modes with the Family scene just now. The 8-GPU mode is slower, but not by as much as you mentioned; it takes about 1.5x the time.
[screenshot: training time comparison]
The GPUs are spread across 4 nodes, and all of the nodes are connected via InfiniBand.

Maybe you can enable the profiler with the option --trainer.profiler simple to see which part consumes most of the time, or try Grendel-GS.
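
For example (a sketch based on the config format shown above; equivalent to passing --trainer.profiler simple on the command line), the profiler can also be enabled in the YAML config:

# trainer configuration (sketch): the simple profiler prints a per-hook
# timing summary at the end of training
trainer:
  profiler: simple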
