Hi @wigging, did you verify that all 8 GPUs were actually used during training? For MatterGen we always use a total batch size of 512, so with 4 GPUs that is a per-GPU batch size of 128, and with 8 GPUs it is 64. You can increase the batch size to make better use of your GPUs' memory, but you might have to tune the learning rate for the different batch size.
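As a rough illustration of the arithmetic in that reply (this is not code from the MatterGen repo; the batch-size numbers come from the reply above, while the base learning rate is just a placeholder), the per-GPU batch size and a linearly scaled learning rate could be worked out like this:

```python
# Sketch of the batch-size / learning-rate arithmetic described in the reply.
# TOTAL_BATCH_SIZE = 512 and the GPU counts come from the reply above;
# BASE_LR is a placeholder value for illustration, not a MatterGen default.

TOTAL_BATCH_SIZE = 512
BASE_LR = 1e-4

for num_gpus in (4, 8):
    per_gpu_batch = TOTAL_BATCH_SIZE // num_gpus
    print(f"{num_gpus} GPUs -> per-GPU batch size of {per_gpu_batch}")

# If you raise the total batch size to fill more of each GPU's 80 GB, a common
# heuristic (the linear scaling rule) is to scale the learning rate by the same
# factor and then tune it from there.
new_total_batch = 1024
scaled_lr = BASE_LR * new_total_batch / TOTAL_BATCH_SIZE
print(f"total batch {new_total_batch} -> starting LR guess {scaled_lr:.1e}")
```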
I trained MatterGen with the MP-20 data using 4 GPUs, where each GPU is an A100 with 80 GB of memory. The training stopped after about 4 hours and 50 minutes when it reached 899 epochs, which is the default value for `max_epochs` in the config file. I used the following command to run the training:

Next, I ran the same training using 8 GPUs, where each GPU is an A100 with 80 GB of memory. The training stopped after about 4 hours and 53 minutes when it reached 899 epochs. I used the command shown below to run the training:
I expected the training time to be shorter with 8 GPUs compared to 4 GPUs, but it was about the same. Is MatterGen limited by the amount of memory that it can use? Is there a config setting that I need to adjust to take advantage of the extra memory provided by multiple GPUs?
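For context on why the wall-clock times can end up so close, here is a minimal sketch of the step-count arithmetic, assuming roughly 45,000 training structures in MP-20 (the exact split size may differ) and the fixed total batch size of 512 mentioned in the reply above:

```python
import math

DATASET_SIZE = 45_000      # assumed approximate MP-20 training-set size
TOTAL_BATCH_SIZE = 512     # fixed total batch size, independent of GPU count
MAX_EPOCHS = 899           # max_epochs value reported in the question

steps_per_epoch = math.ceil(DATASET_SIZE / TOTAL_BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS

# With the total batch size fixed at 512, the optimizer takes the same number of
# steps whether 4 or 8 GPUs are used; only the per-GPU share of each step changes.
print(f"steps per epoch: {steps_per_epoch}, total optimizer steps: {total_steps}")
```

Since the step count is fixed, the total training time only drops as much as the per-step time does; if a per-GPU batch of 64 underutilizes an 80 GB A100, as the reply above suggests, the 4-GPU and 8-GPU runs can end up looking almost identical.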