Hi @wigging, did you verify that all 8 GPUs were actually used during training? For MatterGen we always use a total batch size of 512, so with 4 GPUs that is a per-GPU batch size of 128, and with 8 GPUs it is 64. You can increase the batch size to make better use of your GPUs' memory, but you might have to tune the learning rate for the different batch size.
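As a rough illustration of the arithmetic in that reply (this is not code from the MatterGen repo; the batch-size numbers come from the reply above, while the base learning rate is just a placeholder), the per-GPU batch size and a linearly scaled learning rate could be worked out like this:

```python
# Sketch of the batch-size / learning-rate arithmetic described in the reply.
# TOTAL_BATCH_SIZE = 512 and the GPU counts come from the reply above;
# BASE_LR is a placeholder value for illustration, not a MatterGen default.

TOTAL_BATCH_SIZE = 512
BASE_LR = 1e-4

for num_gpus in (4, 8):
    per_gpu_batch = TOTAL_BATCH_SIZE // num_gpus
    print(f"{num_gpus} GPUs -> per-GPU batch size of {per_gpu_batch}")

# If you raise the total batch size to fill more of each GPU's 80 GB, a common
# heuristic (the linear scaling rule) is to scale the learning rate by the same
# factor and then tune it from there.
new_total_batch = 1024
scaled_lr = BASE_LR * new_total_batch / TOTAL_BATCH_SIZE
print(f"total batch {new_total_batch} -> starting LR guess {scaled_lr:.1e}")
```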
I trained MatterGen with the MP-20 data using 4 GPUs, where each GPU is an A100 with 80 GB of memory. The training stopped after about 4 hours and 50 minutes when it reached 899 epochs, which is the default value for `max_epochs` in the config file. I used the following command to run the training:

Next, I ran the same training using 8 GPUs, where each GPU is an A100 with 80 GB of memory. The training stopped after about 4 hours and 53 minutes when it reached 899 epochs. I used the command shown below to run the training:
I expected the training time to be shorter with 8 GPUs compared to 4 GPUs, but it was about the same. Is MatterGen limited by the amount of memory that it can use? Is there a config setting that I need to adjust to take advantage of the extra memory provided by multiple GPUs?
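For context on why the wall-clock times can end up so close, here is a minimal sketch of the step-count arithmetic, assuming roughly 45,000 training structures in MP-20 (the exact split size may differ) and the fixed total batch size of 512 mentioned in the reply above:

```python
import math

DATASET_SIZE = 45_000      # assumed approximate MP-20 training-set size
TOTAL_BATCH_SIZE = 512     # fixed total batch size, independent of GPU count
MAX_EPOCHS = 899           # max_epochs value reported in the question

steps_per_epoch = math.ceil(DATASET_SIZE / TOTAL_BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS

# With the total batch size fixed at 512, the optimizer takes the same number of
# steps whether 4 or 8 GPUs are used; only the per-GPU share of each step changes.
print(f"steps per epoch: {steps_per_epoch}, total optimizer steps: {total_steps}")
```

Since the step count is fixed, the total training time only drops as much as the per-step time does; if a per-GPU batch of 64 underutilizes an 80 GB A100, as the reply above suggests, the 4-GPU and 8-GPU runs can end up looking almost identical.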