laurendonoghue changed the title from "instance recommendation for using full graph" to "instance recommendation for full graph" on Feb 6, 2025.
Congratulations on this exciting new method!
Could you please clarify what kind of GPU resources are recommended for training with the full graph? Since kgwas.py's train() does not allow specifying multiple CUDA devices, does that suggest training should be possible on a single device?
I am continually running into CUDA OOM errors even after lowering the batch size significantly. I can train through 10 epochs (memory peaking near 21 GB of the 24 GB available); however, for reasons unclear to me, it errors out during the final step of saving the model predictions/results. If relevant, I also had to set num_workers=1 to avoid DataLoader errors.
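For reference, here is a rough sketch of how I am invoking training. Aside from train(), batch_size, and num_workers (the parameters I actually changed), the names below are placeholders approximating my setup rather than the exact KGWAS API:

```python
# Rough sketch of my setup (placeholder names; only train(), batch_size,
# and num_workers reflect parameters I actually changed).
from kgwas import KGWAS, KGWAS_Data  # assumed import path

data = KGWAS_Data(data_path='./data')  # hypothetical data wrapper
data.load_kg()                         # full graph, not the "fast" Enformer/ESM graph

run = KGWAS(data, device='cuda:0')     # single 24 GB GPU
run.initialize_model()
run.train(
    epochs=10,
    batch_size=128,   # lowered well below the default; still OOMs at the final save step
    num_workers=1,    # had to set this to avoid DataLoader errors
)
```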
I have been able to train through 10 epochs and save results with the full graph on a high-memory CPU instance, so I know I am not hitting other issues, but that is of course undesirably slow.
Additionally, given that all of the benchmarking in the paper was done with the full graph, did you do any benchmarking comparing the "fast" Enformer/ESM graph against the full graph?
Thanks for your guidance!