RuntimeError: CUDA out of memory when continuing training? #319

Open
watertianyi opened this issue Mar 14, 2023 · 5 comments

@watertianyi

I have already trained on 3,000 image pairs and want to continue training after adding another 2,000 pairs, using the following command:
python train.py --name comics --dataroot ./datasets/comics3Kto5K --loadSize 512 --label_nc 0 --no_instance --netG local --load_pretrain checkpoints0310/comics/

But the error is as follows:
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 11.76 GiB total capacity; 8.86 GiB already allocated; 113.56 MiB free; 8.91 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
What is going on?

@takuyaliu

Restart
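
A rough way to read the numbers in the error message above, offered as a sketch rather than a diagnosis: the snippet below parses the message, then computes how much memory PyTorch has reserved but not allocated, and how much of the card is neither free nor reserved by this process. The helper `parse_oom` and its regular expressions are hypothetical, written only for this message format.

```python
import re

# Error text copied from the report above (shortened to the part with numbers).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; "
    "11.76 GiB total capacity; 8.86 GiB already allocated; "
    "113.56 MiB free; 8.91 GiB reserved in total by PyTorch)"
)

def to_mib(value, unit):
    """Convert a '<number> MiB|GiB' pair to MiB."""
    return float(value) * (1024 if unit == "GiB" else 1)

def parse_oom(message):
    """Hypothetical helper: pull the memory figures out of the message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {
        key: to_mib(*re.search(pattern, message).groups())
        for key, pattern in patterns.items()
    }

stats = parse_oom(MESSAGE)

# Memory cached by PyTorch's allocator but not currently backing tensors.
slack_mib = stats["reserved"] - stats["allocated"]

# Memory on the card that is neither free nor reserved by this process.
outside_mib = stats["total"] - stats["reserved"] - stats["free"]

print(f"reserved - allocated: {slack_mib:.1f} MiB")
print(f"held outside this process's allocator: {outside_mib:.1f} MiB")
```

With the figures from this issue, the reserved-but-unallocated slack is about 51 MiB, so the "reserved >> allocated" condition from the message does not appear to hold. Roughly 2.7 GiB of the card is held outside this process's PyTorch allocator. That may be the CUDA context, a display server, or an earlier training process that is still alive; the last case would be consistent with the advice to restart. Checking `nvidia-smi` for other processes is one way to find out.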

@watertianyi
Author

@takuyaliu If training is interrupted, can't I continue training from the breakpoint?

@takuyaliu

Of course you can resume training from the latest checkpoint. The relevant settings are defined in the base options and train options.

@watertianyi
Author

@takuyaliu
How should I set the parameters? Thank you for your reply!

@takuyaliu

> @takuyaliu How should I set the parameters? Thank you for your reply!

You can find them in 'options/train_options.py':

    # for training
    self.parser.add_argument('--continue_train', action='store_true', help='continue training: load the latest model')
    self.parser.add_argument('--which_epoch', type=str, default='latest', help='which epoch to load? set to latest to use latest cached model')

Try adding '--continue_train --which_epoch latest' to the end of your training command.
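
To illustrate how these two flags are parsed, here is a minimal standalone sketch that reuses the two `add_argument` lines quoted above on a plain `argparse.ArgumentParser`. The `parse` helper is hypothetical, and the sketch does not import or run the repository's `train.py`; the other flags from the original command are simply passed through as unknown arguments.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# The two options quoted from options/train_options.py in this thread.
parser.add_argument('--continue_train', action='store_true',
                    help='continue training: load the latest model')
parser.add_argument('--which_epoch', type=str, default='latest',
                    help='which epoch to load? set to latest to use latest cached model')

def parse(arg_string):
    """Hypothetical helper: parse a command tail, ignoring flags not defined here."""
    opt, _unknown = parser.parse_known_args(shlex.split(arg_string))
    return opt

# Command tail without the resume flags.
fresh = parse("--name comics --label_nc 0 --no_instance --netG local")

# Same tail with the suggested flags appended.
resumed = parse("--name comics --label_nc 0 --no_instance --netG local "
                "--continue_train --which_epoch latest")

print(fresh.continue_train, fresh.which_epoch)
print(resumed.continue_train, resumed.which_epoch)
```

Because `--continue_train` is a `store_true` flag, it is off unless it appears on the command line. `--which_epoch` already defaults to `latest`, so passing `--which_epoch latest` explicitly is redundant but harmless. What `train.py` does with these values, such as which checkpoint directory it reads, is not shown in this thread.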
