OOM with clearState() and DataParallelTable #441

Open
joeyhng opened this issue Feb 11, 2017 · 2 comments

joeyhng commented Feb 11, 2017

I wanted to reduce the size of the model for saving, but I get an out-of-memory error when calling clearState().

Here is the code that gives the error:

require 'nn'
require 'cunn'
require 'cudnn'

model = nn.Sequential()
          :add(cudnn.SpatialConvolution(3, 512, 3,3, 1,1, 1,1))
          :add(nn.View(-1):setNumInputDims(3))
          :add(nn.Mean(1,1)):cuda()

model = nn.DataParallelTable(1, true, true)
          :add(model, {1,2,3,4})
          :threads(function()
            cudnn = require 'cudnn'
          end)

input = torch.randn(128,3,224,224):cuda()
grad = torch.randn(128):cuda()

for i = 1,1000 do
  print('iteration ' .. tostring(i))
  model:forward(input)
  model:backward(input, grad)
  model:clearState()   
  collectgarbage() collectgarbage()
end

From my understanding, the code should be able to finish running. However, the memory is not freed successfully, and an out-of-memory error appears after a few iterations. With 4 K80s I hit the out-of-memory error at the 7th iteration.

The code seems to run fine without DataParallelTable, so I suspect the problem is in DataParallelTable.

Thanks!

achalddave commented Feb 17, 2017

I have what seems to be the exact same issue. I thought I was going crazy; my model trains fine for many iterations, but when I call clearState, save the model, and then restart training, I run out of memory.

Unfortunately, I'm not able to replicate it with the code you posted on a Titan X (Pascal) with a higher batch size, but I do see the same symptoms with a model containing custom layers. I've been trying on-and-off for days to see if I can isolate this issue, but haven't had luck yet :(. Any leads or suggestions would be great!

@achalddave

Update: For me, the issue seems to be with nn.MapTable, and not DataParallelTable. I've filed my issue separately here: torch/nn#1141

achalddave added a commit to achalddave/predictive-corrective that referenced this issue Jul 20, 2017
Calling clearState() seems to cause issues that, after 4-5 days of
debugging, I haven't been able to fix. See, for example:

torch/nn#1141
torch/cunn#441

Further, it's unclear to me whether `getParameters` and memory management in
general work well when a call to `clearState` can destroy modules (and
therefore weight tensors). The easiest solution to all of this is simply
to never call clearState on the model while it is training.

When saving the model, we create a copy of it on the CPU, and call
clearState on this CPU copy, which we then save to disk.
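
For reference, here is a minimal sketch of that save path (the helper name, the checkpoint path, and the use of clone()/get(1)/cudnn.convert are illustrative assumptions, not taken from the repository): clone the model, move the clone to the CPU, call clearState() only on the clone, and write that copy to disk, so the model being trained is never modified.

require 'nn'
require 'cunn'
require 'cudnn'

-- Illustrative helper (hypothetical name): save a cleared CPU copy of the
-- model without calling clearState() on the model that is still training.
local function saveCheckpoint(model, path)
  -- For a DataParallelTable, clone the wrapped module rather than the table.
  local m = torch.isTypeOf(model, 'nn.DataParallelTable') and model:get(1) or model
  local cpuCopy = m:clone()
  -- Optional (assumption): convert cudnn layers back to nn so the checkpoint
  -- can be loaded on a machine without cudnn.
  cudnn.convert(cpuCopy, nn)
  cpuCopy:float()       -- move parameters and buffers to the CPU
  cpuCopy:clearState()  -- drop intermediate buffers (output, gradInput, ...)
  torch.save(path, cpuCopy)
end

-- saveCheckpoint(model, 'checkpoint.t7')

Note that clone() briefly allocates a second copy of the parameters on the GPU, so for very large models it may be preferable to copy the weights over explicitly; the key point from the commit above is simply that clearState() never runs on the live training model.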