Cuda runtime error (2): out of memory

I wrote some LSTM-based code for language modeling:

 def forward(self, input, hidden):
         emb = self.encoder(input)
         h, c = hidden

         seq_len = input.size(0)
         batch_size = input.size(1)
         output_dim = h.size(1)

         output = [] 
         for i in range(seq_len):        
             h, c = self.rnncell(emb[i], (h, c))
             # self.hiddens: time * batch * nhid
             if i == 0:
                 self.hiddens = h.unsqueeze(0)
             else:
                 self.hiddens = torch.cat([self.hiddens, h.unsqueeze(0)])
             # h: batch * nhid
             #self.att = h.unsqueeze(0).expand_as(self.hiddens)

             self.hiddens = self.hiddens.view(-1, self.nhid)
             b = torch.mm(self.hiddens, self.U).view(-1, batch_size, 1)
             a = torch.mm(h, self.W).unsqueeze(0).expand_as(b)
             att = torch.tanh(a + b).view(-1, batch_size)
             att = self.softmax(att.t()).t()
             self.hiddens = self.hiddens.view(-1, batch_size, self.nhid)
             att = att.unsqueeze(2).expand_as(self.hiddens)
             output.append(torch.sum(att * self.hiddens, 0))

         output = torch.cat(output)

         decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
         decoded = self.logsoftmax(decoded)
         output = decoded.view(output.size(0), output.size(1), decoded.size(1)) 
         return output, (h, c)

And I got this error in backward():

RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487343590888/work/torch/lib/THC/generic/

Any ideas why it might happen?

The memory goes to 5800MB very quickly in the first 10 batches, and then it keeps running with this much memory occupied for another several hundred batches, and then it runs out of memory.


Is there any reason why you’re keeping self.hiddens?

No, I don’t have to keep it. Is it a bad thing to keep unnecessary variables in the model?

If you keep Variables around, the graph that created them is kept around as well. Hence the elevated memory usage…


@ZeweiChu yes, it’s good practice to make your model stateless. It’s best if you only keep references to parameters, and all intermediate values generated in forward are not saved anywhere for extended periods of time.
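
A minimal sketch of what "stateless" could look like here, written against a recent PyTorch rather than the 0.1.9 API used in the thread. The class name StatelessCellLM and the sizes are made up for illustration, and the attention part of the original model is left out. The point is only that the per-step hidden states live in a local list instead of on self:

```python
import torch
import torch.nn as nn


class StatelessCellLM(nn.Module):
    """Hypothetical module: only parameters are stored on self."""

    def __init__(self, nhid):
        super().__init__()
        self.rnncell = nn.LSTMCell(nhid, nhid)

    def forward(self, emb, hidden):
        h, c = hidden
        hiddens = []  # local list, dropped when forward() returns
        for t in range(emb.size(0)):
            h, c = self.rnncell(emb[t], (h, c))
            hiddens.append(h)
        # Stack to time x batch x nhid; nothing is saved on the module.
        return torch.stack(hiddens), (h, c)


model = StatelessCellLM(nhid=8)
emb = torch.randn(5, 3, 8)  # seq_len x batch x nhid
h0 = torch.zeros(3, 8)
c0 = torch.zeros(3, 8)
out, (h, c) = model(emb, (h0, c0))
out.sum().backward()
```

Once the caller drops out, h and c, nothing else refers to the graph of that iteration.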


The main part of my code looks like this.

 def repackage_variable(v, volatile=False):
     return [Variable(torch.from_numpy(h), volatile=volatile).unsqueeze(1) for h in v]

 for k in range(len(minbatches)):
         minbatch = minbatches[perm[k]]
         x_padded = utils.make_mask(minbatch)
         x_padded = repackage_variable(x_padded, False)
         x_padded = torch.cat(x_padded, 1)
         T = x_padded.size(0)
         B = x_padded.size(1)
         inp = x_padded[:T-1, :].long()
         target = x_padded[1:, :].long().view(-1, 1)
         if use_cuda:
             inp = inp.cuda()
             target = target.cuda()       

         mask = (inp != 0).float().view(-1, 1)
         hidden = model.init_hidden(batch)
         output, hidden = model(inp, hidden)
         output = output.view(-1, n_vocab)
         loss = output.gather(1, target) * mask
         loss = -torch.sum(loss) / torch.sum(mask)


My question is, at each iteration, since all "Variable"s “inp” and “target” are overwritten, will the model state variables like “self.hiddens” also be overwritten? Does the old computation graph still exist in the next iteration?

nvidia-smi shows about 6 GB of memory in use, but I am only testing with a batch size of 50 and a sequence length of at most 200. Why would that take up so much memory? The usage also increases from time to time across iterations, although it can stay the same for a while. Any clues as to what the reason might be?

Won’t self.hiddens be cleaned after backward?

@ruotianluo how would they get cleaned up? It’s a reference. We’ll free most of the buffers, but I think there might still be some of them alive. This is going to change in the upcoming releases btw.

@ZeweiChu I can’t see anything wrong with your example. The only suggestion would be to convert the input into Variables as late as you can (e.g. do cat, type casts and copies on tensors not Variables). Maybe that’s how much memory your model requires. Are you sure it can even fit in memory?
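
A rough sketch of that ordering, assuming a recent PyTorch where Variable and Tensor have been merged, so only the order of operations carries over. The toy minibatch in rows is invented for the example. Concatenation, casts and slicing happen on plain tensors, and the move to the GPU is the last step before the model call:

```python
import torch

# Hypothetical padded minibatch: 3 sequences of length 4, 0 is padding.
rows = [[5, 2, 7, 0], [3, 9, 0, 0], [4, 4, 6, 1]]

# Concatenate and cast on plain tensors first (time x batch layout).
columns = [torch.tensor(r).unsqueeze(1) for r in rows]
x_padded = torch.cat(columns, 1).long()

inp = x_padded[:-1, :]
target = x_padded[1:, :].reshape(-1, 1)
mask = (inp != 0).float().view(-1, 1)

# Only now move the prepared data to the device the model lives on.
device = "cuda" if torch.cuda.is_available() else "cpu"
inp, target, mask = inp.to(device), target.to(device), mask.to(device)
```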


So, the reference should be cleaned up after self.hiddens is overwritten by next forward? Is it correct?

Yes. It won’t be kept there indefinitely, but it still can postpone some frees and increase the overall memory usage.
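
The reference behaviour can be illustrated without PyTorch at all. In this sketch Graph is a made-up stand-in for a Variable together with the graph it keeps alive, and the immediate release relies on CPython's reference counting:

```python
import weakref


class Graph:
    """Hypothetical stand-in for a Variable and the graph behind it."""


class Model:
    def forward(self):
        # Overwriting the attribute drops the reference to the old object.
        self.hiddens = Graph()
        return weakref.ref(self.hiddens)


model = Model()
first = model.forward()
# Between calls the model itself still holds the first object.
alive_before = first() is not None
second = model.forward()
# After the next forward() the attribute points at a new object.
alive_after = first() is not None
```

So the old object lives from the end of one forward until the attribute is reassigned in the next, which is the postponed free described above.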

Any progress on this one? I am facing a similar issue. I have implemented an LSTM, and the memory remains constant for about 9000 iterations, after which it runs out of memory. I am not keeping any references to the intermediate Variables.

I am running this on a 12GB Titan X GPU on a shared server.

Finally fixed it. There was a problem in my code: I was unaware that x = y[a:b] is not a deep copy of y. I was modifying x, which in turn modified y and increased the size of the data in every iteration. Using x = copy.deepcopy(y[a:b]) fixed it for me.
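
A small sketch of the view-versus-copy behaviour, assuming y is a torch tensor (the post does not say what type it was; NumPy arrays behave the same way for basic slices). It uses .clone() in place of copy.deepcopy, which should give an equally independent copy for a plain tensor:

```python
import torch

y = torch.zeros(6)
x_view = y[1:4]          # a view: shares storage with y
x_view += 1              # in-place change is visible through y as well

z = torch.zeros(6)
x_copy = z[1:4].clone()  # independent copy with its own storage
x_copy += 1              # z is left untouched
```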


So did you figure out why your memory usage keeps increasing? I had the exact same question as you did. Thanks.

How can I manually free the memory? For example, how would you clean up self.hidden here?

Why would modifying y increase the size of the data? I have similar problems.

It was suggested to me to do del self.hidden before return output.
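
A minimal sketch of that suggestion on a recent PyTorch. TinyModel is a made-up module for illustration, and a local variable would avoid the problem in the first place:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical module that stores an intermediate, then drops it."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        self.hidden = torch.tanh(self.linear(x))
        out = self.hidden.sum(1)
        # Remove the attribute so the module no longer references the graph.
        del self.hidden
        return out


model = TinyModel()
out = model(torch.randn(2, 4))
out.sum().backward()  # the graph is still reachable through out
```

After the del, the only remaining path to the intermediate is through out, so it can be released as soon as out goes away.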