I wrote some LSTM-based code for language modeling:
def forward(self, input, hidden):
    emb = self.encoder(input)
    h, c = hidden
    h.data.squeeze_(0)
    c.data.squeeze_(0)
    seq_len = input.size(0)
    batch_size = input.size(1)
    output_dim = h.size(1)
    output = []
    for i in range(seq_len):
        h, c = self.rnncell(emb[i], (h, c))
        # self.hiddens: time * batch * nhid
        if i == 0:
            self.hiddens = h.unsqueeze(0)
        else:
            self.hiddens = torch.cat([self.hiddens, h.unsqueeze(0)])
        # h: batch * nhid
        #self.att = h.unsqueeze(0).expand_as(self.hiddens)
        self.hiddens = self.hiddens.view(-1, self.nhid)
        b = torch.mm(self.hiddens, self.U).view(-1, batch_size, 1)
        a = torch.mm(h, self.W).unsqueeze(0).expand_as(b)
        att = torch.tanh(a + b).view(-1, batch_size)
        att = self.softmax(att.t()).t()
        self.hiddens = self.hiddens.view(-1, batch_size, self.nhid)
        att = att.unsqueeze(2).expand_as(self.hiddens)
        output.append(torch.sum(att * self.hiddens, 0))  # hidden.data
    output = torch.cat(output)
    decoded = self.decoder(output.view(output.size(0) * output.size(1), output.size(2)))
    decoded = self.logsoftmax(decoded)
    output = decoded.view(output.size(0), output.size(1), decoded.size(1))
    return output, (h, c)
And I got an error in backward():
RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.9_1487343590888/work/torch/lib/THC/generic/THCStorage.cu:66
Any ideas why it might happen?
Memory usage climbs to 5800MB very quickly in the first 10 batches, then stays at that level for another several hundred batches, and then it runs out of memory.
@ZeweiChu yes, it’s good practice to make your model stateless. It’s best if you only keep references to parameters, and make sure the intermediate values generated in forward are not saved anywhere for extended periods of time.
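For illustration, a minimal sketch of what a stateless forward could look like: the hidden states go into a local list instead of `self.hiddens`, so the graph they carry is freed as soon as forward returns. The `attention_pool` scoring here (a plain dot product) is a hypothetical stand-in, not the `U`/`W` attention from the code above.

```python
import torch
import torch.nn as nn

def attention_pool(hiddens_list, h):
    # Hypothetical stand-in for the attention step: score each past
    # hidden state against the current one with a dot product.
    hiddens = torch.stack(hiddens_list)                # time x batch x nhid
    att = torch.softmax((hiddens * h).sum(-1), dim=0)  # time x batch
    return (att.unsqueeze(2) * hiddens).sum(0)         # batch x nhid

def forward_stateless(cell, emb, hidden):
    h, c = hidden
    hiddens = []  # local list, not self.hiddens: nothing outlives this call
    outputs = []
    for t in range(emb.size(0)):
        h, c = cell(emb[t], (h, c))
        hiddens.append(h)
        outputs.append(attention_pool(hiddens, h))
    return torch.stack(outputs), (h, c)
```

The only structural change from the posted code is that no intermediate tensor is stored on `self`, so each batch's graph becomes collectable after backward().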
def repackage_variable(v, volatile=False):
    return [Variable(torch.from_numpy(h), volatile=volatile).unsqueeze(1) for h in v]

for k in range(len(minbatches)):
    minbatch = minbatches[perm[k]]
    x_padded = utils.make_mask(minbatch)
    x_padded = repackage_variable(x_padded, False)
    x_padded = torch.cat(x_padded, 1)
    T = x_padded.size(0)
    B = x_padded.size(1)
    inp = x_padded[:T-1, :].long()
    target = x_padded[1:, :].long().view(-1, 1)
    if use_cuda:
        inp = inp.cuda()
        target = target.cuda()
    mask = (inp != 0).float().view(-1, 1)
    hidden = model.init_hidden(batch)
    model.zero_grad()
    #print(inp.size())
    output, hidden = model(inp, hidden)
    output = output.view(-1, n_vocab)
    loss = output.gather(1, target) * mask
    loss = -torch.sum(loss) / torch.sum(mask)
    loss.backward()
    optimizer.step()
My question is: since the Variables “inp” and “target” are overwritten at each iteration, will model state variables like “self.hiddens” also be overwritten? Does the old computation graph still exist in the next iteration?
nvidia-smi shows that about 6GB of memory is used, but I am only testing with a batch size of 50 and sequences of at most 200 steps; why would it take up so much memory? Also, the memory usage increases across iterations from time to time, though it can stay flat for a while. Any clues what the reason might be?
@ruotianluo how would they get cleaned up? It’s a reference. We’ll free most of the buffers, but I think there might still be some of them alive. This is going to change in the upcoming releases btw.
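One common way to stop such references from pinning old graphs is to detach the hidden state at batch boundaries, as the bundled word_language_model example does with its `repackage_hidden` helper; a sketch of that pattern:

```python
import torch

def repackage_hidden(h):
    """Detach the hidden state from the previous batch's graph, so
    backward() stops at the batch boundary and old graphs can be freed."""
    if isinstance(h, tuple):
        return tuple(repackage_hidden(v) for v in h)
    return h.detach()
```

Calling `hidden = repackage_hidden(hidden)` at the top of each training iteration keeps the state values while dropping their history.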
@ZeweiChu I can’t see anything wrong with your example. The only suggestion would be to convert the input into Variables as late as you can (e.g. do cat, type casts and copies on tensors, not Variables). Maybe that’s just how much memory your model requires. Are you sure it can even fit in memory?
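To illustrate the "wrap late" suggestion, here is a hypothetical `prepare_batch` helper: all the concatenation and casting happens on plain tensors, and the Variable is created only at the very end, so no graph bookkeeping is attached to the preprocessing steps.

```python
import torch
from torch.autograd import Variable

def prepare_batch(rows):
    # Do the cat / type casts on plain tensors first...
    cols = [torch.LongTensor(r).unsqueeze(1) for r in rows]
    x = torch.cat(cols, 1)  # T x B, still a plain tensor
    # ...and wrap in a Variable only at the very end.
    return Variable(x)
```

This is the opposite order of the `repackage_variable` loop above, which wraps each column in a Variable before the cat.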
Any progress on this one? I am facing a similar issue. I have implemented an LSTM and the memory remains constant for about 9000 iterations after which it runs out of memory. I am not keeping any references of the intermediate Variables.
I am running this on a 12GB Titan X GPU on a shared server.
Finally fixed it. There was a problem in my code. I was unaware that x = y[a:b] is not a copy of y but a view into it. I was modifying x, and in turn modifying y, increasing the size of the data in every iteration. Using x = copy.deepcopy(y[a:b]) fixed it for me.
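For anyone hitting the same thing: in PyTorch (as in NumPy), basic slicing returns a view that shares storage with the original tensor, and `.clone()` is a lighter way than `copy.deepcopy` to get an independent copy. A quick demonstration:

```python
import torch

y = torch.zeros(4)
x = y[0:2]            # a view: x shares y's storage
x += 1                # in-place change shows up in y too
assert y[0].item() == 1.0

x2 = y[0:2].clone()   # an independent copy
x2 += 1               # ...so this no longer touches y
assert y[0].item() == 1.0
```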