Minimal code snippet that seems to cause a memory leak on GPU code only

Sorry for spawning yet another memory leak thread. I’ve gone through the previous ones and didn’t find that they were the same issue (or version). But perhaps I’m mistaken.

In relation to a few previous posts I’ve made (specifically on training seq2seq models, and the fact that LSTMCells aren’t CUDA-enabled), I’ve reached a point where I have to feed a sequence one step at a time through an LSTM layer to generate my seq2seq output. That is the motivation for the code below, which corresponds to the decoder half of a VRAE.

So I’ve sprung a dreaded memory leak, and I’m not sure why. Here’s my minimal code:

lstm = nn.LSTM( 5, 512, 2 ).double().cuda()
ll = nn.Linear( 512, 5 ).double().cuda()
h_t = Variable( torch.cuda.DoubleTensor(2, 1, 512) , requires_grad=False).cuda()
c_t = Variable( torch.cuda.DoubleTensor(2, 1, 512) , requires_grad=False).cuda()
out = Variable( torch.cuda.DoubleTensor(1, 1, 5 ), requires_grad=False).cuda()
out, (h_t, c_t) = lstm( out , (h_t, c_t) ) # <- warmup the run - first memory reading
out = ll(out.squeeze(1)).unsqueeze(1)
for i in range( 200 ):
    out, (h_t, c_t) = lstm( out , (h_t, c_t) )
    out = ll(out.squeeze(1)).unsqueeze(1)
    print( "%d %d" % (i,

This code works without a hitch on the CPU.

On the GPU, the process starts at 189MiB of GPU memory, reaches 413MiB at the first checkpoint, and then produces the following output:

1 780961
2 803360
66 886347
67 903248
68 921229
THCudaCheck FAIL file=/py/conda-bld/pytorch_1490980628440/work/torch/lib/THC/generic/ line=66 error=2 : out of memory
Traceback (most recent call last):
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/IPython/core/", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-27-226b10408002>", line 2, in <module>
    out, (h_t, c_t) = test_lstm( out , (h_t, c_t) )
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/nn/modules/", line 91, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/", line 327, in forward
    return func(input, *fargs, **fkwargs)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/autograd/", line 202, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/autograd/", line 224, in forward
    result = self.forward_extended(*nested_tensors)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/", line 269, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/usr/bin/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/", line 247, in forward
    fn.weight_buf =
RuntimeError: cuda runtime error (2) : out of memory at /py/conda-bld/pytorch_1490980628440/work/torch/lib/THC/generic/

At this point, I’m pegged at 4GiB of memory, all within a span of 140ms.

Nvidia driver version is 375.39,
nvcc is 8.0, V8.0.61
pytorch 0.1.11+27fb875

Moreover, nothing I do at this point frees that memory and I have to respawn my process.

Edit: running this with the Variables marked as volatile instead of requires_grad=False doesn’t cause the memory problem.

My hope is that I’m doing something stupid. Let me know if you have any questions about setup.

It’s a known issue with the nn.LSTM module (and nn.GRU too). It’s not a leak, but these modules use too much memory when used to iterate over inputs one step at a time. You should use LSTMCell instead. It works with CUDA too and should be reasonably fast.
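As a rough illustration of the LSTMCell approach, here is a minimal sketch of the same decoding loop. It assumes a recent PyTorch release (plain tensors instead of Variable), and the names cells and state are made up for this example. Two stacked cells stand in for the two-layer nn.LSTM from the snippet above.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two stacked cells stand in for nn.LSTM(5, 512, 2); `cells` is a hypothetical name.
cells = nn.ModuleList([nn.LSTMCell(5, 512), nn.LSTMCell(512, 512)]).double().to(device)
ll = nn.Linear(512, 5).double().to(device)

# One (h, c) pair per layer, batch size 1.
state = [
    (torch.zeros(1, 512, dtype=torch.double, device=device),
     torch.zeros(1, 512, dtype=torch.double, device=device))
    for _ in cells
]
out = torch.zeros(1, 5, dtype=torch.double, device=device)

for i in range(200):
    x = out
    for layer, cell in enumerate(cells):
        h, c = cell(x, state[layer])
        state[layer] = (h, c)
        x = h  # the hidden state of one layer is the input to the next
    out = ll(x)  # project back to the input size for the next step
```

Note that the cells work on 2-D (batch, features) tensors, so the squeeze/unsqueeze calls from the original loop are not needed here.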

Alternatively, you can disable cuDNN using torch.backends.cudnn.enabled = False.
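A minimal sketch of both ways to turn cuDNN off, assuming a recent PyTorch release: the global switch mentioned above, and the scoped torch.backends.cudnn.flags context manager, which restores the previous setting on exit. The tensor sizes just mirror the snippet in the question.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(5, 512, 2).double().to(device)
x = torch.zeros(1, 1, 5, dtype=torch.double, device=device)

# Global switch: every later call uses the non-cuDNN RNN implementation.
torch.backends.cudnn.enabled = False
out, (h_t, c_t) = lstm(x)

# Re-enable, then disable cuDNN only for the calls inside the block.
torch.backends.cudnn.enabled = True
with torch.backends.cudnn.flags(enabled=False):
    out, (h_t, c_t) = lstm(x)
```

On a machine without a GPU the switch has no visible effect, but the code still runs.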

Ahh, thanks (and drats).

Following this thread, I had decided not to make use of LSTMCells, but I guess I need to go back on that decision.

Before I go too far down this track: is it possible or even supported to swap parameters of an LSTM in and out of LSTMCells on the fly?

Thanks for your time.

Edit: I just realized that an even simpler solution to this whole problem is probably to transfer my model back and forth to the CPU using pinned memory. That would be a way better solution than rearchitecting everything.

Hi @MBlah, sorry to revive this after a long time, but I seem to be facing the same problem. Could you give a snippet showing how you transfer your model back and forth?

I ended up changing my architecture specifically to work around this issue.

For inference mode, I simply use volatile=True and this solves everything.
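In later PyTorch releases (0.4 onward) the volatile flag was removed, and torch.no_grad() serves the same purpose. A minimal sketch of the decoding loop under that assumption, reusing the sizes from my original snippet:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(5, 512, 2).double().to(device)
ll = nn.Linear(512, 5).double().to(device)

h_t = torch.zeros(2, 1, 512, dtype=torch.double, device=device)
c_t = torch.zeros(2, 1, 512, dtype=torch.double, device=device)
out = torch.zeros(1, 1, 5, dtype=torch.double, device=device)

# No autograd graph is recorded inside this block, so intermediate
# buffers from earlier steps are not kept alive for a backward pass.
with torch.no_grad():
    for i in range(200):
        out, (h_t, c_t) = lstm(out, (h_t, c_t))
        out = ll(out.squeeze(1)).unsqueeze(1)
```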

Otherwise, you can move your model using .cpu() and .cuda() calls.

E.g. if seq is a class module, then you can do:

output = seq(input)

Furthermore, you can call pin_memory() on the input tensors to facilitate transfers back and forth.
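Roughly, the pattern looks like the sketch below. This is not the exact code I used: it assumes a recent PyTorch release, and seq and inp are hypothetical stand-ins (here just a plain nn.LSTM and a zero tensor).

```python
import torch
import torch.nn as nn

seq = nn.LSTM(5, 512, 2)     # hypothetical stand-in for the `seq` module
inp = torch.zeros(1, 1, 5)   # hypothetical input, created on the CPU

if torch.cuda.is_available():
    inp = inp.pin_memory()   # page-locked host memory speeds up host-to-device copies
    seq.cuda()               # move the module's parameters to the GPU
    output, _ = seq(inp.cuda(non_blocking=True))
    output = output.cpu()    # bring the result back to host memory
    seq.cpu()                # move the parameters back to host memory
else:
    output, _ = seq(inp)     # CPU-only fallback so the sketch still runs
```

The repeated transfers are what make this hacky: every round trip copies all of the parameters across the bus.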

It’s very hacky, and in the end, I found an alternate way of training my model.

@apaszke Thanks for the hint. Is there any other reference or link about nn.LSTM’s memory issue when cuDNN is enabled? I was also struggling with an OOM issue until I read this post…