GRU crashing after 31 training steps

[UPDATE: I tried setting torch.backends.cudnn.enabled to False and my code still crashes, so I guess it isn’t a cuDNN issue.]

I just started using the cuDNN GRU, and I have been encountering some weird behavior. First, I was seeing the problem reported here: Cudnn_status_execution_failed (the first error reported by @Oana, not the second). Reducing the batch size fixed that, but then on the 32nd training step I would suddenly run out of memory, even though nvidia-smi reported over a gigabyte of free memory right up until then. (However, “nvidia-smi dmon” usually showed close to 100% memory usage. I am actually quite curious why that is, but it is probably not the problem.) I then tried a shorter sequence to reduce memory usage further, and it still crashes on the 32nd training step, but instead of an out-of-memory error I get this:

step 29: loss 7.40 (10.82 smoothed)
step 30: loss 8.24 (10.73 smoothed)
step 31: loss 8.69 (10.67 smoothed)
Traceback (most recent call last):
  File "train_spectral_model.py", line 49, in <module>
    output = spectral_model(input, h0)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/repos/aud0/spectral_model.py", line 40, in forward
    output, _ = self.rnn(input, h)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 305, in forward
    ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
RuntimeError: invalid argument 2: out of range at /home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCTensor.c:23

Any ideas?

In case it is useful, when I use the longer sequence and it runs out of memory, the traceback looks like this:

step 29: loss 8.15 (10.91 smoothed)
step 30: loss 8.94 (10.84 smoothed)
step 31: loss 9.41 (10.79 smoothed)
THCudaCheck FAIL file=/home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train_spectral_model.py", line 49, in <module>
    output = spectral_model(input, h0)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/repos/aud0/spectral_model.py", line 40, in forward
    output, _ = self.rnn(input, h)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 291, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
RuntimeError: cuda runtime error (2) : out of memory at /home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCStorage.cu:66

Your first error still looks like an out-of-memory error that isn’t being caught properly. What likely happens is that the workspace tensor cannot be allocated at https://github.com/pytorch/pytorch/blob/master/torch/backends/cudnn/rnn.py#L280-L281, and then, when workspace.size(0) is called, the out-of-range error is reported.
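For comparison, here is roughly what a properly surfaced out-of-memory error looks like when an allocation fails directly (illustration only; the size is deliberately absurd, and this is not the code path inside cudnn/rnn.py):

    import torch

    # Illustration only: an ordinary CUDA allocation that fails raises a readable
    # out-of-memory RuntimeError, unlike the "invalid argument 2: out of range"
    # above, where the failed workspace allocation is not surfaced directly.
    try:
        workspace = torch.cuda.FloatTensor(int(1e13))  # tens of terabytes; will fail
    except RuntimeError as e:
        print("allocation failed:", e)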
Usually, if you can get through one iteration, memory usage should not grow, so I’m not sure what is happening. Are you properly detaching all your variables so that the backpropagation graph does not grow indefinitely?
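If you do carry the hidden state across batches, the usual fix is to detach it at each step boundary. A minimal sketch (the model, sizes, and loss below are made up for illustration, not the code from this thread):

    import torch
    from torch import nn

    # Minimal sketch of carrying a hidden state across steps with detach():
    # detaching at the step boundary stops gradients from flowing back into
    # earlier steps, so their graphs can be freed and memory usage stays flat.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    rnn = nn.GRU(input_size=64, hidden_size=128, num_layers=2).to(device)
    opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
    h = torch.zeros(2, 8, 128, device=device)  # (num_layers, batch, hidden)

    for step in range(100):
        x = torch.randn(50, 8, 64, device=device)  # (seq_len, batch, input), stand-in data
        output, h = rnn(x, h)
        loss = output.pow(2).mean()                # stand-in loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h.detach()  # cut the graph here; otherwise step N's graph stays reachable from step N+1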

Thanks. Possibly that is the issue. I am not calling detach, but I just throw away the final hidden state and then run the RNN on a fresh sequence (starting from a fixed initial hidden state that currently has requires_grad set to False, although I don’t think that should matter), so I don’t think I need to call detach. Also, if I were simply not freeing memory properly, it is confusing that it always happens on the 32nd iteration even when I change parameters like batch size and sequence length, and that nvidia-smi shows constant memory usage from the first iteration right up until the crash.
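For concreteness, the loop is shaped roughly like this (a simplified sketch with made-up model and sizes, not the actual train_spectral_model.py):

    import torch
    from torch import nn

    # Simplified sketch of the loop described above: every step uses the same
    # fixed h0 and throws away the final hidden state, so no graph links one
    # step to the next and detach() should indeed be unnecessary.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    rnn = nn.GRU(input_size=64, hidden_size=128).to(device)
    opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
    h0 = torch.zeros(1, 8, 128, device=device)     # fixed initial state, requires_grad=False

    for step in range(100):
        x = torch.randn(50, 8, 64, device=device)  # fresh sequence every step
        output, _ = rnn(x, h0)                      # final hidden state discarded
        loss = output.pow(2).mean()                 # stand-in loss
        opt.zero_grad()
        loss.backward()
        opt.step()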

Also, incidentally, can you explain why nvidia-smi dmon often shows memory usage close to 100% while plain nvidia-smi shows much lower usage? I don’t think this has anything to do with my problem, since I have observed it in other situations as well, but I still find it a mystery.

You are right, in your case you don’t have to detach. Sorry, I don’t know why dmon is showing different memory usage than regular nvidia-smi.

@greaber, if you can give a repro script, I will get this investigated / fixed.

I investigated further, and the issue seems to be that after 31 training steps the model would always get an extra-long sequence and run out of memory, just as @ngimel initially surmised (a sketch of one possible guard against such sequences follows the error messages below). This took me longer to realize than it should have, in part because PyTorch doesn’t reliably give an error message that says “out of memory” when this happens. Depending on the details of the model and input, it can give other error messages like

torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

and

RuntimeError: invalid argument 2: out of range at /home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCTensor.c:23
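In case it helps anyone hitting the same thing: a crude guard is to cap or split very long sequences before they reach the GRU. A minimal sketch (split_long_sequence and max_len are made up for illustration, not part of the actual training script):

    import torch

    def split_long_sequence(x, max_len=2000):
        """Yield chunks of at most max_len timesteps from a (seq_len, batch, features) tensor.
        max_len is a made-up budget; tune it to whatever reliably fits on the GPU."""
        for start in range(0, x.size(0), max_len):
            yield x[start:start + max_len]

    # hypothetical usage inside the training loop, so that no single forward pass
    # has to hold an entire extra-long sequence in memory:
    #
    #     for chunk in split_long_sequence(input, max_len=2000):
    #         output = spectral_model(chunk, h0)
    #         ...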