[UPDATE: I tried setting torch.backends.cudnn.enabled to False and my code still crashes, so I guess it isn't a cuDNN issue.]
I just started using the cuDNN GRU and have been running into some weird behavior. First, I was seeing the problem reported here: Cudnn_status_execution_failed (the first error reported by @Oana, not the second). Reducing the batch size fixed that, but then on the 32nd training step I would suddenly run out of memory, even though nvidia-smi reported over a gigabyte of free memory up until then. (However, "nvidia-smi dmon" usually showed 100% memory usage. I am quite curious why that is, but I guess it is probably not the problem.)
I then tried a shorter sequence to reduce memory usage even further. It still crashes on the 32nd training step, but instead of an out-of-memory error I now get this:
step 29: loss 7.40 (10.82 smoothed)
step 30: loss 8.24 (10.73 smoothed)
step 31: loss 8.69 (10.67 smoothed)
Traceback (most recent call last):
  File "train_spectral_model.py", line 49, in <module>
    output = spectral_model(input, h0)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/repos/aud0/spectral_model.py", line 40, in forward
    output, _ = self.rnn(input, h)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 305, in forward
    ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
RuntimeError: invalid argument 2: out of range at /home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCTensor.c:23
Any ideas?
In case it is useful: when I use the longer sequence and it runs out of memory instead, the traceback looks like this:
step 29: loss 8.15 (10.91 smoothed)
step 30: loss 8.94 (10.84 smoothed)
step 31: loss 9.41 (10.79 smoothed)
THCudaCheck FAIL file=/home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train_spectral_model.py", line 49, in <module>
    output = spectral_model(input, h0)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/repos/aud0/spectral_model.py", line 40, in forward
    output, _ = self.rnn(input, h)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 162, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 351, in forward
    return func(input, *fargs, **fkwargs)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 284, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 306, in forward
    result = self.forward_extended(*nested_tensors)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 293, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/home/grant/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 291, in forward
    fn.reserve = torch.cuda.ByteTensor(reserve_size.value)
RuntimeError: cuda runtime error (2) : out of memory at /home/grant/pubrepos/pytorch/torch/lib/THC/generic/THCStorage.cu:66
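Since watching nvidia-smi by hand may miss a short spike right before step 32, one way to narrow this down might be to log GPU memory from inside the training loop on every step. Below is a minimal sketch of that idea, assuming nvidia-smi is on the PATH. The helper names parse_gpu_memory, query_gpu_memory and format_step_memory are hypothetical (they are not part of PyTorch or my training script), and the numbers in the usage comment are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows in MiB (one row per GPU) into (used, total) int pairs."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for current memory use; returns [] if the tool cannot be run."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.check_output(cmd).decode()
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(out)


def format_step_memory(step, rows):
    """Build one log line per training step, listing used/total MiB for each GPU."""
    parts = ["gpu%d %d/%d MiB" % (i, used, total)
             for i, (used, total) in enumerate(rows)]
    return "step %d: %s" % (step, ", ".join(parts) if parts else "no GPU data")


# Hypothetical usage inside the training loop:
#     print(format_step_memory(step, query_gpu_memory()))
```

Note that nvidia-smi reports memory held by the whole process, which includes blocks PyTorch's caching allocator has reserved but is not currently using, so these numbers are an upper bound on what the model itself needs.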