CUDNN_STATUS_NOT_INITIALIZED error when using batch size > 1


(Hm2092) #1

Hi everyone,
I ran into RuntimeError: CUDNN_STATUS_NOT_INITIALIZED while trying a batch_size > 1 for training (and also validation). My system details are:
Cuda version: 9.0.176
Cudnn version: 7102
Pytorch version: 0.4.0
GPU: GTX 1080 Ti
Driver version: 390.77
OS: Ubuntu 16.04

With a batch_size of 1, the training loop works fine. The problem only arises when I increase the batch size. The detailed traceback is given below:

<ipython-input-20-8f012bcb5dc5> in forward(self, x)
      9 
     10     def forward(self, x):
---> 11         x = self.conv3d_1(x)
     12         x = self.conv3d_2(x)
     13         x = self.conv3d_3(x)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
     89     def forward(self, input):
     90         for module in self._modules.values():
---> 91             input = module(input)
     92         return input
     93 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in forward(self, input)
    419     def forward(self, input):
    420         return F.conv3d(input, self.weight, self.bias, self.stride,
--> 421                         self.padding, self.dilation, self.groups)
    422 
RuntimeError: CUDNN_STATUS_NOT_INITIALIZED

I noticed that in this related issue (CUDNN_STATUS_NOT_INITIALIZED when using cnn) the error also occurred inside a conv layer: a Conv1d layer there, and a Conv3d layer in my case.
Any help would be great. Thank you in advance.


(Hm2092) #2

I found the solution. I was using a custom collate function for the DataLoader, and one of its variables was initialized incorrectly. Right before the Conv3d layer, I was copying a tensor index-wise into another tensor, using the indices stored in that wrongly initialized variable. So the real problem was an out-of-range index, and the CUDNN_STATUS_NOT_INITIALIZED message was just a misleading symptom.
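To make the failure mode concrete, here is a minimal sketch of that kind of bug using plain Python lists instead of tensors (all names here are illustrative, not from my actual code). The same out-of-range gather on the GPU executes asynchronously, so the error only surfaces later, at the next CUDA/cuDNN call such as the Conv3d forward:

```python
def collate(batch, offsets):
    # batch: list of per-sample lists; offsets: indices into the
    # flattened batch buffer. If offsets was initialized assuming
    # batch_size == 1, larger batches make it point past the end of
    # the buffer -- the GPU analogue of this out-of-range copy is
    # what surfaced as CUDNN_STATUS_NOT_INITIALIZED.
    flat = [x for sample in batch for x in sample]
    return [flat[i] for i in offsets]

batch = [[1, 2], [3, 4]]

# Correct: offsets derived from the actual batch contents.
good_offsets = list(range(sum(len(s) for s in batch)))
print(collate(batch, good_offsets))

# Buggy: one stale index too many -> out-of-range access.
bad_offsets = [0, 1, 2, 3, 4]
try:
    collate(batch, bad_offsets)
except IndexError:
    print("out-of-range index, as in the broken collate function")
```

On the GPU the equivalent bad index does not raise a clean IndexError at the copy itself; setting CUDA_LAUNCH_BLOCKING=1 can help pin the error to the operation that actually caused it.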