CUDNN_STATUS_NOT_INITIALIZED error when using batch size > 1

Hi Everyone.
I ran into RuntimeError: CUDNN_STATUS_NOT_INITIALIZED while trying a batch_size > 1 for training (and also validation). My system details are:
CUDA version: 9.0.176
cuDNN version: 7102
PyTorch version: 0.4.0
GPU: GTX 1080 Ti
Driver version: 390.77
OS: Ubuntu 16.04

With a batch_size of 1, the training loop works fine. This problem arises when I increase the batch size. Detailed traceback is given below:

<ipython-input-20-8f012bcb5dc5> in forward(self, x)
     10     def forward(self, x):
---> 11         x = self.conv3d_1(x)
     12         x = self.conv3d_2(x)
     13         x = self.conv3d_3(x)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in forward(self, input)
     89     def forward(self, input):
     90         for module in self._modules.values():
---> 91             input = module(input)
     92         return input

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
    489             result = self._slow_forward(*input, **kwargs)
    490         else:
--> 491             result = self.forward(*input, **kwargs)
    492         for hook in self._forward_hooks.values():
    493             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/ in forward(self, input)
    419     def forward(self, input):
    420         return F.conv3d(input, self.weight, self.bias, self.stride,
--> 421                         self.padding, self.dilation, self.groups)

I noticed that the same error was reported in this issue (CUDNN_STATUS_NOT_INITIALIZED when using cnn). There it occurred in a conv1d layer; in my case it occurs in a conv3d layer.
Any help would be great. Thank you in advance.

I found the solution. I was using a custom collate function for the dataloader, and one of the variables was initialized incorrectly. Right before the conv3d layer, I was copying a tensor (index-wise) into another one, based on the indices in the wrongly initialized variable. So it was really more of an “out of range” error, and I was confused by the CUDNN_STATUS_NOT_INITIALIZED error message.

@hm2092 thank you for sharing your error - I think I am experiencing something very similar, where the cuDNN message only happens for batch size > 1. Could you explain more about how you tracked down the location of the “out of range” error? My DataLoader + custom collate_fn is fairly straightforward, and there are no issues preparing the batches.

On a separate note, why did PyTorch only implement a Dataset class for COCO, but not a DataLoader and/or collate function? It seems this is standard enough that plenty of other people are using it…

import random
import re

from spellchecker import SpellChecker
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import CocoCaptions

class CocoCollate:
    """Custom collate function for COCO captions dataset

    :param dict stoi: Mapping defining integer values for each word in the vocabulary
    :param int padding_idx: Value used to pad captions to equal length
    """

    def __init__(self, stoi, padding_idx=-1):
        self.stoi = stoi
        self.unk = len(self.stoi)  # Index reserved for out-of-vocabulary words
        self.padding_idx = padding_idx

    def __call__(self, batch):
        """Randomly sample caption and numericalize using mapping

        :param tuple(torch.tensor, [[str]]) batch: 3-channel image data and nested
            lists of potential captions (5 per sample)
        :return tuple(torch.tensor): Batched images and numericalized captions
        """
        images, captions = zip(*batch)
        spell = SpellChecker(distance=1)
        # Pick one caption per sample, wrap it in <s> ... </s>, tokenize it,
        # and spell-correct any token missing from the vocabulary
        selected = [[word if word in self.stoi else spell.correction(word)
                     for word in re.findall(r"[<>\w']+|[.,!?;]", pick)]
                    for pick in [f'<s> {c[random.randrange(5)].lower()} </s>' for c in captions]]
        # Map tokens to indices; anything still unknown falls back to self.unk
        numericalized = [torch.tensor([self.stoi.get(word, self.unk) for word in s]) for s in selected]
        # Pad captions to the longest one in the batch so they form a single tensor
        packed = nn.utils.rnn.pad_sequence(numericalized, batch_first=True, padding_value=self.padding_idx)
        return torch.stack(images), packed

transform = transforms.Compose([
    transforms.ToTensor(),  # Normalize operates on tensors, not PIL images
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])
trainset = CocoCaptions('/path/to/images', '/path/to/captions', transform)
vocabulary = [ ... ]
stoi = {word: idx for idx, word in enumerate(vocabulary)}
loader = DataLoader(trainset, batch_size=2, collate_fn=CocoCollate(stoi))

Hi @addisonklinke. In my case I made a mistake with the index corresponding to the batch size. Whenever I used a batch size > 1 (rarely, because of GPU memory restrictions), I could not copy the whole tensor into the new one, which resulted in the error. But my traceback pointed to the conv3d layer (as shown in my post) instead of the copy operation 2 lines before it, which made it confusing.
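A minimal sketch of one way to locate this kind of bug, under stated assumptions: CUDA kernels are launched asynchronously, so an invalid index in one operation can be reported at a later call such as a cuDNN convolution. Repeating the suspect step on CPU tensors usually raises at the line that is actually at fault; setting the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script is another option. The function name, tensor names, and shapes below are hypothetical and are not taken from the model in this thread.

```python
import torch


def copy_by_index(src, indices):
    """Hypothetical stand-in for the index-wise copy that ran before conv3d."""
    return src[indices]


src = torch.randn(2, 4)             # batch of 2 samples, kept on CPU on purpose
bad_indices = torch.tensor([0, 2])  # 2 is out of range for a batch of 2

caught = False
try:
    copy_by_index(src, bad_indices)
except (IndexError, RuntimeError) as err:
    # On CPU the failure is raised by the indexing operation itself,
    # not by some later layer in the forward pass
    caught = True
    print("indexing failed:", err)

# With in-range indices the same call succeeds
ok = copy_by_index(src, torch.tensor([0, 1]))
```

Once the failing line is known on CPU, the same code can be moved back to the GPU.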

For your second question: I haven’t used PyTorch > 0.4 or the COCO dataset, so I am not sure about it. One thing I do know is that the default collate function (PyTorch 0.4) calls torch.stack, which expects the input tensors to have the same shape (I had input tensors of different shapes). So if COCO has images of the same shape, then there is no need for a custom collate function. I am familiar neither with the captions in your case nor with the latest PyTorch releases, sorry.
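As a small illustration of the torch.stack constraint (a sketch with made-up 1-D tensors, not data from either of our pipelines): stacking tensors of different lengths fails, while padding them to a common length first works.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.ones(3)
b = torch.ones(5)

# torch.stack requires every tensor to have the same shape
try:
    torch.stack([a, b])
    stacked_ok = True
except RuntimeError:
    stacked_ok = False

# Padding the shorter tensor up to the longest one gives a regular batch
padded = pad_sequence([a, b], batch_first=True, padding_value=0)
print(stacked_ok, padded.shape)
```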

@hm2092 If it were just the COCO images, then you are correct that the default collate function would work fine. The custom one is needed to handle the captions. The __getitem__ method of the built-in CocoCaptions dataset class returns the image and a list of strings, so my collate function converts the strings to word indices and pads them to equal length so that they can be batched into a single tensor.

This is with PyTorch 1.4.0 installed through conda, so I may try downgrading to 1.2 or 1.3.

Edit: I found out my issue was passing an out-of-bounds index (-1 in my case) to nn.Embedding. Previous discussion from other users about embedding-related errors can be found in this thread.
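A minimal sketch of the problem and one possible fix, run on CPU so the error is reported at the embedding call. The vocabulary size, embedding dimension, and the choice of len(stoi) + 1 as the padding index are assumptions for illustration; indices_in_range is a hypothetical helper, not part of PyTorch.

```python
import torch
import torch.nn as nn

vocab_size = 10           # hypothetical len(stoi)
unk_idx = vocab_size      # mirrors self.unk = len(self.stoi) in the collate class
pad_idx = vocab_size + 1  # assumed non-negative replacement for padding_idx=-1

# Two extra rows: one for unknown words, one for padding
embedding = nn.Embedding(vocab_size + 2, 8, padding_idx=pad_idx)


def indices_in_range(indices, num_embeddings):
    """Hypothetical check: True if every index is a valid row of the table."""
    return bool(((indices >= 0) & (indices < num_embeddings)).all())


bad = torch.tensor([[1, 2, -1]])        # padded with -1, as in my collate function
good = torch.tensor([[1, 2, pad_idx]])  # padded with a valid index instead

try:
    embedding(bad)
    bad_failed = False
except (IndexError, RuntimeError):
    bad_failed = True

out = embedding(good)
print(bad_failed, out.shape)
```

Calling a check like indices_in_range on each batch before the forward pass would flag the bad padding value before it reaches the GPU.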
