cuDNN error when using convolution layer

Vendrick17 · April 2, 2020, 7:46am

Hi everyone, I get an error message when I try to run a convolutional layer. The layer is defined as follows:

self.l_conv = nn.Conv2d(in_channels=self.arch['out_channels'][-1],out_channels=channels,kernel_size=3,padding=1)

Part of the error message is the following:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-b5acd2999739> in <module>()
     34         fake_label = torch.randint(0,num_classes,(n_batch,))
     35         fake_label = Variable(fake_label.cuda())
---> 36         G_out = netG(in_noise,fake_label)
     37 
     38         D_out_fake = netD(G_out,fake_label).squeeze()

<ipython-input-12-09ab6878f79c> in forward(self, z, y)
     65         out = self.bn(h)
     66         out = self.act(h)
---> 67         out = self.l_conv(h)
     68         out = torch.tanh(out)
     69 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in forward(self, input)
    343 
    344     def forward(self, input):
--> 345         return self.conv2d_forward(input, self.weight)
    346 
    347 class Conv3d(_ConvNd):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py in conv2d_forward(self, input, weight)
    340                             _pair(0), self.dilation, self.groups)
    341         return F.conv2d(input, weight, self.bias, self.stride,
--> 342                         self.padding, self.dilation, self.groups)
    343 
    344     def forward(self, input):

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.

This line works perfectly when I run it outside my training loop, so I am a bit confused about what is wrong. Thanks in advance.

yash1994 · April 2, 2020, 8:29am

Hi,

There are two possible reasons for this kind of error:
1. There was a bug in cuDNN that was fixed in later versions, so update cuDNN if possible.
2. The input to self.l_conv might be too large for the computation, so check the batch size and the input dimensions coming from the previous layer.

Also, try running your script on the CPU.
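A minimal sketch of the checks suggested above: it collects the version information that is usually useful in a cuDNN report and runs a convolution of the same kind on the CPU. The helper names `describe_backend` and `run_conv_on_cpu` are hypothetical, and the channel counts and sizes are placeholder values rather than values from the original script.

```python
import torch
import torch.nn as nn


def describe_backend():
    """Collect PyTorch, CUDA and cuDNN version information."""
    return {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda": torch.version.cuda,               # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
    }


def run_conv_on_cpu(in_channels=8, out_channels=1, size=32, batch=2):
    """Run a 3x3 convolution on the CPU (placeholder sizes) and return the output shape."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
    x = torch.randn(batch, in_channels, size, size)
    with torch.no_grad():
        return tuple(conv(x).shape)


info = describe_backend()
print(info)

cpu_shape = run_conv_on_cpu()
print(cpu_shape)
```

If the CPU run succeeds with the real shapes while the GPU run fails, that points toward the cuDNN code path rather than the model definition, although it does not prove it.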

Vendrick17 · April 2, 2020, 8:44am

Thanks for replying. I am working on Colab, and I don’t know whether it is possible to update cuDNN there. I tried reducing the batch size to 16, but the problem is still there. I am going to try on the CPU now.
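As a rough check on the "input too large" hypothesis, the size of the tensors at this layer can be estimated by hand. The sketch below assumes float32 data and uses the shape the poster reports later in the thread (128 input channels, 1 output channel, 128x128 spatial size) together with the reduced batch size of 16. The helper `tensor_bytes` is a hypothetical name.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


batch = 16
input_shape = (batch, 128, 128, 128)   # (N, C_in, H, W)
output_shape = (batch, 1, 128, 128)    # (N, C_out, H, W), padding=1 keeps H and W

input_mib = tensor_bytes(input_shape) / 2**20
output_mib = tensor_bytes(output_shape) / 2**20
print(f"input: {input_mib:.1f} MiB, output: {output_mib:.1f} MiB")
# prints "input: 128.0 MiB, output: 1.0 MiB"
```

This covers only the two tensors at this layer, not the activations held by the rest of the network, so it is an estimate of the convolution's own footprint rather than of total memory use.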

ptrblck · April 2, 2020, 9:19am

Could you post a code snippet to reproduce this issue, please?
We would need the convolution setup (in_channels, out_channels) as well as the shape of the input you are passing to this conv, so that we can debug the issue and check whether this bug was already fixed.
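One way to capture the requested information without editing the model's forward method is a forward pre-hook on the convolution. This is only a sketch: the hook name `record_conv_input` and the small layer sizes are assumptions for illustration, and in the real model the hook would be registered on `netG.l_conv` instead.

```python
import torch
import torch.nn as nn

records = []


def record_conv_input(module, inputs):
    """Store shape, dtype, device and contiguity of the first positional input."""
    x = inputs[0]
    records.append({
        "shape": tuple(x.shape),
        "dtype": x.dtype,
        "device": str(x.device),
        "contiguous": x.is_contiguous(),
    })


# Placeholder layer and input; sizes are illustrative only.
conv = nn.Conv2d(in_channels=8, out_channels=1, kernel_size=3, padding=1)
handle = conv.register_forward_pre_hook(record_conv_input)

x = torch.randn(2, 8, 16, 16)
with torch.no_grad():
    out = conv(x)

handle.remove()  # detach the hook once the information has been collected
print(records[0])
```

Because the hook fires before the convolution runs, the record is still available when the forward call itself raises.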

Vendrick17 · April 2, 2020, 9:53am

Sure, the class is the following:

class Generator(nn.Module):
    def __init__(self,n_gpu):
        super(Generator,self).__init__()
        self.n_gpu = n_gpu
        self.G_ch_mul = G_ch_mul  #Channel width multiplier
        self.resolution = resolution    # Resolution of the OUTPUT
        self.G_depth = G_depth          # Number of resblocks per stage
        self.arch = G_arch(self.G_ch_mul)[resolution]
        self.G_init = G_init

        self.linear = nn.Linear(noise_dim,self.arch['in_channels'][0]*(4**2))  #(N,noise-dim) > (N,64*16*4*4)

        self.blocks = []
        for index in range(len(self.arch['out_channels'])):                  #In total we have 5 cycles
          self.blocks += [[GBlock(input_ch=self.arch['in_channels'][index],  #(16,16,8,4,2) * 64
                                  output_ch=self.arch['in_channels'][index] if g_index==0 else self.arch['out_channels'][index], #((16,16),(16,8),(8,4),(4,2),(2,1))*64  (O1,O2)
                                  upsample=(functools.partial(F.interpolate, scale_factor=2) if self.arch['upsample'][index] and g_index == (self.G_depth-1) else None))]
                         for g_index in range(self.G_depth)]

          if self.arch['attention'][self.arch['resolution'][index]]:
            print('Adding attention layer in G at resolution %d' % self.arch['resolution'][index])
            self.blocks[-1] += [Attention(self.arch['out_channels'][index], SNConv2d)]

        self.blocks = nn.ModuleList([nn.ModuleList(block) for block in self.blocks])

        #self.output_layer = nn.Sequential(nn.BatchNorm2d(self.arch['out_channels'][-1]),
        #                                  nn.ReLU(True),
        #                                  nn.Conv2d(self.arch['out_channels'][-1],channels,kernel_size=3,padding=1))

        self.bn = nn.BatchNorm2d(self.arch['out_channels'][-1])
        self.act = nn.ReLU(True)
        self.l_conv = nn.Conv2d(in_channels=self.arch['out_channels'][-1],out_channels=channels,kernel_size=3,padding=1)
     
    def forward(self,z,y):

        h = self.linear(z)   #(N,noise_dim) > (N,64*4*4*16)
        h = h.view(h.size(0),-1,4,4)  #(N,64*4*4*16) > (N,16*64,4,4)
        for index, blocklist in enumerate(self.blocks):
            # Second inner loop in case block has multiple layers
            for block in blocklist:
              h = block(h, y)
        out = self.bn(h)
        out = self.act(h)
        out = self.l_conv(h)
        out = torch.tanh(out)

        return out

I don’t know how useful it is, since the same line works fine outside my training loop. The convolution has in_channels=128 and out_channels=1, and the input has Win=128 and Hin=128.
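With those values, a standalone reproduction attempt might look like the sketch below. It is not the poster's code: the batch size of 4 is an arbitrary choice, the input is random, and the non-contiguous variant is included only because the error message mentions non-contiguous inputs. The script falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Layer configuration as reported: in_channels=128, out_channels=1, 3x3 kernel, padding=1.
conv = nn.Conv2d(in_channels=128, out_channels=1, kernel_size=3, padding=1).to(device)

# Random input with the reported spatial size; the batch size is an assumption.
x = torch.randn(4, 128, 128, 128, device=device)

# Swapping H and W yields a non-contiguous view with the same shape.
x_nc = x.permute(0, 1, 3, 2)

with torch.no_grad():
    out = conv(x)
    out_nc = conv(x_nc)                    # non-contiguous input
    out_fixed = conv(x_nc.contiguous())    # same data copied into contiguous memory

print(x.is_contiguous(), x_nc.is_contiguous())
print(tuple(out.shape))
```

If only the `conv(x_nc)` call fails on the GPU, calling `.contiguous()` on the tensor before the convolution is a possible workaround to try in the real model.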

Vendrick17 · April 2, 2020, 10:03am

Oh, and I had the same issue before, but it happened when I was calling Discriminator.backward(). I solved it by changing some lines in my code. I was using the following structure:

        h = self.conv1(self.activation(self.bn1(z,y)))

I split the code across separate lines and that error message disappeared, but then the current one appeared.
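For reference, the nested call and its split form compute the same result as long as each step's output is passed to the next step. The sketch below uses a plain `nn.BatchNorm2d` as a stand-in for the poster's `bn1` (which also takes `y`), a non-in-place ReLU, and small placeholder sizes, all of which are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(8)   # stand-in for bn1; the original also takes the label y
act = nn.ReLU()          # not in-place, so x itself is left unchanged
conv = nn.Conv2d(8, 1, kernel_size=3, padding=1)

x = torch.randn(2, 8, 16, 16)

# Nested form, as in the structure quoted above.
nested = conv(act(bn(x)))

# Split form: each line consumes the result of the previous line.
h = bn(x)
h = act(h)
h = conv(h)
split = h

print(torch.allclose(nested, split))
```

In the Generator's forward shown earlier in the thread, the results of `self.bn(h)` and `self.act(h)` are assigned to `out`, but `self.l_conv` is then called on `h`, so the batch norm output appears to be unused there. Chaining the steps as above would be one thing to check.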