CUDA 10.1 error using ConvTranspose2d with output_padding=1

This might be off-topic, but I did not find where to report a bug on the NVIDIA website. I tried training a GAN based on the pix2pixHD architecture with Amp 'O1' opt_level, using CUDA 10.1, cuDNN 7.6, a PyTorch nightly build, and Ubuntu 18.04.
Code breaks here:
File "./samplegenerator.py", line 263, in
scaled_loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: _cublasOpFromChar input should be 't', 'n' or 'c' but got `

This kind of code should reproduce the behaviour:

import torch
import torch.nn as nn
import torch.optim as optim
from apex import amp

class Modelis(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3,
                              stride=2, padding=1)
        self.deconv = nn.ConvTranspose2d(in_channels=256, out_channels=128,
                                         kernel_size=3, stride=2, padding=1,
                                         output_padding=1)

    def forward(self, x):
        x = self.conv(x)
        x = self.deconv(x)
        return x

criterion = nn.BCEWithLogitsLoss()
netG = Modelis()
netG = netG.cuda()
optimizerG = optim.Adam(netG.parameters(), lr=0.001, betas=(0.5, 0.999))
netG, optimizerG = amp.initialize(netG, optimizerG, opt_level='O1')

for i in range(100):
    batch = (torch.randn(8, 128, 16, 16).cuda() - 0.5) * 2
    output = netG(batch)
    loss = criterion(output, torch.ones_like(output))

    with amp.scale_loss(loss, optimizerG) as scaled_loss:
        scaled_loss.backward()  # the RuntimeError is raised here in the backward pass

It works in fp32 mode. After some experiments I found that the backward pass breaks on nn.ConvTranspose2d when output_padding is 1 (not 0). Using the PyTorch Docker container (CUDA 10.0, cuDNN 7, PyTorch 1.0), both fp32 and fp16 work fine.
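For reference, here is an even smaller sketch that I would expect to isolate the layer. This is just my assumption that the failure sits in the half-precision backward of ConvTranspose2d itself, so it should reproduce without Apex; I have only verified the Apex path above:

import torch
import torch.nn as nn

# Pure fp16 ConvTranspose2d, no Apex involved (assumption: same failure mode)
deconv = nn.ConvTranspose2d(in_channels=256, out_channels=128, kernel_size=3,
                            stride=2, padding=1, output_padding=1).cuda().half()
x = torch.randn(8, 256, 8, 8, device='cuda', dtype=torch.half, requires_grad=True)

out = deconv(x)       # forward pass runs
out.sum().backward()  # backward is where I would expect the RuntimeError
                      # with output_padding=1; output_padding=0 should be fine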

Thanks for reporting this issue. We’ll look into it.

We could reproduce this issue and are tracking it here.
Thanks @ngimel for the support in tracking down this issue.