This might be off topic, but I could not find where to report a bug on the NVIDIA website. I tried training a GAN based on the pix2pixHD architecture using Amp with opt_level 'O1' on CUDA 10.1, cuDNN 7.6, a PyTorch nightly build, and Ubuntu 18.04.
Code breaks here:
  File "./samplegenerator.py", line 263, in <module>
    scaled_loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: _cublasOpFromChar input should be 't', 'n' or 'c' but got `
This kind of code should reproduce the behaviour:
import torch
import torch.nn as nn
import torch.optim as optim
from apex import amp  # NVIDIA Apex mixed-precision utilities


class Modelis(nn.Module):
    def __init__(self):
        super().__init__()
        # Downsample: 128 -> 256 channels, spatial size halved
        self.conv = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3,
                              stride=2, padding=1)
        # Upsample back: 256 -> 128 channels, with output_padding=1
        self.deconv = nn.ConvTranspose2d(in_channels=256, out_channels=128,
                                         kernel_size=3, stride=2, padding=1,
                                         output_padding=1)

    def forward(self, x):
        x = self.conv(x)
        x = self.deconv(x)
        return x


Criterion = nn.BCEWithLogitsLoss()
netG = Modelis()
netG = netG.cuda()
optimizerG = optim.Adam(netG.parameters(), lr=0.001, betas=(0.5, 0.999))
netG, optimizerG = amp.initialize(netG, optimizerG, opt_level='O1')

for i in range(100):
    batch = (torch.randn(8, 128, 16, 16).cuda() - 0.5) * 2
    output = netG(batch)
    loss = Criterion(output, torch.ones_like(output))
    with amp.scale_loss(loss, optimizerG) as scaled_loss:
        scaled_loss.backward()  # the RuntimeError is raised here
It works in fp32 mode. After some experiments I found that backward breaks on nn.ConvTranspose2d when output_padding is 1 (not 0). Using the PyTorch Docker container (CUDA 10.0, cuDNN 7, PyTorch 1.0), both fp32 and fp16 work fine.
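For context on why output_padding=1 is in the model at all, here is a small sketch of the output-size arithmetic for the two layers, using the shape formulas from the PyTorch docs for nn.Conv2d and nn.ConvTranspose2d. The helper names conv2d_out and conv_transpose2d_out are made up for this illustration and are not part of PyTorch. It only covers shapes; it does not touch the cuBLAS error itself.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output size of nn.Conv2d along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv_transpose2d_out(size, kernel_size, stride=1, padding=0,
                         output_padding=0, dilation=1):
    """Output size of nn.ConvTranspose2d along one spatial dimension."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# Same hyperparameters as the repro: 16x16 input, kernel 3, stride 2, padding 1.
down = conv2d_out(16, kernel_size=3, stride=2, padding=1)
up_pad0 = conv_transpose2d_out(down, kernel_size=3, stride=2, padding=1,
                               output_padding=0)
up_pad1 = conv_transpose2d_out(down, kernel_size=3, stride=2, padding=1,
                               output_padding=1)

print(down, up_pad0, up_pad1)  # 8 15 16
```

With output_padding=0 the transposed convolution returns 15x15 instead of the original 16x16, so setting it to 0 avoids the crash but changes the output shape of the network.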