One of the variables needed for gradient computation has been modified by an inplace op

Hi, I would be glad for some help solving this issue.

The problem is:
# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

This is my code:

def train_single_scale(netD, netG, reals, Gs, Zs, in_s, NoiseAmp, opt, centers=None):
      real = reals[len(Gs)]
      opt.nzx = real.shape[2]  # +(opt.ker_size-1)*(opt.num_layer)
      opt.nzy = real.shape[3]  # +(opt.ker_size-1)*(opt.num_layer)
      opt.receptive_field = opt.ker_size + ((opt.ker_size - 1) * (opt.num_layer - 1)) * opt.stride
      pad_noise = int(((opt.ker_size - 1) * opt.num_layer) / 2)
      pad_image = int(((opt.ker_size - 1) * opt.num_layer) / 2)
      if opt.mode == 'animation_train':
          opt.nzx = real.shape[2] + (opt.ker_size - 1) * (opt.num_layer)
          opt.nzy = real.shape[3] + (opt.ker_size - 1) * (opt.num_layer)
          pad_noise = 0
      m_noise = nn.ZeroPad2d(int(pad_noise))
      m_image = nn.ZeroPad2d(int(pad_image))
  
      alpha = opt.alpha
  
      fixed_noise = functions.generate_noise([opt.nc_z, opt.nzx, opt.nzy], device=opt.device)
      z_opt = torch.full(fixed_noise.shape, 0, device=opt.device, dtype=torch.float32)
      z_opt = m_noise(z_opt)
  
      # setup optimizer
      optimizerD = optim.Adam(netD.parameters(), lr=opt.lr_d, betas=(opt.beta1, 0.999))
      optimizerG = optim.Adam(netG.parameters(), lr=opt.lr_g, betas=(opt.beta1, 0.999))
      schedulerD = torch.optim.lr_scheduler.MultiStepLR(optimizer=optimizerD, milestones=[1600], gamma=opt.gamma)
      schedulerG = torch.optim.lr_scheduler.MultiStepLR(optimizer=optimizerG, milestones=[1600], gamma=opt.gamma)
  
      errD2plot = []
      errG2plot = []
      errG2norecplot = []
      errG2recplot = []
      D_real2plot = []
      D_fake2plot = []
      D_penality = []
      z_opt2plot = []
  
      for epoch in range(opt.niter):
          if (Gs == []) & (opt.mode != 'SR_train'):
              z_opt = functions.generate_noise([1, opt.nzx, opt.nzy], device=opt.device)
              z_opt = m_noise(z_opt.expand(1, 3, opt.nzx, opt.nzy))
              noise_ = functions.generate_noise([1, opt.nzx, opt.nzy], device=opt.device)
              noise_ = m_noise(noise_.expand(1, 3, opt.nzx, opt.nzy))
          else:
              noise_ = functions.generate_noise([opt.nc_z, opt.nzx, opt.nzy], device=opt.device)
              noise_ = m_noise(noise_)
  
          ############################
          # (1) Update D network: maximize D(x) + D(G(z))
          ###########################
          for j in range(opt.Dsteps):
              # train with real
              netD.zero_grad()
  
              output = netD(real).to(opt.device)
              # D_real_map = output.detach()
              errD_real = -output.mean()  # -a
              errD_real.backward(retain_graph=True)
              D_x = -errD_real.item()
  
              # train with fake
              if (j == 0) & (epoch == 0):
                  if (Gs == []) & (opt.mode != 'SR_train'):
                      prev = torch.full([1, opt.nc_z, opt.nzx, opt.nzy], 0, device=opt.device, dtype=torch.float32)
                      in_s = prev
                      prev = m_image(prev)
                      z_prev = torch.full([1, opt.nc_z, opt.nzx, opt.nzy], 0, device=opt.device, dtype=torch.float32)
                      z_prev = m_noise(z_prev)
                      opt.noise_amp = 1
                  elif opt.mode == 'SR_train':
                      z_prev = in_s
                      criterion = nn.MSELoss()
                      RMSE = torch.sqrt(criterion(real, z_prev))
                      opt.noise_amp = opt.noise_amp_init * RMSE
                      z_prev = m_image(z_prev)
                      prev = z_prev
                  else:
                      prev = draw_concat(Gs, Zs, reals, NoiseAmp, in_s, 'rand', m_noise, m_image, opt)
                      prev = m_image(prev)
                      z_prev = draw_concat(Gs, Zs, reals, NoiseAmp, in_s, 'rec', m_noise, m_image, opt)
                      criterion = nn.MSELoss()
                      RMSE = torch.sqrt(criterion(real, z_prev))
                      opt.noise_amp = opt.noise_amp_init * RMSE
                      z_prev = m_image(z_prev)
              else:
                  prev = draw_concat(Gs, Zs, reals, NoiseAmp, in_s, 'rand', m_noise, m_image, opt)
                  prev = m_image(prev)
  
              if opt.mode == 'paint_train':
                  prev = functions.quant2centers(prev, centers)
                  plt.imsave('%s/prev.png' % (opt.outf), functions.convert_image_np(prev), vmin=0, vmax=1)
  
              if (Gs == []) & (opt.mode != 'SR_train'):
                  noise = noise_
              else:
                  noise = opt.noise_amp * noise_ + prev
  
              fake = netG(noise.detach(), prev)
              output = netD(fake.detach())
              errD_fake = output.mean()
              errD_fake.backward(retain_graph=True)
              D_G_z = output.mean().item()
  
              gradient_penalty = functions.calc_gradient_penalty(netD, real, fake, opt.lambda_grad, opt.device)
              gradient_penalty.backward()
  
              D_penality.append(gradient_penalty)
              errD = errD_real + errD_fake + gradient_penalty
              optimizerD.step()
  
          errD2plot.append(errD.detach())
  
          ############################
          # (2) Update G network: maximize D(G(z))
          ###########################
  
          for j in range(opt.Gsteps):
              netG.zero_grad()
              output = netD(fake)
              # D_fake_map = output.detach()
              errG = -output.mean()
              errG.backward(retain_graph=True)
              if alpha != 0:
                  loss = nn.MSELoss()
                  if opt.mode == 'paint_train':
                      z_prev = functions.quant2centers(z_prev, centers)
                      plt.imsave('%s/z_prev.png' % (opt.outf), functions.convert_image_np(z_prev), vmin=0, vmax=1)
                  Z_opt = opt.noise_amp * z_opt + z_prev
                  rec_loss = alpha * loss(netG(Z_opt.detach(), z_prev), real)
                  rec_loss.backward(retain_graph=True)
                  rec_loss = rec_loss.detach()
              else:
                  Z_opt = z_opt
                  rec_loss = 0
  
              optimizerG.step()
          errG2norecplot.append(errG.detach())
          errG2recplot.append(rec_loss)
          errG2plot.append(errG.detach() + rec_loss)
          D_real2plot.append(D_x)
          D_fake2plot.append(D_G_z)
          z_opt2plot.append(rec_loss)
  
          if epoch % 25 == 0 or epoch == (opt.niter - 1):
              print('scale %d:[%d/%d]' % (len(Gs), epoch, opt.niter))
  
          if epoch % 500 == 0 or epoch == (opt.niter - 1):
              plt.imsave('%s/fake_sample.png' % (opt.outf), functions.convert_image_np(fake.detach()), vmin=0, vmax=1)
              plt.imsave('%s/G(z_opt).png' % (opt.outf),
                         functions.convert_image_np(netG(Z_opt.detach(), z_prev).detach()), vmin=0, vmax=1)
              # plt.imsave('%s/D_fake.png'   % (opt.outf), functions.convert_image_np(D_fake_map))
              # plt.imsave('%s/D_real.png'   % (opt.outf), functions.convert_image_np(D_real_map))
              # plt.imsave('%s/z_opt.png'    % (opt.outf), functions.convert_image_np(z_opt.detach()), vmin=0, vmax=1)
              # plt.imsave('%s/prev.png'     %  (opt.outf), functions.convert_image_np(prev), vmin=0, vmax=1)
              # plt.imsave('%s/noise.png'    %  (opt.outf), functions.convert_image_np(noise), vmin=0, vmax=1)
              # plt.imsave('%s/z_prev.png'   % (opt.outf), functions.convert_image_np(z_prev), vmin=0, vmax=1)
  
              torch.save(z_opt, '%s/z_opt.pth' % (opt.outf))
  
              print('Generator loss:')
              plt.plot(list(range(0, len(errG2plot))), errG2plot)
              plt.show()
              print('Discriminator real loss:')
              plt.plot(list(range(0, len(D_real2plot))), D_real2plot)
              plt.show()
              print('Discriminator fake loss:')
              plt.plot(list(range(0, len(D_fake2plot))), D_fake2plot)
              plt.show()
  
          schedulerD.step()
          schedulerG.step()
  
      functions.save_networks(netG, netD, z_opt, opt)
      return z_opt, in_s, netG

Here is where the problem occurs:
[screenshot of the error location]

@albanD, I'd be grateful for your help.

Hey,

Can you follow the instruction you get in the error message and enable anomaly mode, then report back here the second stack trace that you get? Also, the full error message (with the failing Tensor size) would be useful.
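For reference, anomaly mode is the standard torch.autograd API; the tensor below is just a placeholder, not your code:

import torch

# global switch: a failing backward will also print the forward trace that created the bad op
torch.autograd.set_detect_anomaly(True)

# or scope it to a specific backward with the context manager
x = torch.randn(3, requires_grad=True)
with torch.autograd.detect_anomaly():
    (x * 2).sum().backward()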


OK, here is the full error with its stack trace after enabling detect_anomaly:

[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:60] Warning: Error detected in MkldnnConvolutionBackward. Traceback of forward call that caused the error:
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\pydevd.py", line 2127, in <module>
    main()
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\pydevd.py", line 2118, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\pydevd.py", line 1427, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "F:/PycharmProjects/MySinGan/main_train.py", line 22, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "F:\PycharmProjects\MySinGan\SinGAN\training.py", line 54, in train
    z_curr, in_s, G_curr = train_single_scale(D_curr, G_curr, reals, Gs, Zs, in_s, NoiseAmp, opt)
  File "F:\PycharmProjects\MySinGan\SinGAN\training.py", line 171, in train_single_scale
    fake = netG(noise.detach(), prev)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\PycharmProjects\MySinGan\SinGAN\models.py", line 65, in forward
    x = self.tail(x)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\nn\modules\conv.py", line 416, in _conv_forward
    self.padding, self.dilation, self.groups)
 (function print_stack)
Traceback (most recent call last):
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "F:\JetBrains\PyCharm 2019.3.4\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "F:/PycharmProjects/MySinGan/main_train.py", line 22, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "F:\PycharmProjects\MySinGan\SinGAN\training.py", line 54, in train
    z_curr, in_s, G_curr = train_single_scale(D_curr, G_curr, reals, Gs, Zs, in_s, NoiseAmp, opt)
  File "F:\PycharmProjects\MySinGan\SinGAN\training.py", line 195, in train_single_scale
    errG.backward(retain_graph=True)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "F:\anaconda3\envs\MySinGan\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I guess the second stack trace is already in it (correct me if I'm wrong).

These are the Generator and Discriminator models:


class ConvBlock(nn.Sequential):
    def __init__(self, in_channel, out_channel, ker_size, padd, stride):
        super(ConvBlock, self).__init__()
        self.add_module('conv', nn.Conv2d(in_channel, out_channel, kernel_size=ker_size, stride=stride, padding=padd))
        self.add_module('norm', nn.BatchNorm2d(out_channel))
        self.add_module('LeakyRelu', nn.LeakyReLU(0.2, inplace=True))

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv2d') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('Norm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)
   
class WDiscriminator(nn.Module):
    def __init__(self, opt):
        super(WDiscriminator, self).__init__()
        self.is_cuda = torch.cuda.is_available()
        N = int(opt.nfc)
        self.head = ConvBlock(opt.nc_im,N,opt.ker_size,opt.padd_size,1)
        self.body = nn.Sequential()
        for i in range(opt.num_layer-2):
            N = int(opt.nfc/pow(2,(i+1)))
            block = ConvBlock(max(2*N,opt.min_nfc),max(N,opt.min_nfc),opt.ker_size,opt.padd_size,1)
            self.body.add_module('block%d'%(i+1),block)
        self.tail = nn.Conv2d(max(N,opt.min_nfc),1,kernel_size=opt.ker_size,stride=1,padding=opt.padd_size)

    def forward(self,x):
        x = self.head(x)
        x = self.body(x)
        x = self.tail(x)
        return x


class GeneratorConcatSkip2CleanAdd(nn.Module):
    def __init__(self, opt):
        super(GeneratorConcatSkip2CleanAdd, self).__init__()
        self.is_cuda = torch.cuda.is_available()
        N = opt.nfc
        self.head = ConvBlock(opt.nc_im,N,opt.ker_size,opt.padd_size,1) #GenConvTransBlock(opt.nc_z,N,opt.ker_size,opt.padd_size,opt.stride)
        self.body = nn.Sequential()
        for i in range(opt.num_layer-2):
            N = int(opt.nfc/pow(2,(i+1)))
            block = ConvBlock(max(2*N,opt.min_nfc),max(N,opt.min_nfc),opt.ker_size,opt.padd_size,1)
            self.body.add_module('block%d'%(i+1),block)
        self.tail = nn.Sequential(
            nn.Conv2d(max(N,opt.min_nfc),opt.nc_im,kernel_size=opt.ker_size,stride =1,padding=opt.padd_size),
            nn.Tanh()
        )
    def forward(self,x,y):
        x = self.head(x)
        x = self.body(x)
        x = self.tail(x)
        ind = int((y.shape[2]-x.shape[2])/2)
        y = y[:,:,ind:(y.shape[2]-ind),ind:(y.shape[3]-ind)]
        return x+y

Tell me if you need the code that calls train_single_scale; it is part of the training loop, but the loss computations are done only in the part shown above.

Also, you may need calc_gradient_penalty:


def calc_gradient_penalty(netD, real_data, fake_data, LAMBDA, device):
    alpha = torch.rand(1, 1)
    alpha = alpha.expand(real_data.size())
    alpha = alpha.to(device)

    interpolates = alpha * real_data + ((1 - alpha) * fake_data)
    interpolates = interpolates.to(device)
    interpolates = torch.autograd.Variable(interpolates, requires_grad=True)

    disc_interpolates = netD(interpolates)

    gradients = torch.autograd.grad(outputs=disc_interpolates, inputs=interpolates,
                                    grad_outputs=torch.ones(disc_interpolates.size()).to(device),
                                    create_graph=True, retain_graph=True, only_inputs=True)[0]
    gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean() * LAMBDA
    return gradient_penalty

But the code doesn't fail there.

P.S. Because of the dimensions, it seems the problem is somewhere in the convolutions (3x3 kernels), but that seems odd to me (maybe I'm wrong).

Also, it would probably be useful if I could somehow trace the version of the tensor it complains about, but I guess that even if I locate the problem this way, I will not know how to solve it.

If you need more code, feel free to ask. Thank you, @albanD.

Oh, another important thing I forgot to mention:
IT HAPPENS ON THE SECOND ITERATION OF THE MENTIONED PIECE OF CODE.

Hi, you could try to use new variables, e.g. input_mod = module(input) instead of input = module(input). I used to have similar problems, so I just stopped using inplace operations.

@nwn, are you talking about the forward pass of the models?

From the code sample and the error, it seems the problem is the fake Tensor: you do its forward pass on the D-update side, then update the D network, and then re-use that fake variable on the G-update side.
The problem is that the D network update changed the conv weights inplace, so you cannot call backward anymore.


@albanD, didn't I detach it by calling backward on errD_fake?
Hmm... although it is used in the gradient penalty, with backward called on 'fake'.
So you were talking about fake being modified in the 'gradient penalty' backward.
But the Generator's 3 steps occur only after the Discriminator has finished its 3 steps,
and the first iteration is successful, as I wrote in the clarification, so it doesn't make sense to me.

P.S. Just a clarification of when the error happens:
The for loop over Dsteps finishes its 3 steps, then the for loop over Gsteps runs once successfully and fails with the error on the second iteration (out of the 3 steps it is supposed to do). All of this happens in the first epoch (each epoch does 3 steps on each of the models, Generator and Discriminator).
Any suggestion on how to reproduce the behavior it had on torch 1.4.0 would also be great.

didn't I detach it by calling backward on errD_fake?

That doesn't change the fact that you re-use the same "fake" in the second loop.
Let me update my comment above slightly (see the minimal sketch right after the list):

  • You generate fake based on netG in the first loop
  • You use fake in first iteration of second loop
  • You update netG at the end of the first iteration of second loop
  • You use again fake (without recomputing it based on updated netG) in the second iteration of the second loop
  • You try to backward, this fails because the fake you used here is based on the old weights of netG.

@albanD, now I see: the previous weights were updated, and it is no longer correct to backward with respect to them.
OK, so my question is how to keep the code as close as possible to its previous behavior on 1.4.0. Should I recompute fake on each of the Gsteps, which is logical but a pretty significant change relative to the previous behavior, or is there another solution?
What happened in torch 1.4.0 when I did this? (Maybe I can read about it somewhere, or do you suggest comparing the two torch versions' repositories?)

Thanks.

P.S. Generally, I've heard it is common practice to do a few D and G steps in one training iteration; I'm not sure how to do it correctly with torch, especially in this version.

You can simply add fake = netG(noise.detach(), prev) in the second loop to recompute fake based on the updated weights at each iteration.
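For example, a sketch of just the start of the Gsteps loop (the reconstruction-loss part and optimizerG.step() stay exactly as they are):

for j in range(opt.Gsteps):
    netG.zero_grad()
    fake = netG(noise.detach(), prev)  # recompute fake with the weights updated in the previous G step
    output = netD(fake)
    errG = -output.mean()
    errG.backward(retain_graph=True)
    # ... rec_loss computation and optimizerG.step() unchanged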

What happened in torch 1.4.0 when I did this?

The inplace checks for optimizer.step() were broken back then. We fixed them.
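For illustration, the version counter that the error message compares can be inspected through the internal ._version attribute (undocumented, shown here only to make the mechanism visible); optimizer.step() now bumps it, which is what the fixed check catches:

import torch
import torch.nn as nn
import torch.optim as optim

lin = nn.Linear(4, 4)
opt = optim.SGD(lin.parameters(), lr=0.1)

x = torch.randn(2, 4, requires_grad=True)
out = lin(x).sum()
out.backward(retain_graph=True)

print(lin.weight._version)  # version before the update, e.g. 0
opt.step()                  # in-place update of the parameters
print(lin.weight._version)  # incremented by the in-place update

out.backward()              # RuntimeError: the saved weight was modified in place since the forward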


Oh, so it was meant to behave like this from the beginning, and this was just a technical fix. So recomputing fake is the correct solution from the perspective of the originally intended behavior, is that correct?

Yes, this was a bug on our end.
Recomputing fake is the right thing to do here, yes!
