"Variable modified by an inplace operation" error in loss.backward()

I am calling loss.backward() in my code and get an error saying that a variable needed for gradient computation has been modified by an inplace operation.
The loss is calculated using the following classes. Any help with this issue would be appreciated.

import torch
from torch import nn
from torchvision.models.vgg import vgg16
import numpy as np

class GeneratorLoss(nn.Module):
    def init(self):
        super(GeneratorLoss, self).init()
        vgg = vgg16(pretrained=True)
        loss_network = nn.Sequential(*list(vgg.features)[:31]).eval()
        for param in loss_network.parameters():
            param.requires_grad = False
        self.loss_network = loss_network
        self.mse_loss = nn.MSELoss()
        self.tv_loss = TVLoss()

    def forward(self, out_images, target_images, class_det_loss, lambda_class):
        # Perception Loss
        perception_loss = self.mse_loss(self.loss_network(out_images), self.loss_network(target_images))
        # Image Loss
        image_loss = self.mse_loss(out_images, target_images)
        # TV Loss
        tv_loss = self.tv_loss(out_images)
        result = torch.tensor(0.0, requires_grad=True)
        result = image_loss + 0.006 * perception_loss + 2e-8 * tv_loss + lambda_class*class_det_loss

        return result

class TVLoss(nn.Module):
    def init(self, tv_loss_weight=1):
        super(TVLoss, self).init()
        self.tv_loss_weight = tv_loss_weight

    def forward(self, x):
        batch_size = x.size()[0]
        h_x = x.size()[2]
        w_x = x.size()[3]
        count_h = self.tensor_size(x[:, :, 1:, :])
        count_w = self.tensor_size(x[:, :, :, 1:])
        h_tv = torch.pow((x[:, :, 1:, :] - x[:, :, :h_x - 1, :]), 2).sum()
        w_tv = torch.pow((x[:, :, :, 1:] - x[:, :, :, :w_x - 1]), 2).sum()
        return self.tv_loss_weight * 2 * (h_tv / count_h + w_tv / count_w) / batch_size

    def tensor_size(t):
        return t.size()[1] * t.size()[2] * t.size()[3]

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 10]], which is output 0 of TBackward, is at version 2; expected version 1 instead.

Hi @S_M,

Given the shape of your Tensor causing the in-place error is [512,10], I’d assume it’s the output layer that classifies the 10 classes?

  1. Can you check that the vgg16 doesn’t have any in-place torch.nn.ReLU() calls?
  2. Remove the line result = torch.tensor(0.0, requires_grad=True), which may cause the in-place error and is immediately overwritten anyway.

Also, make sure to correctly define your constructor in both GeneratorLoss and TVLoss classes and initialize the parent class correctly. So change def init(*args) to def __init__(*args) and super(GeneratorLoss, self).init() to super(GeneratorLoss, self).__init__(). An example can be found here.
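As a minimal sketch of the corrected constructor, here is the TVLoss class from the question with `__init__` and `super().__init__()` spelled out. Marking `tensor_size` as a `@staticmethod` is an assumption on my part: the posted code calls it as `self.tensor_size(...)` but defines it with a single argument, which would fail without the decorator.

```python
import torch
from torch import nn


class TVLoss(nn.Module):
    def __init__(self, tv_loss_weight=1):
        # Double underscores on both sides: Python never calls a method named plain `init`.
        super().__init__()
        self.tv_loss_weight = tv_loss_weight

    def forward(self, x):
        batch_size, _, h_x, w_x = x.size()
        count_h = self.tensor_size(x[:, :, 1:, :])
        count_w = self.tensor_size(x[:, :, :, 1:])
        h_tv = torch.pow(x[:, :, 1:, :] - x[:, :, :h_x - 1, :], 2).sum()
        w_tv = torch.pow(x[:, :, :, 1:] - x[:, :, :, :w_x - 1], 2).sum()
        return self.tv_loss_weight * 2 * (h_tv / count_h + w_tv / count_w) / batch_size

    @staticmethod  # assumption: needed so that self.tensor_size(t) passes only `t`
    def tensor_size(t):
        return t.size()[1] * t.size()[2] * t.size()[3]
```

The same two changes apply to GeneratorLoss.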

Hi @AlphaBetaGamma96,
Thank you so much for your reply.
The final layer has size 512x10, for a 10-class classification problem.

  1. The vgg16 comes from torchvision.models.vgg import vgg16. How can I replace all ReLU(inplace=True) with ReLU(inplace=False) for a pretrained model?
  2. The line result = torch.tensor(0.0, requires_grad=True) has been removed.
    The constructor definitions have been corrected, but the issue is still not solved.

The only thing that helped me with this issue was switching to an older PyTorch version (1.4). Please be aware, you may run into some compatibility issues with some other packages.

I replaced the nn.ReLU(inplace=True) layers using the following code, but unfortunately the error persists.
vgg.features[1] = nn.ReLU(inplace=False)
vgg.features[3] = nn.ReLU(inplace=False)
vgg.features[6] = nn.ReLU(inplace=False)
vgg.features[8] = nn.ReLU(inplace=False)
vgg.features[11] = nn.ReLU(inplace=False)
vgg.features[13] = nn.ReLU(inplace=False)
vgg.features[15] = nn.ReLU(inplace=False)
vgg.features[18] = nn.ReLU(inplace=False)
vgg.features[20] = nn.ReLU(inplace=False)
vgg.features[22] = nn.ReLU(inplace=False)
vgg.features[25] = nn.ReLU(inplace=False)
vgg.features[27] = nn.ReLU(inplace=False)
vgg.features[29] = nn.ReLU(inplace=False)
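The same replacement can be done without listing the indices by hand. The sketch below flips the `inplace` flag on every nn.ReLU it finds; the helper name `disable_inplace_relu` is hypothetical, and the small nn.Sequential only stands in for the pretrained model so the example runs without downloading weights.

```python
from torch import nn


def disable_inplace_relu(model: nn.Module) -> int:
    """Switch every in-place nn.ReLU in `model` to out-of-place; return how many changed."""
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.ReLU) and module.inplace:
            module.inplace = False
            changed += 1
    return changed


# Small stand-in network; with the model from the question this would be called on `vgg`.
net = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(inplace=True),
    nn.Conv2d(8, 8, 3), nn.ReLU(inplace=True),
)
num_changed = disable_inplace_relu(net)
```

Because only a flag is changed, the pretrained weights are left untouched.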

Reverting PyTorch to an older version isn’t an appropriate solution. The problem is that the final output layer of the classifier is being modified in-place for some reason, or perhaps the layer after it is, since with reverse-mode AD the gradients are multiplied in reverse order.

Can you run your code within a torch.autograd.set_detect_anomaly context manager? That should highlight what’s causing the error. You can find out more about torch.autograd.set_detect_anomaly here

An example is to do,

with torch.autograd.set_detect_anomaly(True):
  #place your code here... (forward pass and backward(), all indented correctly)

and make sure to paste the stack trace into your reply too, as that’ll help me debug your problem.
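For reference, a self-contained sketch of what anomaly detection looks like on a deliberately broken toy computation. The tensors and the function name are made up for illustration and have nothing to do with the model in the question. With the forward pass inside the context manager, PyTorch also prints a traceback pointing at the forward operation whose backward failed.

```python
import torch


def backward_with_anomaly_detection() -> str:
    """Trigger an in-place autograd error on purpose and return its message."""
    with torch.autograd.set_detect_anomaly(True):
        x = torch.ones(3, requires_grad=True)
        y = x.exp()   # exp saves its output for the backward pass
        y.add_(1.0)   # the in-place change invalidates that saved output
        try:
            y.sum().backward()
        except RuntimeError as err:
            return str(err)
    return ""


message = backward_with_anomaly_detection()
```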

Also, why are you doing this?

for param in loss_network.parameters():
  param.requires_grad = False

If you want to evaluate your model without updating its parameters, you can just run your code within a torch.no_grad() context manager (more detail about that here)
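The two approaches behave differently when the frozen network is part of a loss, so a small comparison may be useful. In this sketch an nn.Linear layer is a hypothetical stand-in for loss_network, and `x` stands in for the generator output. With requires_grad = False on the parameters, gradients still flow back to the input; under torch.no_grad() no graph is recorded at all.

```python
import torch
from torch import nn

frozen = nn.Linear(4, 2)  # hypothetical stand-in for loss_network
for param in frozen.parameters():
    param.requires_grad = False  # parameters are frozen, as in the question

x = torch.randn(3, 4, requires_grad=True)  # stands in for the generator output

# Frozen parameters: the graph is still built, so gradients reach the input.
out_frozen = frozen(x).sum()
out_frozen.backward()
grad_reaches_input = x.grad is not None

# no_grad: nothing is recorded, so this result cannot be backpropagated.
with torch.no_grad():
    out_no_grad = frozen(x).sum()
no_grad_tracks = out_no_grad.requires_grad
```

So for a perceptual loss that has to send gradients back to the generator, freezing the parameters is probably the variant you want for the branch that receives out_images.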

Besides what @AlphaBetaGamma96 explained, you might also be running into this issue, where stale forward activations are used to calculate gradients with the already (in-place) updated parameters. Could you check if your use case is similar or the same?
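A minimal sketch of that pattern, assuming a toy two-layer model rather than the actual training loop (all names here are hypothetical). Two losses share one forward pass; if optimizer.step() runs between the two backward calls, the second backward finds that the head weights saved during the forward pass have since been changed in place.

```python
import torch
from torch import nn


def second_backward_fails(step_between: bool) -> bool:
    """Return True if the second backward raises a RuntimeError."""
    torch.manual_seed(0)
    body = nn.Linear(16, 512)
    head = nn.Linear(512, 10)  # its transposed weight has shape [512, 10]
    params = list(body.parameters()) + list(head.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1)

    out = head(body(torch.randn(4, 16)))  # one forward pass shared by both losses
    loss_a = out.mean()
    loss_b = out.pow(2).mean()

    loss_a.backward(retain_graph=True)
    if step_between:
        optimizer.step()  # in-place weight update; the saved graph is now stale
    try:
        loss_b.backward()
    except RuntimeError:
        return True
    return False


fails_with_step = second_backward_fails(step_between=True)
fails_without_step = second_backward_fails(step_between=False)
```

If this matches your loop, the usual ways out are to run all backward calls before any optimizer.step(), or to redo the forward pass after the step.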

This might explain the useless results. Unfortunately, I’ve tried almost all the remaining solutions (for weeks) to fix it without success.

The problem was solved by adding .detach() to each loss term that carries a gradient and is returned to the main code.
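Roughly, the fix looks like the sketch below. This is a reconstruction under assumptions, not the original training code: the two nn.Linear layers are hypothetical stand-ins for the generator and the classifier, and the update order is guessed from the discussion above.

```python
import torch
from torch import nn

torch.manual_seed(0)
generator = nn.Linear(8, 8)    # hypothetical stand-in for the generator
classifier = nn.Linear(8, 10)  # hypothetical stand-in for the classification head
opt_c = torch.optim.SGD(classifier.parameters(), lr=0.1)

fake = generator(torch.randn(4, 8))
class_det_loss = classifier(fake).pow(2).mean()

# The classifier is updated first, in a separate step.
opt_c.zero_grad()
class_det_loss.backward(retain_graph=True)
opt_c.step()  # in-place weight update

# Detached, the term is a constant: it changes the loss value but sends no gradient back.
image_loss = fake.pow(2).mean()
lambda_class = 0.5
result = image_loss + lambda_class * class_det_loss.detach()
result.backward()  # no in-place error: the classifier graph is not traversed
```

Note that a detached term no longer contributes any gradient to the generator. If the generator is supposed to learn from class_det_loss, recomputing that loss with a fresh forward pass after the optimizer step is the alternative to consider.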