I got 'RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation' error

coincheung · February 3, 2019, 11:47am

I am trying online hard example mining (OHEM), and my simplified code looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

class OhemCELoss(nn.Module):
    def __init__(self, thresh, n_min, ignore_lb=255, *args, **kwargs):
        super(OhemCELoss, self).__init__()
        self.thresh = thresh
        self.n_min = n_min
        self.ignore_lb = ignore_lb
        self.criteria = nn.CrossEntropyLoss(ignore_index=ignore_lb)

    def forward(self, logits, labels):
        N, C, H, W = logits.size()
        n_pixs = N * H * W
        logits = logits.permute(0, 2, 3, 1).contiguous().view(-1, C)
        scores = F.softmax(logits, dim=1).cpu()
        labels = labels.view(-1)
        labels_cpu = labels.cpu()
        invalid_mask = labels_cpu==self.ignore_lb
        labels_cpu[invalid_mask] = 0
        picks = scores[torch.arange(n_pixs), labels_cpu]
        picks[invalid_mask] = 1
        sorteds, inds = torch.sort(picks)
        thresh = self.thresh if sorteds[self.n_min]<self.thresh else sorteds[self.n_min]
        labels[picks>thresh] = self.ignore_lb
        loss = self.criteria(logits, labels)
        return loss


if __name__ == '__main__':
    criteria1 = OhemCELoss(thresh=0.7, n_min=16*20*20//16).cuda()
    criteria2 = OhemCELoss(thresh=0.7, n_min=16*20*20//16).cuda()
    net1 = nn.Sequential(
        nn.Conv2d(3, 19, kernel_size=3, stride=2, padding=1),
    )
    net1.cuda()
    net1.train()
    net2 = nn.Sequential(
        nn.Conv2d(3, 19, kernel_size=3, stride=2, padding=1),
    )
    net2.cuda()
    net2.train()

    inten = torch.randn(16, 3, 20, 20).cuda()
    lbs = torch.randint(0, 19, [16, 20, 20]).cuda()
    lbs[1, 10, 10] = 255

    logits1 = net1(inten)
    logits1 = F.interpolate(logits1, inten.size()[2:], mode='bilinear')
    logits2 = net2(inten)
    logits2 = F.interpolate(logits2, inten.size()[2:], mode='bilinear')

    loss1 = criteria1(logits1, lbs)
    loss2 = criteria2(logits2, lbs)
    loss = loss1 + loss2
    loss.backward()

With this code I get the following error:

Traceback (most recent call last):
  File "loss.py", line 79, in <module>
    loss.backward()
  File "/home/zhangzy/.local/lib/python3.5/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhangzy/.local/lib/python3.5/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Where is the in-place operation that causes this error?

JuanFMontesinos · February 3, 2019, 1:03pm

labels_cpu[invalid_mask] = 0

You are masking them

coincheung · February 3, 2019, 3:00pm

Would you please show me how to fix this? I changed it like this:

    def forward(self, logits, labels):
        N, C, H, W = logits.size()
        n_pixs = N * H * W
        logits = logits.permute(0, 2, 3, 1).contiguous().view(-1, C)
        with torch.no_grad():
            scores = F.softmax(logits, dim=1).cpu().detach()
            labels = labels.view(-1)
            labels_cpu = labels.cpu().detach()
            invalid_mask = labels_cpu==self.ignore_lb
            labels_cpu[invalid_mask] = 0
            picks = scores[torch.arange(n_pixs), labels_cpu]
            picks[invalid_mask] = 1
            sorteds, inds = torch.sort(picks)
            thresh = self.thresh if sorteds[self.n_min]<self.thresh else sorteds[self.n_min]
            labels[picks>thresh] = self.ignore_lb
        loss = self.criteria(logits, labels)
        return loss

But the problem still exists.

JuanFMontesinos · February 3, 2019, 3:11pm

It’s not a matter of using that. I don’t know the content of those matrices, but I expect you understand that masking is not backpropagable and thus not learnable.

The problem there is that when you compute labels_cpu you are moving labels to the CPU, but it still refers to labels in memory.
You have to use
labels.cpu().clone().detach()

So depending on what labels[picks>thresh] = self.ignore_lb does, you may not be able to fix it.

If you only want to compute the loss over those features, try masking both tensors (the ground truth and the predictions) instead of modifying them.
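
As a small illustration of the memory-sharing concern, here is a CPU-only sketch (the tensor names and sizes are made up for the example). It shows that a tensor produced by view(-1) shares storage with its source, so an in-place write through the view also changes the source, while clone() gives an independent copy. Note that .cpu() returns the same tensor when it is already on the CPU, whereas for a CUDA tensor it returns a copy.

```python
import torch

labels = torch.randint(0, 19, (2, 4, 4))   # CPU tensor standing in for `lbs`
flat = labels.view(-1)                      # a view: same storage as `labels`
copy = labels.view(-1).clone()              # an independent copy

shares_view = flat.data_ptr() == labels.data_ptr()           # True
shares_cpu = labels.cpu().data_ptr() == labels.data_ptr()    # True here: already on the CPU
shares_clone = copy.data_ptr() == labels.data_ptr()          # False

flat[0] = 255                               # in-place write through the view
leaked = labels[0, 0, 0].item() == 255      # True: the source tensor changed as well
clone_untouched = copy[0].item() != 255     # True: the clone kept its old value
print(shares_view, shares_cpu, shares_clone, leaked, clone_untouched)
```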

coincheung · February 3, 2019, 3:20pm

I just need to change the labels according to the scores of the features. The features that are used to compute the loss are not modified when they are fed to the cross_entropy. Since labels are input tensors, their values should be easy to modify.
I changed the relevant lines to:

            scores = F.softmax(logits, dim=1).cpu().clone().detach()
            labels_cpu = labels.cpu().clone().detach()

But the problem still exists. How can I make it work?

JuanFMontesinos · February 3, 2019, 3:23pm

What about this line?
labels[picks>thresh] = self.ignore_lb

You are assigning a value in place there, right?
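
A minimal CPU-only sketch of what such an in-place label edit can do, assuming the label tensor is edited after a loss has already been computed from it (names and sizes here are made up for illustration). The cross-entropy loss keeps its target for the backward pass, so changing the target between forward and backward bumps its version counter and trips autograd's check:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))

# The forward pass saves `labels` for use in backward.
loss = F.cross_entropy(logits, labels, ignore_index=255)

version_before = labels._version
labels[0] = 255                     # in-place edit after the forward pass
version_after = labels._version     # the version counter has increased

try:
    loss.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(err)                      # the in-place modification error
```

In the original script both criteria receive the same lbs, and labels.view(-1) is a view of it, so the second forward call edits the tensor that the first loss saved. That would explain why the error only shows up at loss.backward().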

coincheung · February 3, 2019, 3:25pm

Yes, I have just found that the code works without this line. I just need to change some of the label values to the ignored value. I tried setting labels.requires_grad = False, but it still does not work. Why can't I modify the labels?

coincheung · February 3, 2019, 3:29pm

I tried this:

        labels[picks>thresh] = self.ignore_lb
        labels = labels.clone().detach()

and it works. I still cannot understand why the values of the training labels cannot be changed.
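
A sketch of the same idea with the copy taken before the edit, so the tensor passed in by the caller is never touched. The class name OhemCELossSketch and the small CPU shapes are assumptions for illustration, not the original code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OhemCELossSketch(nn.Module):
    """Hypothetical variant of OhemCELoss that never mutates its input labels."""

    def __init__(self, thresh, n_min, ignore_lb=255):
        super().__init__()
        self.thresh = thresh
        self.n_min = n_min
        self.ignore_lb = ignore_lb
        self.criteria = nn.CrossEntropyLoss(ignore_index=ignore_lb)

    def forward(self, logits, labels):
        N, C, H, W = logits.size()
        logits = logits.permute(0, 2, 3, 1).contiguous().view(-1, C)
        labels = labels.view(-1).clone()            # private copy of the labels
        with torch.no_grad():
            scores = F.softmax(logits, dim=1)
            invalid = labels == self.ignore_lb
            safe = labels.masked_fill(invalid, 0)   # out-of-place, valid class ids only
            picks = scores.gather(1, safe.unsqueeze(1)).squeeze(1)
            picks = picks.masked_fill(invalid, 1.0)
            sorteds, _ = torch.sort(picks)
            thresh = max(self.thresh, sorteds[self.n_min].item())
            labels[picks > thresh] = self.ignore_lb  # edits only the private copy
        return self.criteria(logits, labels)


torch.manual_seed(0)
logits1 = torch.randn(2, 19, 8, 8, requires_grad=True)
logits2 = torch.randn(2, 19, 8, 8, requires_grad=True)
lbs = torch.randint(0, 19, (2, 8, 8))
lbs[1, 4, 4] = 255
lbs_before = lbs.clone()

crit = OhemCELossSketch(thresh=0.7, n_min=2 * 8 * 8 // 16)
loss = crit(logits1, lbs) + crit(logits2, lbs)   # same labels used twice
loss.backward()                                  # no in-place error expected
```

Because the clone happens first, the in-place assignment touches only a tensor that no earlier loss has saved, and lbs keeps its original values for the next call.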

JuanFMontesinos · February 3, 2019, 3:45pm

Because deep learning is based on computing gradients.
If you just manually change the value of a tensor, you cannot compute the gradient because it simply does not exist.
The closest thing you can do is
criteria(scores[p>t], labels[p>t])
You can compute the loss over those values, but you cannot modify the tensor unless you apply a continuous function over those values.
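
A rough sketch of that selection approach, assuming a per-pixel cross-entropy with reduction='none' and a boolean mask over the hard pixels. The helper name ohem_loss_by_selection and its default arguments are hypothetical, and the shapes are small CPU stand-ins:

```python
import torch
import torch.nn.functional as F


def ohem_loss_by_selection(logits, labels, thresh=0.7, n_min=8, ignore_lb=255):
    """Hypothetical helper: mean cross-entropy over hard pixels, inputs left unmodified."""
    C = logits.size(1)
    logits = logits.permute(0, 2, 3, 1).reshape(-1, C)
    labels = labels.reshape(-1)
    # Per-pixel loss; pixels labelled ignore_lb get a loss of 0.
    pixel_loss = F.cross_entropy(logits, labels, ignore_index=ignore_lb,
                                 reduction='none')
    with torch.no_grad():
        valid = labels != ignore_lb
        # For valid pixels, exp(-loss) is the softmax probability of the true class.
        picks = torch.exp(-pixel_loss).masked_fill(~valid, 1.0)
        sorteds, _ = torch.sort(picks)
        cutoff = max(thresh, sorteds[n_min].item())
        hard = valid & (picks <= cutoff)       # pixels that stay in the loss
    return pixel_loss[hard].mean()


torch.manual_seed(0)
logits = torch.randn(2, 19, 8, 8, requires_grad=True)
lbs = torch.randint(0, 19, (2, 8, 8))
lbs[1, 4, 4] = 255
lbs_before = lbs.clone()

loss = ohem_loss_by_selection(logits, lbs)
loss.backward()
```

The mask only selects which per-pixel losses are averaged, so gradients flow through the selected entries and the label tensor is never written to.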

coincheung · February 3, 2019, 4:02pm

I see. Many thanks !

pgu-nd · August 17, 2019, 5:22am

Could you please show me how to fix this problem? I changed labels = labels.clone() to labels = labels.clone().detach(), but it still does not work. Thank you so much.