Why are the gradients given by PyTorch 0.4.0 and 0.4.1 different after calling backward?

import torch
import torch.nn as nn

x = torch.ones([1], requires_grad=True)
w = torch.tensor([0.2], requires_grad=True)
print('x====: {}'.format(x))
print('w====: {}'.format(w))

def f(x):
    x = x.cuda()
    return torch.pow(x, 2).sum()
    # return x*x*x.sum()

def SGD(grad, lr=0.2):
    return -lr*grad

def optimizer(grad):
    return -w*grad

sum_losses = 0

for i in range(2):

    loss = f(x)
    # print(i, loss)

    sum_losses += loss
    loss.backward(torch.ones_like(loss), retain_graph=True)
    print('x.grad: {}'.format(x.grad))
    print('w1.grad: {}'.format(w.grad))

    update = optimizer(x.grad)
    x = x + update
    print('x-:{}'.format(x))
    print('x-.grad: {}'.format(x.grad))

    x.retain_grad()
    update.retain_grad()

sum_losses.backward()
print('w.grad: {}'.format(w.grad))

w_update = SGD(w.grad, lr=0.1)
w = w + w_update
print('w====: {}'.format(w))

PyTorch 0.4.1 prints the following:

x====: tensor([1.], requires_grad=True)
w====: tensor([0.2000], requires_grad=True)
x.grad: tensor([2.])
w1.grad: None
x-:tensor([0.6000], grad_fn=<ThAddBackward>)
x-.grad: None
x.grad: tensor([1.2000])
w1.grad: tensor([-3.8400])
x-:tensor([0.3600], grad_fn=<ThAddBackward>)
x-.grad: None
w.grad: tensor([-7.6800])
w====: tensor([0.9680], grad_fn=<ThAddBackward>)

PyTorch 0.4.0 prints the following:

x====: tensor([ 1.])
w====: tensor([ 0.2000])
x.grad: tensor([ 2.])
w1.grad: None
x-:tensor([ 0.6000])
x-.grad: None
x.grad: tensor([ 1.2000])
w1.grad: tensor([-2.4000])
x-:tensor([ 0.3600])
x-.grad: None
w.grad: tensor([-6.2400])
w====: tensor([ 0.8240])

I changed the optimizer function as follows and the problem was solved, but I'm still confused:

def optimizer(grad):
    return w*(-grad)

It looks like your colleague posted the same question here.
Please keep only one topic alive and keep all the answers there.

Hi,

Have you tried running this with a more recent version of PyTorch?
Which result is the expected one?

Hi, I've tried running the code with PyTorch 1.0.0. The printed results are the same as with PyTorch 0.4.1.


Running your code with the latest PyTorch raises:

x-:tensor([0.6000], grad_fn=<AddBackward0>)
x-.grad: None
Traceback (most recent call last):
  File "foo.py", line 28, in <module>
    loss.backward(torch.ones_like(loss), retain_graph=True)
  File "torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The problem is that in old versions, x.grad was updated during the backward pass through the unsafe .data, so the inplace operation was not properly detected.
This is a good example of why the use of .data is dangerous; it should be replaced by .detach() or a with torch.no_grad(): block.
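As a minimal illustration of the detection (a hypothetical snippet, assuming a recent PyTorch, not part of the original code): the version counter catches an inplace change to a tensor that an operation saved for its backward, which is exactly the error class seen in this thread.

```python
import torch

# Multiplication saves its operands so it can compute their gradients later.
a = torch.tensor([2.0], requires_grad=True)
b = a * a                # backward of `*` needs the saved value of `a`

with torch.no_grad():
    a.add_(1.0)          # inplace update; the version counter is bumped

try:
    b.backward()
except RuntimeError:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation" - the same error as in this thread
    print("inplace modification detected")
```

An inplace write through .data would historically not bump the counter, which is why the old versions silently produced wrong gradients instead of this error.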

You can fix your current code by doing this in old versions:

def optimizer(grad):
    return -w*grad.clone()


I want to know why the latest PyTorch reports this error.
Why is the error caused by updating x.grad? I thought it was caused by x = x + update, because loss.backward() computes the grad of x, so I assumed that changing x triggers the error.

Also, why does changing the return to w * (-grad) make the code run?

I fixed the optimizer function as follows and it works, but I'm still confused about why it works:

def optimizer(grad):
    return w*(-grad)

Hi,

The problem is that the multiplication needs the values of its operands to compute the backward pass.
If either operand is modified inplace for any reason, the computed gradient will be wrong.
The old implementation, which used .data for gradient accumulation, did not notify the autograd of the inplace operation, and so the gradients were silently wrong.
The new implementation, which uses torch.no_grad(), does notify the autograd and therefore throws an error.

Both my suggestion with .clone() and your change to compute -grad make a copy of grad before passing it to the multiplication. So when grad is later modified inplace, the copy that the multiplication saved for its backward is unaffected.
Thus the correct gradient is computed.
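A small sketch of why the copy helps (hypothetical values, assuming a recent PyTorch): the multiplication saves the clone, not grad itself, so a later inplace update of grad does not invalidate the backward.

```python
import torch

w = torch.tensor([0.2], requires_grad=True)
grad = torch.ones([1])       # stands in for x.grad

out = -w * grad.clone()      # the clone (not `grad`) is saved for backward
grad.add_(5.0)               # later inplace update, as backward() would do

out.backward()               # still fine: the saved clone is untouched
print(w.grad)                # d(out)/dw = -clone = tensor([-1.])
```

With `out = -w * grad` instead, the multiplication would save `grad` itself, and the `grad.add_(5.0)` would trip the version-counter check at backward time.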

Thanks. I want to make sure my understanding is correct.

1.
In the loop, the first loss is computed by loss = f(x), but every following loss is computed by

update = optimizer(x.grad)
x = x + update
loss = f(x)

Every tensor keeps a version counter. When a Function saves a tensor for backward, it records the tensor's current version, and checks it again during the backward pass.
When loss.backward() runs, the gradients are computed from the bottom up, so x.grad is modified (its version is bumped) before the backward of -w*grad has used its saved copy of x.grad, and the program reports the error.

2.
In the loop, autograd records a new graph every iteration. If I change loss.backward(torch.ones_like(loss), retain_graph=True) to retain_graph=False, the program does not report the error inside the loop. But sum_losses depends on all the subgraphs recorded in the loop, so sum_losses.backward() will report the error.
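The version counter can be made visible through the internal `_version` attribute (a sketch, assuming a recent PyTorch; `_version` is an implementation detail and the exact numbers are not guaranteed):

```python
import torch

t = torch.zeros(1)
print(t._version)            # 0: freshly created tensor
t.add_(1.0)
print(t._version)            # bumped by the inplace add

x = torch.ones([1], requires_grad=True)
loss = (x * x).sum()
loss.backward(retain_graph=True)
v1 = x.grad._version         # version after the first accumulation
loss.backward()              # accumulates into x.grad inplace
v2 = x.grad._version
print(v1, v2)                # v2 > v1: backward modified x.grad inplace
```

This is the same counter mentioned in the error message ("is at version 1; expected version 0 instead").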

Hi,

I'm not sure I understand what you mean here.
The main point is that loss.backward() used to modify x.grad in an unsafe way. So if any operation used x.grad as an input, the wrong behavior you observed would happen.
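The whole failure can be reduced to a few lines (a hypothetical CPU-only sketch of the original code, assuming a recent PyTorch):

```python
import torch

x = torch.ones([1], requires_grad=True)
w = torch.tensor([0.2], requires_grad=True)

loss = (x * x).sum()
loss.backward(retain_graph=True)   # creates x.grad
update = -w * x.grad               # the multiplication saves x.grad
x2 = x + update

loss2 = (x2 * x2).sum()
try:
    # This backward accumulates into x.grad inplace before the
    # multiplication's backward gets to use its saved copy of x.grad.
    loss2.backward()
except RuntimeError as e:
    print("inplace error:", e)
```

Replacing `x.grad` with `x.grad.clone()` in the `update` line makes the snippet run without the error, matching the fix discussed above.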