Why are the gradients given by PyTorch 0.4.0 and 0.4.1 different after calling backward?

import torch
import torch.nn as nn

x = torch.ones([1], requires_grad=True)
w = torch.tensor([0.2], requires_grad=True)
print('x====: {}'.format(x))
print('w====: {}'.format(w))

def f(x):
    x = x.cuda()
    return torch.pow(x, 2).sum()
    # return x*x*x.sum()

def SGD(grad, lr=0.2):
    return -lr*grad

def optimizer(grad):
    return -w*grad

sum_losses = 0

for i in range(2):

    loss = f(x)
    # print(i, loss)

    sum_losses += loss
    loss.backward(torch.ones_like(loss), retain_graph=True)
    print('x.grad: {}'.format(x.grad))
    print('w1.grad: {}'.format(w.grad))

    update = optimizer(x.grad)
    x = x + update
    print('x-:{}'.format(x))
    print('x-.grad: {}'.format(x.grad))

    x.retain_grad()
    update.retain_grad()

sum_losses.backward()
print('w.grad: {}'.format(w.grad))

w_update = SGD(w.grad, lr=0.1)
w = w + w_update
print('w====: {}'.format(w))

PyTorch 0.4.1 prints the following:

x====: tensor([1.], requires_grad=True)
w====: tensor([0.2000], requires_grad=True)
x.grad: tensor([2.])
w1.grad: None
x-:tensor([0.6000], grad_fn=<ThAddBackward>)
x-.grad: None
x.grad: tensor([1.2000])
w1.grad: tensor([-3.8400])
x-:tensor([0.3600], grad_fn=<ThAddBackward>)
x-.grad: None
w.grad: tensor([-7.6800])
w====: tensor([0.9680], grad_fn=<ThAddBackward>)

PyTorch 0.4.0 prints the following:

x====: tensor([ 1.])
w====: tensor([ 0.2000])
x.grad: tensor([ 2.])
w1.grad: None
x-:tensor([ 0.6000])
x-.grad: None
x.grad: tensor([ 1.2000])
w1.grad: tensor([-2.4000])
x-:tensor([ 0.3600])
x-.grad: None
w.grad: tensor([-6.2400])
w====: tensor([ 0.8240])

I changed the optimizer function as follows and the problem was solved, but I'm still confused:

def optimizer(grad):
    return w*(-grad)

It looks like your colleague posted the same question here.
Please keep only one topic alive and keep all the answers there.

Hi,

Have you tried running this with a more recent version of PyTorch?
Which result is the expected one?

Hi, I've tried running the code with PyTorch 1.0.0. The printed results are the same as with PyTorch 0.4.1.


Running your code with the latest PyTorch raises:

x-:tensor([0.6000], grad_fn=<AddBackward0>)
x-.grad: None
Traceback (most recent call last):
  File "foo.py", line 28, in <module>
    loss.backward(torch.ones_like(loss), retain_graph=True)
  File "torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The problem is that in old versions, x.grad was updated during the backward pass through the unsafe .data, so the inplace operation was not properly detected.
This is a good example of why the use of .data is dangerous; it should be replaced by .detach() or a with torch.no_grad(): block.
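As a minimal illustration of the detection (a hypothetical snippet, assuming a recent PyTorch, not part of the original code): the version counter catches an inplace change to a tensor that an operation saved for its backward, which is exactly the error class seen in this thread.

```python
import torch

# Multiplication saves its operands so it can compute their gradients later.
a = torch.tensor([2.0], requires_grad=True)
b = a * a                # backward of `*` needs the saved value of `a`

with torch.no_grad():
    a.add_(1.0)          # inplace update; the version counter is bumped

try:
    b.backward()
except RuntimeError:
    # "one of the variables needed for gradient computation has been
    # modified by an inplace operation" - the same error as in this thread
    print("inplace modification detected")
```

An inplace write through .data would historically not bump the counter, which is why the old versions silently produced wrong gradients instead of this error.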

You can fix your current code by doing this in old versions:

def optimizer(grad):
    return -w*grad.clone()


I want to know why the latest PyTorch reports this error.
Why is the error caused by updating x.grad? I thought it was caused by x = x + update, because loss.backward() computes the grad of x, so I assumed that changing x triggers the error.

Also, why does changing the return to w * (-grad) make the code run?

I fixed the optimizer function as follows and it works, but I'm still confused about why it works:

def optimizer(grad):
    return w*(-grad)

Hi,

The problem is that the multiplication needs the values of its operands to compute the backward pass.
If either operand is modified inplace for any reason, the computed gradient will be wrong.
The old implementation, which used .data for gradient accumulation, did not notify the autograd of the inplace operation, and so the gradients were silently wrong.
The new implementation, which uses torch.no_grad(), does notify the autograd and therefore throws an error.

Both my suggestion with .clone() and your change to compute -grad make a copy of grad before passing it to the multiplication. So when grad is later modified inplace, the copy that the multiplication saved for its backward is unaffected.
Thus the correct gradient is computed.
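A small sketch of why the copy helps (hypothetical values, assuming a recent PyTorch): the multiplication saves the clone, not grad itself, so a later inplace update of grad does not invalidate the backward.

```python
import torch

w = torch.tensor([0.2], requires_grad=True)
grad = torch.ones([1])       # stands in for x.grad

out = -w * grad.clone()      # the clone (not `grad`) is saved for backward
grad.add_(5.0)               # later inplace update, as backward() would do

out.backward()               # still fine: the saved clone is untouched
print(w.grad)                # d(out)/dw = -clone = tensor([-1.])
```

With `out = -w * grad` instead, the multiplication would save `grad` itself, and the `grad.add_(5.0)` would trip the version-counter check at backward time.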

Thanks. I want to make sure my understanding is correct.

1.
In the loop, the first loss is computed by loss = f(x), but every following loss is computed by

update = optimizer(x.grad)
x = x + update
loss = f(x)

Every tensor keeps a version counter. When a Function saves a tensor for backward, it records the tensor's current version, and checks it again during the backward pass.
When loss.backward() runs, the gradients are computed from the bottom up, so x.grad is modified (its version is bumped) before the backward of -w*grad has used its saved copy of x.grad, and the program reports the error.

2.
In the loop, autograd records a new graph every iteration. If I change loss.backward(torch.ones_like(loss), retain_graph=True) to retain_graph=False, the program does not report the error inside the loop. But sum_losses depends on all the subgraphs recorded in the loop, so sum_losses.backward() will report the error.
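The version counter can be made visible through the internal `_version` attribute (a sketch, assuming a recent PyTorch; `_version` is an implementation detail and the exact numbers are not guaranteed):

```python
import torch

t = torch.zeros(1)
print(t._version)            # 0: freshly created tensor
t.add_(1.0)
print(t._version)            # bumped by the inplace add

x = torch.ones([1], requires_grad=True)
loss = (x * x).sum()
loss.backward(retain_graph=True)
v1 = x.grad._version         # version after the first accumulation
loss.backward()              # accumulates into x.grad inplace
v2 = x.grad._version
print(v1, v2)                # v2 > v1: backward modified x.grad inplace
```

This is the same counter mentioned in the error message ("is at version 1; expected version 0 instead").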

Hi,

I'm not sure I understand what you mean here.
The main point is that loss.backward() used to modify x.grad in an unsafe way. So if any operation used x.grad as an input, the wrong behavior you observed would happen.
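The whole failure can be reduced to a few lines (a hypothetical CPU-only sketch of the original code, assuming a recent PyTorch):

```python
import torch

x = torch.ones([1], requires_grad=True)
w = torch.tensor([0.2], requires_grad=True)

loss = (x * x).sum()
loss.backward(retain_graph=True)   # creates x.grad
update = -w * x.grad               # the multiplication saves x.grad
x2 = x + update

loss2 = (x2 * x2).sum()
try:
    # This backward accumulates into x.grad inplace before the
    # multiplication's backward gets to use its saved copy of x.grad.
    loss2.backward()
except RuntimeError as e:
    print("inplace error:", e)
```

Replacing `x.grad` with `x.grad.clone()` in the `update` line makes the snippet run without the error, matching the fix discussed above.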