Gradients altered when adding 0

Inside a function used while training my PINN (which involves first and second derivatives of MLP(x) with respect to x), adding any of the lines below (or any subset of them) alters the gradients tracked for the variable x without altering the values of x.

x = x + 0
x = x + 0.
x = x + torch.tensor([0], dtype=torch.float64)
x = x.view(*x.shape)
x, = x.T[:, :, None]

Variable x has shape [200, 1], dtype=torch.float64, requires_grad=True, and device='cpu'. I am using double precision (torch.set_default_dtype(torch.float64) at the beginning of the main.py script).

How is this possible?

Hi Salvado!

All of those manipulations create a new tensor (that might share storage with the original
tensor). This could cause the gradients to change in a variety of ways depending on what
the rest of your code does.
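
For example, here is a small illustration (not your code, just a sketch) using two of those
operations: each yields a tensor object distinct from the original, even though the values are
unchanged and, in the case of view(), the storage is shared:

import torch

x = torch.rand(200, 1, dtype=torch.float64, requires_grad=True)

y1 = x + 0               # new tensor with its own storage
y2 = x.view(*x.shape)    # new tensor object that shares storage with x

print(y1 is x, y2 is x)                          # False False -- different python objects
print(torch.equal(y1, x), torch.equal(y2, x))    # True True -- same values
print(y1.data_ptr() == x.data_ptr())             # False -- x + 0 copies the data
print(y2.data_ptr() == x.data_ptr())             # True -- view() shares the data
print(x.is_leaf, y1.is_leaf, y2.is_leaf)         # True False False -- only x is a graph leaf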

Please post a super-simple, fully-self-contained, runnable example that reproduces what
you are seeing, together with the result you get when you run it. We need to see the rest
of your (simplified, minimal) code to be able to explain what is happening.

Best.

K. Frank

Dear K. Frank,

Thank you very much for your reply. Here is a minimal example that illustrates what I mean:

import torch

# torch.set_default_dtype(torch.float64)  # same behaviour

torch.manual_seed(0)

x = torch.rand(10, 1).requires_grad_()
u = x
# x = x + 0.  # Comment/uncomment to compare
u = u * x**2
u_x = torch.autograd.grad(u, x,
                          grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
print(torch.mean(u_x).item())
print(torch.mean(3*x**2).item())  # Analytical solution

If I keep the line x = x + 0. commented, the output is:

0.9028419256210327
0.9028419256210327

If I uncomment it, the output is:

0.6018945574760437
0.9028419256210327

Best,

Marc

Hi Salvado!

When you uncomment x = x + 0, you create a new tensor, but reuse the python name x
to refer to it. (Before you execute u = u * x**2, the python name u refers to the original x
tensor, but after you execute that line, you no longer have a reference to the original x tensor.
Autograd, however, keeps a reference to the original x tensor because it is in the computation
graph.)

Your call to autograd.grad() differentiates (the new) u with respect to (the new) x. As far
as this bit of calculus is concerned, the old u doesn’t depend on the new x, so autograd
differentiates constant * x**2 with respect to x, giving constant * 2 * x.

Here is a tweaked version of your example that shows that you get your expected result if you
differentiate u * x**2 with respect to the original x:

import torch
print (torch.__version__)

x = torch.tensor ([3.], requires_grad = True)
u = x                                      # the names u and x refer to the same tensor
v = u * x**2                               # the same as x * x**2
v_x = torch.autograd.grad (v, x)           # compute grad with respect to x -- expected result
print ('v_x[0]:', v_x[0])

x = torch.tensor ([3.], requires_grad = True)
u = x
x = x + 0.                                 # create a new tensor and use the name x to refer to it -- u and x now refer to different tensors
v = u * x**2
v_x = torch.autograd.grad (v, x)           # compute grad with respect to the new tensor -- unexpected result?
print ('v_x[0]:', v_x[0])

x = torch.tensor ([3.], requires_grad = True)
u = x
new_x = x + 0.                             # create a new tensor and use a new name, new_x, to refer to it -- u and x still refer to the same tensor
v = u * new_x**2
v_new_x = torch.autograd.grad (v, new_x)   # compute grad with respect to the new tensor -- this result should be expected   
print ('v_new_x[0]:', v_new_x[0])

x = torch.tensor ([3.], requires_grad = True)
u = x
new_x = x + 0.                             # create a new tensor and use a new name, new_x, to refer to it -- u and x still refer to the same tensor
v = u * new_x**2
v_x = torch.autograd.grad (v, x)           # compute grad with respect to the original tensor -- expected result
print ('v_x[0]:', v_x[0])

And here is its output:

2.9.0+cu130
v_x[0]: tensor([27.])
v_x[0]: tensor([18.])
v_new_x[0]: tensor([18.])
v_x[0]: tensor([27.])

It’s the fourth version where we do create a “new” x, but differentiate with respect to the
original x. (We do this by using a new python name, new_x, to refer to the new x so that
we can continue to use the python name x to refer to the original x, which we then use to
tell autograd to differentiate with respect to the original x.)

Just to underscore what’s going on, let’s look at the third version:

    v_new_x  =  d (u * new_x**2) / d new_x
             =  (d u / d new_x) * new_x**2  +  u * (d new_x**2 / d new_x)

u does not depend (explicitly) on new_x, so d u / d new_x = 0. (Because new_x is a
function of (the original) x – which is u – one could say that u depends implicitly on new_x,
but autograd doesn’t track such implicit dependencies (and you wouldn’t want it to).)

Using d u / d new_x = 0, we get:

    v_new_x  =  u * (d new_x**2 / d new_x)
             =  u * 2 * new_x
             =  x * 2 * x   (using the values of u and new_x, not their functional dependencies)
             =  2 * x**2

So – taking into account that autograd is differentiating with respect to new_x – we see that
autograd is producing the correct result.
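
If you like, you can check this arithmetic directly (a small sketch reusing the third version
above, with x = 3, so that 2 * x**2 = 18):

import torch

x = torch.tensor([3.], requires_grad=True)
u = x
new_x = x + 0.
v = u * new_x**2
v_new_x = torch.autograd.grad(v, new_x)[0]

print(v_new_x.item())       # 18.0 -- what autograd computes
print((2 * x**2).item())    # 18.0 -- u * 2 * new_x evaluated at x = 3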

Best.

K. Frank

Hi K. Frank,

Thank you very much for your explanation. It is all clear now.

Best regards,

Marc

Hi K. Frank,

I am now finding a similar problem, but without the intermediate variable new_x. The following code produces different values depending on which of the two marked lines is uncommented:

import torch

torch.manual_seed(0)

x = torch.rand(10, 1).requires_grad_()

# Uncomment one of the following two lines to compare:
# u = x * x * (1. - x)              # Further output: -0.9405761957168579
# u = x * (x - 0.) * (1. - x)       # Further output: -0.9405760765075684

u_x = torch.autograd.grad(u, x,
                          grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
u_xx = torch.autograd.grad(u_x, x,
                           grad_outputs=torch.ones_like(u_x),
                           create_graph=True)[0]

print(torch.mean(u_xx).item())
print(torch.mean(2 - 6*x).item())  # -0.9405760765075684

In this case, the analytical second derivative is equal to the result of differentiating x * (x - 0.) * (1. - x) twice, but not to the result of differentiating x * x * (1. - x) twice. Is that due to numerical error, or am I still missing something?

Adding torch.set_default_dtype(torch.float64) fixes this example, but not my original code, where u = model(x) instead of u = x, and which runs for several epochs.

Thank you for your help and best regards,

Marc

Hi Marc!

Yes, these two results differ by an amount typical of single-precision round-off error, so this
can reasonably be expected.
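
As a quick sanity check (a small sketch using the two numbers from your code comments), the
relative difference between them is on the order of single-precision machine epsilon:

import torch

a = -0.9405761957168579   # from u = x * x * (1. - x)
b = -0.9405760765075684   # from u = x * (x - 0.) * (1. - x)

print(abs(a - b) / abs(b))                 # roughly 1.3e-07
print(torch.finfo(torch.float32).eps)      # roughly 1.2e-07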

The fact that switching to double precision makes the discrepancy in your example vanish
further demonstrates that this is just round-off error. When you switch from single precision
to double precision and see your discrepancy reduced to a size typical of double precision, it’s
good evidence that you’re seeing the effects of round-off error. (The discrepancy you see using
double precision can certainly, by happenstance, turn out to be zero.)

The fact that double precision does not fix your original training code is also to be expected.
When you run a short computation, you will normally only see discrepancies consistent with
the precision you are using (although “unstable” or ill-conditioned computations can amplify
them). However, when you run a longer computation, round-off error can accumulate.
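
Here is a small sketch (unrelated to autograd, just plain arithmetic) of how round-off can
accumulate over many operations: a naive running sum of many values in single precision
drifts away from a double-precision reference by noticeably more than one round-off error:

import torch

torch.manual_seed(0)
vals = torch.rand(10_000, dtype=torch.float64)

ref = vals.sum().item()                    # double-precision reference

acc = torch.tensor(0., dtype=torch.float32)
for v in vals:                             # naive running sum in single precision
    acc = acc + v.to(torch.float32)

print(abs(acc.item() - ref) / abs(ref))    # typically an order of magnitude larger than eps
print(torch.finfo(torch.float32).eps)      # roughly 1.2e-07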

When you train a model, you are moving along a path in the parameter space of the model.
A little bit of round-off error after the first step will cause the next step to go in a slightly
different direction. After a while, the paths you take will diverge (especially after you train
for multiple epochs) and you’ll get results that are just different – not just within some multiple
of round-off error of one another. (A similar effect occurs when you train a model starting
with different – but statistically equivalent – random initializations.)
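
Here is a small sketch of that sensitivity (not a training loop, just a simple iterated
computation that amplifies small differences): two starting values that differ by roughly one
single-precision round-off error end up on unrelated trajectories after a few dozen iterations:

import torch

a = torch.tensor(0.3, dtype=torch.float32)
b = a + torch.finfo(torch.float32).eps     # perturb by roughly one round-off error

for step in range(1, 61):
    a = 3.9 * a * (1. - a)                 # apply the same update to both values
    b = 3.9 * b * (1. - b)
    if step % 10 == 0:
        print(step, abs((a - b).item()))   # the gap grows rapidly toward order 1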

It’s not that one training path or result is right and the other is wrong – they’re just different
and are showing different, but broadly equivalent, effects of finite numerical precision that
have accumulated over the course of a lengthy computation.

Best.

K. Frank

Hi K. Frank,

Thank you very much for your detailed explanation. It was very useful.

Best regards,

Marc