Inside a function used when training my PINN (which involves first and second derivatives of MLP(x) with respect to x), adding any of the lines below (or any subset of them) changes the gradients computed for x without changing the values stored in x.
x = x + 0
x = x + 0.
x = x + torch.tensor([0], dtype=torch.float64)
x = x.view(*x.shape)
x, = x.T[:, :, None]
Variable x has shape [200, 1], dtype=torch.float64, requires_grad=True, and device='cpu'. I am using double precision (torch.set_default_dtype(torch.float64) at the beginning of the main.py script).
All of those manipulations create a new tensor (that might share storage with the original
tensor). This could cause the gradients to change in a variety of ways depending on what
the rest of your code does.
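Just to illustrate that first point (a minimal sketch, not taken from your code): x + 0 allocates a brand-new tensor with its own storage, while x.view(*x.shape) returns a new tensor object that shares storage with x, but either way the result is a different Python object than the original x:

import torch

x = torch.rand(3, 1, dtype=torch.float64, requires_grad=True)
y = x + 0             # new tensor, new storage
z = x.view(*x.shape)  # new tensor object, shared storage

print(y is x, y.data_ptr() == x.data_ptr())   # False False
print(z is x, z.data_ptr() == x.data_ptr())   # False True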
Please post a super-simple, fully-self-contained, runnable example that reproduces what
you are seeing, together with the result you get when you run it. We need to see the rest
of your (simplified, minimal) code to be able to explain what is happening.
Thank you very much for your reply. Here is a minimal example that illustrates what I mean:
import torch
# torch.set_default_dtype(torch.float64) # same behaviour
torch.manual_seed(0)
x = torch.rand(10, 1).requires_grad_()
u = x
# x = x + 0. # Comment/uncomment to compare
u = u * x**2
u_x = torch.autograd.grad(u, x,
                          grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
print(torch.mean(u_x).item())
print(torch.mean(3*x**2).item()) # Analytical solution
If I keep the line x = x + 0. commented, the two printed values agree; when I uncomment it, they no longer match.
When you uncomment x = x + 0, you create a new tensor, but reuse the python name x
to refer to it. (Before you execute u = u * x**2, the python name u refers to the original x
tensor, but after you execute that line, you no longer have a reference to the original x tensor.
Autograd, however, keeps a reference to the original x tensor because it is in the computation
graph.)
Your call to autograd.grad() differentiates (the new) u with respect to (the new) x. As far
as this bit of calculus is concerned, the old u doesn’t depend on the new x, so autograd
differentiates constant * x**2 with respect to x, giving constant * 2 * x.
Here is a tweaked version of your example that shows that you get your expected result if you differentiate u * x**2 with respect to the original x:
import torch
print (torch.__version__)
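
# first version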
x = torch.tensor ([3.], requires_grad = True)
u = x # the names u and x refer to the same tensor
v = u * x**2 # the same as x * x**2
v_x = torch.autograd.grad (v, x) # compute grad with respect to x -- expected result
print ('v_x[0]:', v_x[0])
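
# second version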
x = torch.tensor ([3.], requires_grad = True)
u = x
x = x + 0. # create a new tensor and use the name x to refer to it -- u and x now refer to different tensors
v = u * x**2
v_x = torch.autograd.grad (v, x) # compute grad with respect to the new tensor -- unexpected result?
print ('v_x[0]:', v_x[0])
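
# third version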
x = torch.tensor ([3.], requires_grad = True)
u = x
new_x = x + 0. # create a new tensor and use a new name, new_x, to refer to it -- u and x still refer to the same tensor
v = u * new_x**2
v_new_x = torch.autograd.grad (v, new_x) # compute grad with respect to the new tensor -- this result should be expected
print ('v_new_x[0]:', v_new_x[0])
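
# fourth version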
x = torch.tensor ([3.], requires_grad = True)
u = x
new_x = x + 0. # create a new tensor and use a new name, new_x, to refer to it -- u and x still refer to the same tensor
v = u * new_x**2
v_x = torch.autograd.grad (v, x) # compute grad with respect to the original tensor -- expected result
print ('v_x[0]:', v_x[0])
It’s the fourth version where we do create a “new” x, but differentiate with respect to the original x. (We do this by using a new python name, new_x, to refer to the new x so that we can continue to use the python name x to refer to the original x, which we then use to tell autograd to differentiate with respect to the original x.)
Just to underscore what’s going on, let’s look at the third version:
v_new_x = d (u * new_x**2) / d new_x
= (d u / d new_x) * new_x**2 + u * (d new_x**2 / d new_x)
u does not depend (explicitly) on new_x, so d u / d new_x = 0. (Because new_x is a function of (the original) x, which is u, you could say that u depends implicitly on new_x, but autograd doesn’t track such implicit dependencies (and you wouldn’t want it to).)
Using d u / d new_x = 0, we get:
v_new_x = u * (d new_x**2 / d new_x)
= u * 2 * new_x
= x * 2 * x (using the values of u and new_x, not their functional dependencies)
= 2 * x**2
So – taking into account that autograd is differentiating with respect to new_x – we see that
autograd is producing the correct result.
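Plugging in x = 3. makes the difference concrete: the second and third versions give 2 * x**2 = 18., while the first and fourth versions give 3 * x**2 = 27.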
I am now running into a similar problem, but this time without any intermediate variable like new_x. The following code produces different values depending on which of the two lines below is uncommented:
import torch
torch.manual_seed(0)
x = torch.rand(10, 1).requires_grad_()
# Compare the following two lines (uncomment one at a time):
# u = x * x * (1. - x)           # first print below gives -0.9405761957168579
# u = x * (x - 0.) * (1. - x)    # first print below gives -0.9405760765075684
u_x = torch.autograd.grad(u, x,
                          grad_outputs=torch.ones_like(u),
                          create_graph=True)[0]
u_xx = torch.autograd.grad(u_x, x,
                           grad_outputs=torch.ones_like(u_x),
                           create_graph=True)[0]
print(torch.mean(u_xx).item())
print(torch.mean(2 - 6*x).item()) # -0.9405760765075684
In this case, the analytical second derivative matches the result of differentiating x * (x - 0.) * (1. - x) twice, but not the result of differentiating x * x * (1. - x) twice. Is that due to numerical error, or am I still missing something?
Adding torch.set_default_dtype(torch.float64) fixes this example, but not my original code, where u = model(x) instead of u = x and where training runs for several epochs.
Yes, these two results differ by typical single-precision round-off error, so this can be
reasonably expected.
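The underlying reason is that floating-point arithmetic is not associative, so two mathematically equivalent ways of organizing a computation can round differently. A tiny single-precision sketch of that general point (not your computation, just an illustration):

import torch

a = torch.tensor(1.0e8, dtype=torch.float32)
b = torch.tensor(-1.0e8, dtype=torch.float32)
c = torch.tensor(1.0, dtype=torch.float32)

# Mathematically both expressions equal 1.0, but the rounding happens in a different order.
print(((a + b) + c).item())   # 1.0
print((a + (b + c)).item())   # 0.0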
This further demonstrates that this is just round-off error. When you switch from single precision to double precision and see your discrepancy reduced to that typical of double precision, it’s good evidence that you’re seeing the effects of round-off error. (The discrepancy you see using double precision can certainly, by happenstance, turn out to be zero.)
This is also to be expected. When you run a short computation, you will normally only see
discrepancies consistent with the precision you are using (but “unstable” or ill-conditioned
computations can amplify them). However, when you run a longer computation, round-off
error can accumulate.
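As a rough sketch of that accumulation effect (again just an illustration, not your computation): summing a long vector of single-precision numbers in two different orders typically gives slightly different answers, and comparing against a double-precision reference shows the accumulated single-precision error:

import torch

torch.manual_seed(0)
v = torch.rand(1_000_000)            # single precision
reference = v.double().sum()         # double-precision reference

s_forward = v.sum()
s_reversed = v.flip(0).sum()

print((s_forward - s_reversed).item())          # typically small but nonzero
print((s_forward.double() - reference).item())  # accumulated single-precision error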
When you train a model, you are moving along a path in the parameter space of the model.
A little bit of round-off error after the first step will cause the next step to go in a slightly
different direction. After a while, the paths you take will diverge (especially after you train
for multiple epochs) and you’ll get results that are just different – not just within some multiple
of round-off error of one another. (A similar effect occurs when you train a model starting
with different – but statistically equivalent – random initializations.)
It’s not that one training path or result is right and the other is wrong – they’re just different
and are showing different, but broadly equivalent, effects of finite numerical precision that
have accumulated over the course of a lengthy computation.