Why the gradient values seem to be reversed in Tensor.backward()

I am trying to print the gradient values of three tensors, but the printed gradients do not match my manual calculation: specifically, the gradient of a appears to be swapped with the gradient of c. Check the code below.

import torch
a = torch.tensor(3.0,requires_grad=True)
b = a*2
c = b ** 2

b.retain_grad() # non-leaf tensors only keep .grad if retain_grad() is called
c.retain_grad()
c.backward() # Computes the gradient of current tensor wrt graph leaves.
print(a.grad)
print(b.grad)
print(c.grad)

Here is the output:

tensor(24.)
tensor(12.)
tensor(1.)

These results seem expected to me.

c = (a*2)**2 = 4*a**2
dc/da = 8a = 8(3) = 24

So you’ve calculated dc/da, which is the gradient value for c, and I agree that 24 is the correct answer. However, when I print c.grad the output is 1, and confusingly a.grad prints 24. Do you see where I am confused?

Yes, that would be strange.
From your original post, though, I see that the prints are correct?

They don’t seem to be correct, in the sense that they don’t match the manual calculation you and I did. Here is what I am getting:
print(a.grad) → tensor(24.)
print(b.grad) → tensor(12.)
print(c.grad) → tensor(1.)

So you’ve calculated dc/da which is the gradient value for c,

a.grad actually means dc/da (and c.grad means dc/dc, which is 1)
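
One way to see all three of these without calling retain_grad() is torch.autograd.grad — a minimal sketch, reusing your a, b, c:

import torch

a = torch.tensor(3.0, requires_grad=True)
b = a * 2
c = b ** 2

# Gradients of c with respect to a and b (dc/dc is trivially 1)
dc_da, dc_db = torch.autograd.grad(c, (a, b))
print(dc_da)  # tensor(24.)  i.e. dc/da
print(dc_db)  # tensor(12.)  i.e. dc/db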

When using .backward() we are doing backprop, not forward-prop (which PyTorch also supports; see “Forward-mode Automatic Differentiation (Beta)” in the PyTorch tutorials)
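
For contrast, a rough sketch of forward mode using the torch.autograd.forward_ad API from that tutorial — here a tangent is pushed forward from a, rather than a gradient being pulled back from c:

import torch
import torch.autograd.forward_ad as fwAD

a = torch.tensor(3.0)
with fwAD.dual_level():
    # Seed a with a tangent of 1.0 and push it forward through the graph
    dual_a = fwAD.make_dual(a, torch.tensor(1.0))
    dual_c = (dual_a * 2) ** 2
    print(fwAD.unpack_dual(dual_c).tangent)  # tensor(24.), i.e. dc/da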

That makes sense and resolves the confusion!

a.grad actually means dc/da (and c.grad means dc/dc, which is 1)

Is there anywhere in the documentation that mentions this? I never came across it before.

Not sure it is mentioned explicitly, possibly because c.retain_grad(); c.backward() is relatively rare to do (e.g. loss.grad is never populated), so it is usually harder to get confused.

The common case is more like: I have a loss and many parameters, and it’s obvious that the gradient of the loss w.r.t. each of the params is stored in each param’s .grad field.
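
Something like this (a made-up minimal example; w and x are just placeholders):

import torch

w = torch.tensor(2.0, requires_grad=True)  # a parameter (leaf tensor)
x = torch.tensor(5.0)                      # some input data
loss = (w * x - 1.0) ** 2                  # a scalar loss

loss.backward()
print(w.grad)     # tensor(90.) -> d(loss)/dw, stored on the parameter
print(loss.grad)  # None (loss is a non-leaf and retain_grad() was not called)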

Probably. I wrote the above code just to understand how the gradients are calculated in PyTorch.

This was a helpful discussion, I appreciate your insights!