How to get gradients that still have requires_grad=True

Let w and phi be two parameters

import torch as T
from torch.nn import Parameter

w = Parameter(T.tensor([2.2]))
phi = Parameter(T.tensor([1.5]))
wp = w * phi
wp.backward()
grd = phi.grad   # d(wp)/d(phi) = w = 2.2, but detached from the graph
print(grd)

Printed:

tensor([2.2000])

I want:

tensor([2.2000], requires_grad=True)

i.e. I want phi.grad (which here equals w, a parameter of a larger network) to have requires_grad=True so that I can do

grd.backward()
w.grad

I don’t know how to separate these two computation graphs.

If I’m understanding correctly, you want to compute second-order gradients with respect to w, i.e. you’d like phi.grad itself to carry an autograd graph that reaches back to w (and thus have requires_grad=True).

You could do this by calling wp.backward(create_graph=True).
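
For reference, a minimal sketch of that flow (assuming the same T and Parameter aliases as in your snippet). Note that the first backward also accumulates d(wp)/dw = phi into w.grad, so I clear it before differentiating phi.grad:

import torch as T
from torch.nn import Parameter

w = Parameter(T.tensor([2.2]))
phi = Parameter(T.tensor([1.5]))
wp = w * phi

wp.backward(create_graph=True)   # build a graph for the backward pass as well
print(phi.grad)                  # tensor([2.2000], grad_fn=...), i.e. differentiable

w.grad = None                    # drop the d(wp)/dw that was just accumulated
phi.grad.backward()              # differentiate phi.grad (numerically equal to w) w.r.t. w
print(w.grad)                    # tensor([1.])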

Thanks for your reply.

I actually want the gradient of wp only with respect to phi, so this worked for me:

w = Parameter(T.tensor([2.2]))
phi = Parameter(T.tensor([1.5]))
wp = w * phi
grd = T.autograd.grad(wp, phi, create_graph=True)[0]   # d(wp)/d(phi), kept differentiable
print(grd)
grd.backward()   # backprop through grd, accumulating into w.grad
print(w.grad)

output:

tensor([2.2000], grad_fn=<MulBackward0>)
tensor([1.])

Using a modified version of the last method:

w = Parameter(T.tensor([2.2]))
phi = Parameter(T.tensor([1.5]))
wp = w * phi
wp.backward(create_graph=True)   # populates w.grad and phi.grad, keeping the backward graph
grd = phi.grad
print(grd)
grd.backward()
print(w.grad)

output:

tensor([2.2000], grad_fn=<CopyBackwards>)
tensor([2.5000], grad_fn=<CopyBackwards>)

I don’t know what is going on with the last method.
Also, I found a quote which suggests against using .grad in such cases.

I believe the quote is saying that .backward() is hard to reason about (not .grad()). autograd.grad() is actually the preferred alternative because (by default) it’s more explicit about which inputs it’s computing gradients for. It also returns the gradient instead of performing a side effect like updating .grad.
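
As a side note, and if I'm reading the last snippet correctly, the tensor([2.5000]) comes from gradient accumulation: wp.backward(create_graph=True) already wrote d(wp)/dw = phi = 1.5 into w.grad, and grd.backward() then added d(grd)/dw = 1.0 on top. A minimal sketch of the autograd.grad-only version, which never writes to .grad and so avoids that accumulation (same T and Parameter aliases as above):

import torch as T
from torch.nn import Parameter

w = Parameter(T.tensor([2.2]))
phi = Parameter(T.tensor([1.5]))
wp = w * phi

# First-order gradient of wp w.r.t. phi, kept differentiable.
grd = T.autograd.grad(wp, phi, create_graph=True)[0]
print(grd)    # tensor([2.2000], grad_fn=...)

# Second-order gradient d(grd)/dw, returned directly rather than
# accumulated into w.grad.
ggrd = T.autograd.grad(grd, w)[0]
print(ggrd)   # tensor([1.])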