Why can’t I see .grad of an intermediate variable?

Kalamaya · January 21, 2017, 1:34am

Hi! I am loving this framework…

Since I’m a noob, I am probably missing something, but why can’t I get the gradient of an intermediate variable with .grad? Here is an example of what I mean:

import torch
from torch.autograd import Variable

xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz = yy**2
zz.backward()
xx.grad # This is ok
yy.grad # This gives 0!
zz.grad # This should give 1!

So I get the correct result for xx.grad, but why does yy.grad show 0, as does zz.grad? How can I get the yy.grad value in this case?

Thanks!

smth · January 21, 2017, 4:10am

Hi Kalamaya,

By default, gradients are only retained for leaf variables. Non-leaf variables’ gradients are not retained for later inspection. This was done by design, to save memory.

However, you can inspect and extract the gradients of the intermediate variables via hooks.
You can register a function on a Variable that will be called when the backward of the variable is being processed.

More documentation on hooks is here: http://pytorch.org/docs/autograd.html#torch.autograd.Variable.register_hook

Here’s an example that registers the print function as a hook on the variable yy to print out its gradient (you can also define your own function that copies the gradient elsewhere or modifies it, for example):

from __future__ import print_function
from torch.autograd import Variable
import torch

xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz = yy**2

yy.register_hook(print)
zz.backward()

Output:

Variable containing:
-3.2480
[torch.FloatTensor of size 1x1]

Kalamaya · January 21, 2017, 8:06pm

Thanks @smth

The only way I have been able to really extract the gradient however is via a global variable at the moment. This is because the function I pass in (apparently) only allows me to pass in one argument, and that is reserved for the yy.grad. What I mean is given here:

yGrad = torch.zeros(1,1)
def extract(xVar):
    global yGrad
    yGrad = xVar

xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz = yy**2

yy.register_hook(extract)

#### Run the backprop:
print(yGrad) # Shows 0.
zz.backward()
print(yGrad) # Shows the correct dzdy

So here, I am able to extract the yy.grad, BUT, I can only do so with a global variable, which I would rather not do. Is there a simpler way? Many thanks.

mrdrozdov · January 21, 2017, 8:36pm

Might help to take a look at how optimizers update parameters using the gradient. For instance, this line / block of code in SGD. https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py#L45

apaszke · January 21, 2017, 10:06pm

@mrdrozdov I don’t think this applies to this use case, because optimizers always work with leaf Variables.

@Kalamaya Is there any reason why using a closure is not acceptable? If you can give me some more details about your use case, and why you need the intermediate gradient, I could probably suggest some other way.

Kalamaya · January 21, 2017, 10:18pm

@apaszke Ok, I am not familiar with closures (still learning Python), but from the googling I just did, it sounds like they would be acceptable for my solution. How would we use closures in this case? The examples I saw all have nested functions and I am still not seeing the connection… many thanks!!

apaszke · January 21, 2017, 10:32pm

You can think of a closure as a function that also keeps some additional variables from the outer scope. For example, here hook is a closure that remembers the name given to the outer function:

grads = {}
def save_grad(name):
    def hook(grad):
        grads[name] = grad
    return hook

x = Variable(torch.randn(1,1), requires_grad=True)
y = 3*x
z = y**2

# In here, save_grad('y') returns a hook (a function) that keeps 'y' as name
y.register_hook(save_grad('y'))
z.register_hook(save_grad('z'))
z.backward()

print(grads['y'])
print(grads['z'])

Kalamaya · January 21, 2017, 10:52pm

Many thanks! I will process this and let you know how it goes!

Just for my own knowledge, am I to understand that, given what I am trying to do, the only ways we have are i) global variables, and ii) closures?

Thanks again.

apaszke · January 21, 2017, 10:59pm

I’d say that these are the most obvious ways, but you could probably come up with more sophisticated solutions too. As I said, the best one depends on the specific use case, and it’s hard to provide one that fits all. I find using closures like above to be ok; others will find something else better.
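
As one further option besides globals and closures, functools.partial from the standard library can bind the extra arguments ahead of time, so the hook still receives only the gradient. This is a minimal sketch, assuming a recent PyTorch release where tensors take requires_grad directly (no Variable wrapper); the names save_grad and grads are only illustrative:

```python
from functools import partial

import torch


def save_grad(store, name, grad):
    # store and name are bound in advance by partial; autograd supplies grad.
    store[name] = grad


grads = {}

xx = torch.randn(1, 1, requires_grad=True)
yy = 3 * xx
zz = yy ** 2

# partial(...) returns a one-argument callable, which is what register_hook expects.
yy.register_hook(partial(save_grad, grads, "yy"))
zz.backward()

print(grads["yy"])  # dz/dy = 2 * yy
```

Whether this reads better than a closure is mostly a matter of taste; both avoid the global statement.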

Kalamaya · January 21, 2017, 11:21pm

Thanks again. I will process it tonight and reply back here for my exact use case. Thanks again!

EvanZ · March 29, 2017, 3:58am

Aren’t the gradients of internal nodes necessary for doing backprop?

fmassa · March 29, 2017, 4:19pm

Yes, they are, but as soon as they have been used and are no longer needed, they are freed to save memory.
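
To see the difference between leaf and non-leaf variables in practice, a small check along these lines may help. It is only a sketch, and it assumes a recent PyTorch release, where tensors take requires_grad directly and an unretained intermediate gradient reads as None rather than 0:

```python
import warnings

import torch

xx = torch.randn(1, 1, requires_grad=True)  # leaf: created directly by the user
yy = 3 * xx                                 # non-leaf: result of an operation
zz = yy ** 2
zz.backward()

print(xx.is_leaf, yy.is_leaf)  # True False
print(xx.grad)                 # retained, equals 18 * xx

# Reading .grad on a non-leaf emits a UserWarning on recent releases,
# so it is silenced here to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    print(yy.grad)             # None: the intermediate gradient was not kept
```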

vitchyr · June 14, 2017, 3:10am

While I understand why this design decision was made, are there any plans to make it easier to save the gradients of intermediate variables? For example, it’d be nice if something like this was supported:

from torch.autograd import Variable
import torch

xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
yy.require_grad = True  # <-- Override default behavior
zz = yy**2

zz.backward()

# do something with yy.grad

It seems like it’d be easier to let variables keep track of their own gradients rather than having to keep track of them with my own closures. Then if I want to analyze the gradients of my variables (leaf or not), I can do something like

do_something_with_data_and_grad_of(xx)
do_something_with_data_and_grad_of(yy)

Also, it might be useful to be able to set requires_grad for intermediate variables. For example, I might want to plot a histogram of intermediate variable gradients while not needing gradients for upstream variables. Right now, I’d have to set the requires_grad flag to True on upstream nodes just to make sure that the gradients for this intermediate node are computed, but that seems a bit wasteful.

chenjus · June 27, 2017, 12:20am

Is it possible to get the gradients of a torch.nn.Linear module using the way you suggested or am I limited to capturing gradients by defining Variables? Would this work for convolutions or recurrent layers?

miguelvr · June 27, 2017, 1:18pm

Is it possible to create a (torch.autograd) flag in order to save all the variables’ gradients?

yusaku · August 10, 2017, 11:54am

Looks like PyTorch 0.2.0 now has Variable.retain_grad(): http://pytorch.org/docs/master/autograd.html?highlight=retain_grad#torch.autograd.Variable.retain_grad

The above could now be done via
yy.retain_grad()
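
For completeness, here is roughly how the original example looks with retain_grad(). This is a sketch that assumes a recent PyTorch release, where plain tensors replace the Variable wrapper:

```python
import torch

xx = torch.randn(1, 1, requires_grad=True)
yy = 3 * xx
yy.retain_grad()  # ask autograd to keep yy.grad after backward
zz = yy ** 2
zz.backward()

print(yy.grad)  # dz/dy = 2 * yy
print(xx.grad)  # dz/dx = 18 * xx
```

Note that retain_grad() has to be called before backward(); calling it afterwards does not recover a gradient that was already freed.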

blackyang · October 21, 2017, 12:05am

Hi @smth, thanks for your reply. I have another question: suppose there are two heads on top of yy. How can we get the grad_output coming from just one of them, instead of their sum?

For example, how can I get yy’s grad_output from the zz1 branch only?

xx = Variable(torch.randn(1,1), requires_grad = True)
yy = 3*xx
zz1 = yy**2
zz2 = yy**2

yy.register_hook(print)
(zz1+zz2).backward()

SimonW · October 21, 2017, 1:49am

Ha, I recently did exactly this. Not sure if it’s the best way, but I did:

1. Detach yy before feeding it to the zzs, e.g. yyy = yy.detach().
2. Manually call autograd.grad to get each zz’s gradient w.r.t. yyy.
3. Save the one you want.
4. Call yy.backward(grad_to_yyy_1 + grad_to_yyy_2).
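
One possible reading of the steps above, written out as a sketch. It assumes a recent PyTorch release (plain tensors, requires_grad_() on the detached copy), and the second head is changed to yyy ** 3 purely so that the two gradients are distinguishable:

```python
import torch

xx = torch.randn(1, 1, requires_grad=True)
yy = 3 * xx

# Step 1: cut the graph at yy. yyy holds the same values but is a fresh leaf.
yyy = yy.detach().requires_grad_()
zz1 = yyy ** 2
zz2 = yyy ** 3  # a different second head, chosen only for illustration

# Step 2: gradient of each head with respect to the detached copy.
(grad_to_yyy_1,) = torch.autograd.grad(zz1, yyy)
(grad_to_yyy_2,) = torch.autograd.grad(zz2, yyy)

# Step 3: keep whichever per-head gradient is needed.
print(grad_to_yyy_1)  # 2 * yy, from zz1 only

# Step 4: resume backprop through the part before yy with the summed gradient.
yy.backward(grad_to_yyy_1 + grad_to_yyy_2)
print(xx.grad)  # 3 * (2 * yy + 3 * yy ** 2)
```

Because yyy is detached, the part of the graph before yy is traversed only once, in the final yy.backward call.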

blackyang · October 23, 2017, 3:54pm

Great, thanks! That’s also what I had in mind: basically we need a dummy variable.

BTW, is there something like nn.Identity() in torch? I couldn’t find one.

SimonW · October 23, 2017, 4:37pm

AFAIK, there is not. You can write one yourself though, although it won’t
be helpful in this case. The important thing is to detach yy from the graph
so you don’t backward through it to the part before it twice.