Extracting grad tensors from a subgraph

How do I get a list of all grad tensors that appear in the computation graph of one loss function? (Assume there are other branches in the computation graph as well. This is similar to TensorFlow’s compute_gradients() function, where you can get the list of all grads and vars from a given GraphKey.)

Hi,

If you want to backpropagate with respect to a specific list of tensors, you should use the autograd.grad() function, where inputs could be, for example, subnet.parameters().
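For reference, here is a minimal sketch of that approach; the two-layer net and the subnet names are just illustrative placeholders:

import torch
import torch.nn as nn

# Hypothetical setup: "subnet" is the first layer of a larger "net".
subnet = nn.Linear(10, 5)
net = nn.Sequential(subnet, nn.Linear(5, 1))

inp = torch.randn(3, 10)
loss = net(inp).sum()

# Gradients of loss w.r.t. subnet's parameters only; they are returned as a tuple,
# nothing is written into the .grad attributes.
grads = torch.autograd.grad(loss, list(subnet.parameters()))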

Thanks a lot for the help. What I actually want is those subnet.parameters(). Is there a way to get those just from a given loss function?

You mean all the parameters that were used to compute your loss function? Isn’t that all the parameters of your network?

No, that's exactly the issue: the loss only uses a sub-branch of the whole computation graph.

I guess you could then backward on the whole graph and check which ones are non-zero afterwards?

You mean check the .grad of all the params? And will it be zero or None?

Well, if they were used to compute the loss, they will contain the gradient value, not zero (unless the gradient happens to be 0, but then it’s as if they were not used to compute your loss).
But yes, that would be a simple solution.

I am not sure I grasp why you want to do that, but from the description, it looks like this is what you want.

Here is what I want to do:

So you mean the Variable would not have a .grad attribute if it's not used in computing that loss?

The .grad attribute is None only at creation; after it has been used once, it will be full of zeros (this zeroing is usually done when you call optimizer.zero_grad()). So an unused element will either have None or only zeros, assuming you zeroed the grads just before the backward; otherwise it will contain whatever was there before plus the gradient you just propagated.
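A minimal sketch of that check (net and loss stand in for your own model and loss):

net.zero_grad()          # make sure any existing .grad buffers are zeroed first
loss.backward()

# Parameters whose .grad is still None or all zeros did not contribute to loss.
used_params = [p for p in net.parameters()
               if p.grad is not None and not p.grad.eq(0).all()]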

Oh thanks. That helps.

OK, so the issue is this: I have a list of parameters with non-zero .grad values (from the backward of one loss function). I then use those .grad values to calculate another gradient via torch.autograd.grad, which gives another set of gradient values for only some of the parameters. Now I want to use different optimizers to apply those two sets of gradients. Do you have any insight on how to do this, other than making a copy of all the .grad tensors from one loss and doing the step() update separately?

I am not sure I see why you need “a different optimizer”. What was the first one? You only have a single optimizer over all your parameters here, no?
Would the following pseudo-code do what you want?

# Outside:
optimizer = optim.smth(net.parameters(), ...)

# Training loop
# Classic forward/backward
out = net(inp)
original_loss = criterion(out, target)
net.zero_grad()
# create_graph=True so we can backpropagate through these gradients later
original_loss.backward(create_graph=True)

# Compute the new loss from the parameters that received a gradient
new_loss = 0.
for p in net.parameters():
    if p.grad is not None and not p.grad.eq(0).all():
        new_loss += your_custom_thing(p, p.grad)

# Backward new_loss to add its gradients on top of original_loss's
new_loss.backward()

# Update your parameters according to the gradients of new_loss + original_loss
optimizer.step()

Yes, this is somewhat close to what I want, but I want to use one optimizer for new_loss and a different one for original_loss. So essentially, I do not want to add the gradients from new_loss to the ones from original_loss.
I have something like this:

op1 = optim.Adam(net.parameters(),...)
op2 = optim.Adam(subnet.parameters(),...)

Also, do I need create_graph=True since I am computing higher order derivatives later?

Yes, create_graph=True is needed because you will backpropagate through the computed gradients.
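A tiny standalone example of what create_graph=True enables (illustrative only, not from this thread):

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()

# create_graph=True records the graph of the gradient computation itself,
# so the resulting gradient can be backpropagated through again.
(g,) = torch.autograd.grad(y, x, create_graph=True)
penalty = g.pow(2).sum()
penalty.backward()       # second-order backward works because g carries a graph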

But what is the difference between these two optimizers? if “subnet” is part of “net”, you could do the same as long as the gradients for what is not part of subnet are 0? Which from my understanding they are?

If the only difference is that you want to use two different learning rates, you can handle that by scaling the gradients you backpropagate, for example by doing new_loss.backward(torch.Tensor([2])).
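As a sketch (assuming new_loss is a scalar), these are two equivalent ways of scaling its contribution; pick one:

new_loss.backward(torch.tensor(2.0))   # scale the gradient seed passed to backward
# or, equivalently:
(2.0 * new_loss).backward()            # scale the loss itself before the backward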

No, the grads of the parts of the net that are not in subnet are not zero. Actually, new_loss takes the grad values from all the params, but I calculate its gradients with respect to only those params that occur in subnet.

Yeah, right now the only difference is just the learning rate, but I might want to use an altogether different optimizer (e.g. SGD) for it. For now I can just use your method. Does it work for any optimizer?

Also, torch.autograd.grad() does not accumulate the grads into the .grad attribute of the input variables. But as you can see, I want them to be accumulated there; otherwise, how would I use them?
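One possible way to wire this up (a sketch under the assumptions above, not a confirmed solution from this thread): compute the new_loss gradients with torch.autograd.grad, write them into the subnet parameters' .grad yourself, and let each optimizer step on its own set of gradients.

# At this point .grad holds the original_loss gradients
# (from original_loss.backward(create_graph=True)).

# Gradients of new_loss w.r.t. subnet only; returned as a tuple, not accumulated in .grad.
subnet_params = list(subnet.parameters())
new_grads = torch.autograd.grad(new_loss, subnet_params)

# Update with the first optimizer using the original_loss gradients.
op1.step()

# Overwrite subnet's .grad with the new_loss gradients and update with the second optimizer.
op2.zero_grad()
for p, g in zip(subnet_params, new_grads):
    p.grad = g.detach()
op2.step()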