How do I get a list of all the gradient tensors that appear in the computation graph of one loss function? (Assume there are other branches in the computation graph as well. This is similar to TensorFlow's compute_gradients(), where we can get the list of all grads and vars for a given GraphKey.)
If you want to backpropagate with respect to a specific list of elements, you should use the
autograd.grad() function, where
inputs could be, for example,
subnet.parameters().
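In case it helps later readers, here is a minimal sketch of autograd.grad() with an explicit inputs list (the tiny two-tensor graph is made up for illustration, and it uses plain requires_grad tensors rather than the old Variable wrapper):

```python
import torch

# Made-up two-tensor graph: the loss uses w1 but not w2.
w1 = torch.randn(3, requires_grad=True)
w2 = torch.randn(3, requires_grad=True)
loss = (w1 * 2).sum()

# autograd.grad returns gradients w.r.t. exactly the tensors in `inputs`,
# without writing into any .grad attribute.
grads = torch.autograd.grad(loss, inputs=[w1])
print(grads[0])  # tensor([2., 2., 2.])
```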
Thanks a lot for the help. What I actually want is those
subnet.parameters(). Is there a way to get those just from a given loss function?
You mean all the parameters that were used to compute your loss function? Isn’t that all the parameters of your network?
No. That's the issue: it only uses a sub-branch of the whole computation graph.
I guess then you can backward on the whole graph and check which ones are non-zero afterwards?
You mean check .grad of all the params? And is it zero or None?
Well if they were used to compute the loss, they will contain the gradient value, not zero (unless your gradient is 0 but then it’s like it was not used to compute your loss).
But yes that would be a simple solution.
I am not sure I grasp why you want to do that, but from the description, it looks like what you want.
Here is what I want to do:
So you mean the Variable would not have a
.grad attribute if it's not used in computing that loss?
The .grad attribute is
None only at creation; after it has been used once, it will be full of
zeros. This typically happens when you call
optimizer.zero_grad(). So an unused element will have either
None or only zeros (if you zeroed them out just before the backward; otherwise it will be whatever was there before plus the gradient you just propagated).
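A small sketch of the three states described above (None at creation, a real gradient after backward, zeros after an explicit zeroing). The two-branch setup is hypothetical, and note that recent PyTorch versions default zero_grad() to setting grads back to None unless you pass set_to_none=False:

```python
import torch
from torch import nn, optim

# Hypothetical two-branch setup: the loss only touches branch `a`.
a = nn.Linear(2, 2, bias=False)
b = nn.Linear(2, 2, bias=False)

# At creation, .grad is None everywhere.
assert a.weight.grad is None and b.weight.grad is None

loss = a(torch.randn(1, 2)).sum()
loss.backward()

# The used parameter now holds a gradient; the unused one is still None.
assert a.weight.grad is not None
assert b.weight.grad is None

# zero_grad(set_to_none=False) zeroes existing grads in place; grads that
# were never created stay None.
opt = optim.SGD(list(a.parameters()) + list(b.parameters()), lr=0.1)
opt.zero_grad(set_to_none=False)
assert torch.all(a.weight.grad == 0)
assert b.weight.grad is None
```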
Oh thanks. That helps.
Ok so the issue is I have some list of parameters which have non-zero
.grad values (backward from one loss function). Now I use those
.grad values to calculate another grad via
torch.autograd.grad which gives another set of
.grad values for only some of the parameters. Now I want to use different optimizers for updating those
.grad values. Do you have any insight on how to do this, other than making a copy of all the
.grad tensors from one loss and doing the
step() update separately?
I am not sure I see why you need “a different optimizer”. What was the first one? You only have a single optimizer over all your parameters here, no?
Would the following pseudo code do what you want?
```python
# Outside:
optimizer = optim.smth(net.parameters(), ...)

# Training loop
# Classic fw/bw
out = net.forward(inp)
original_loss = criterion(out, target)
net.zero_grad()
original_loss.backward(create_graph=True)

# compute new loss
new_loss = 0.
for p in net.parameters():
    if not p.grad.eq(0).all():
        new_loss += your_custom_thing(p, p.grad)

# backward new_loss to add its gradients
new_loss.backward()

# Update your parameters according to the gradients for new_loss + original_loss
optimizer.step()
```
Yes, this is somewhat close to what I want. But I want to use one optimizer for new_loss and a different one for original_loss. So essentially, I do not want to add the gradients from new_loss to the ones from original_loss.
I have something like this:
```python
op1 = optim.Adam(net.parameters(), ...)
op2 = optim.Adam(subnet.parameters(), ...)
```
Also, do I need
create_graph=True since I am computing higher order derivatives later?
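One way to sketch that two-optimizer flow, for later readers. The net/subnet shapes are made up and your_custom_thing is stood in for by a squared gradient norm; also note that op1.step() has to wait until after the second gradients are computed, since stepping modifies the weights in place and would invalidate the saved tensors needed to differentiate new_loss:

```python
import torch
from torch import nn, optim

# Made-up shapes: `subnet` is the first layer of `net`.
net = nn.Sequential(nn.Linear(2, 2, bias=False), nn.Linear(2, 1, bias=False))
subnet = net[0]
op1 = optim.Adam(net.parameters(), lr=1e-3)
op2 = optim.Adam(subnet.parameters(), lr=1e-2)

inp = torch.randn(4, 2)
original_loss = net(inp).pow(2).mean()
net.zero_grad()
original_loss.backward(create_graph=True)  # grads themselves stay differentiable

# Stand-in for your_custom_thing: a scalar built from the first gradients.
new_loss = sum(p.grad.pow(2).sum() for p in net.parameters())

# Gradients of new_loss w.r.t. subnet only; .grad attributes are untouched.
sub_grads = torch.autograd.grad(new_loss, list(subnet.parameters()))

op1.step()  # update from original_loss's grads, still sitting in .grad

# Overwrite subnet's .grad with the new gradients, then step the second optimizer.
for p, g in zip(subnet.parameters(), sub_grads):
    p.grad = g.detach()
op2.step()  # update from new_loss only
```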
create_graph is because you will backpropagate the computed gradients.
But what is the difference between these two optimizers? If “subnet” is part of “net”, you could do the same, as long as the gradients for what is not part of subnet are 0. Which, from my understanding, they are?
If the only difference is that you want to use two different learning rates, you can fix that by scaling the gradients you backpropagate by doing for example
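The suggestion above trails off before its snippet, so here is a minimal sketch of what the scaling could look like (the lr_ratio value and the squared-gradient-norm stand-in for new_loss are assumptions for illustration):

```python
import torch
from torch import nn, optim

# Made-up net and data, single optimizer over everything.
net = nn.Linear(2, 1, bias=False)
optimizer = optim.SGD(net.parameters(), lr=0.1)
lr_ratio = 0.5  # hypothetical: desired lr for new_loss / optimizer's lr

inp = torch.randn(4, 2)
original_loss = net(inp).pow(2).mean()
net.zero_grad()
original_loss.backward(create_graph=True)

# Stand-in for your_custom_thing: squared norm of the first gradients.
new_loss = sum(p.grad.pow(2).sum() for p in net.parameters())

# Scaling the loss scales the gradients it adds to .grad, which is
# equivalent to giving it a smaller learning rate under the same optimizer.
(lr_ratio * new_loss).backward()
optimizer.step()
```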
No, the grads of the part of the net not in subnet are not zero. Actually, new_loss takes the grad values from all the params but computes gradients of only those which occur in subnet.
But yeah, right now the only difference is just the learning rate. I might want to use an altogether different optimizer (SGD) for it later, but for now I can just use your method. Does it work for any optimizer?
torch.autograd.grad() does not accumulate the grads in the
.grad attribute of the input variables. But as you see, I want them to accumulate; otherwise, how would I use them?
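One way to close that gap, sketched with a made-up subnet: since torch.autograd.grad() just returns the gradients, you can write them into .grad yourself (or add them to an existing .grad tensor if you want accumulation) before stepping the optimizer.

```python
import torch
from torch import nn, optim

# Made-up subnet; op2 plays the role of the second optimizer above.
subnet = nn.Linear(2, 2, bias=False)
op2 = optim.SGD(subnet.parameters(), lr=0.01)

out = subnet(torch.randn(3, 2)).sum()

# Returns the gradients as a tuple; .grad attributes are left untouched.
grads = torch.autograd.grad(out, list(subnet.parameters()))
assert subnet.weight.grad is None

# Place them in .grad by hand, then let the optimizer consume them.
for p, g in zip(subnet.parameters(), grads):
    p.grad = g
op2.step()
```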