Clone the grad attribute

I am implementing an algorithm of the following type:

At each iteration, the algorithm computes the gradients of two objectives f and g with respect to the parameters, combines them in some way, and then uses the combined result as the effective gradient for SGD or Adam… (The combination of the gradients is not linear, so I can’t just combine the objectives f and g first and take a single gradient.)

Is it true that the best way for me to do it is to compute the gradient of f by


and then clone all the grad attribute of the parameters

and then do the same thing for g

and then combine them, assign the new effective gradient to the .grad attributes, and call opt.step() at the end?
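The steps above can be sketched as follows. This is only a minimal sketch: the toy model, the input x, the objectives f and g, and the combine function are all placeholders made up for illustration, not part of the actual algorithm.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)                 # toy model (assumption)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3)                         # toy input (assumption)


def combine(gf, gg):
    # Hypothetical nonlinear combination: per element, keep whichever
    # gradient entry has the smaller magnitude.
    return torch.where(gf.abs() < gg.abs(), gf, gg)


params = list(model.parameters())

# Gradient of f, cloned so that later zero_grad()/backward() calls
# cannot modify the saved copy.
opt.zero_grad()
f = model(x).pow(2).mean()
f.backward()
grads_f = [p.grad.clone() for p in params]

# Gradient of g, computed after clearing .grad. No clone is needed here
# because nothing touches .grad again before the combination.
opt.zero_grad()
g = model(x).abs().mean()
g.backward()
grads_g = [p.grad for p in params]

# Write the combined gradient into .grad and take the optimizer step.
for p, gf, gg in zip(params, grads_f, grads_g):
    p.grad = combine(gf, gg)
opt.step()
```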

Is there any better way to do this?

I’m not sure why you would need to clone them. You can just calculate and reassign.

The better way is probably to write your own optimizer that extends torch.optim.Optimizer.

Slightly neater way.

p = list(model.parameters())
grad_f = torch.autograd.grad(f, p, retain_graph=True)  # keep the graph in case g shares it
grad_g = torch.autograd.grad(g, p)

for param, gf, gg in zip(p, grad_f, grad_g):
    param.grad = func(gf, gg)  # func is your (nonlinear) combination
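For reference, here is a self-contained version of the snippet above that can be run as is. The toy model, the input x, the two objectives, and the body of func (a sum of the two normalized gradients) are assumptions made up for illustration; substitute your own.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)                 # toy model (assumption)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(8, 3)                         # toy input (assumption)


def func(gf, gg):
    # Hypothetical nonlinear combination: sum of the normalized gradients.
    eps = 1e-12
    return gf / (gf.norm() + eps) + gg / (gg.norm() + eps)


out = model(x)            # one forward pass shared by both objectives
f = out.pow(2).mean()
g = out.abs().mean()

p = list(model.parameters())
# f and g share the graph behind `out`, so it must be retained for the
# second call.
grad_f = torch.autograd.grad(f, p, retain_graph=True)
grad_g = torch.autograd.grad(g, p)

# Assign the combined gradient and let the optimizer use it.
for param, gf, gg in zip(p, grad_f, grad_g):
    param.grad = func(gf, gg)
opt.step()
```

Because f and g are built from the same forward pass here, retain_graph=True is required on the first call. If the two objectives come from separate forward passes, it can be dropped.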

I think you would need to clone, because when I call g.backward() the .grad attributes will likely be changed.
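A small check of this concern, using a made-up parameter w and two toy objectives: backward() accumulates into an existing .grad in place, so a plain reference to .grad changes with it, while a clone does not.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

f = (w ** 2).sum()   # df/dw = 2 * w
g = (3 * w).sum()    # dg/dw = 3 for every element

f.backward()
alias = w.grad            # same tensor object as w.grad, not a copy
saved = w.grad.clone()    # independent copy of the gradient of f

g.backward()              # accumulates into the existing w.grad

print(saved)   # gradient of f only
print(w.grad)  # gradient of f plus gradient of g
print(alias)   # follows w.grad, since the accumulation is in place
```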

Actually, I am not sure what you mean by calculating and reassigning.

Regarding “The better way is probably to write your own optimizer that extends torch.optim.Optimizer.”

There is no issue with extending the optimizer. The question is how to extend it. That’s exactly what I asked.
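One possible shape for such an extension, purely as a sketch: the class name CombinedGradSGD, its step(grads_f, grads_g) signature, and the use of torch.minimum as the combination are all hypothetical choices, not an established API. It does a plain SGD update along the combined direction.

```python
import torch


class CombinedGradSGD(torch.optim.Optimizer):
    """Plain SGD whose update direction is combine(grad_f, grad_g).

    Hypothetical sketch: name and step() signature are illustrative only.
    """

    def __init__(self, params, lr, combine):
        super().__init__(params, dict(lr=lr))
        self.combine = combine

    @torch.no_grad()
    def step(self, grads_f, grads_g):
        # grads_f / grads_g hold one tensor per parameter, in the same
        # order as the parameters appear in self.param_groups.
        it_f, it_g = iter(grads_f), iter(grads_g)
        for group in self.param_groups:
            for p in group["params"]:
                direction = self.combine(next(it_f), next(it_g))
                p.add_(direction, alpha=-group["lr"])


w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = CombinedGradSGD([w], lr=0.1, combine=torch.minimum)

f = (w ** 2).sum()   # gradient is 2 * w
g = w.sum()          # gradient is 1 for every element
params = [p for group in opt.param_groups for p in group["params"]]
grads_f = torch.autograd.grad(f, params)
grads_g = torch.autograd.grad(g, params)
opt.step(grads_f, grads_g)
```

Changing the signature of step() means tools that expect the standard step(closure=None) will not work with this class, so reusing a stock optimizer and only overwriting .grad may be the simpler route.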


But I have no idea what torch.autograd.grad does… Does it just return a list of tensors that contain the gradient of f with respect to p?

If that’s the case then this should work for me.

It will return a tuple of Variables, one per entry in p. By default they have requires_grad=False, so they are basically tensors.

Oh, I see. Yeah, they will be changed. So regarding the first route, @ruotianluo's approach is better.

I see. Just to clarify, the value of grad_f won’t be changed by any other operations in the future, right? (By contrast, the .grad attribute can potentially be changed by other operations.)

So this sounds like a much cleaner way to deal with this kind of issue. Thanks!

They won’t be changed. Also, torch.autograd.grad does not accumulate into the .grad attributes, so the parameters’ .grad will remain None unless something else sets it.
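Both points can be checked with a tiny example. The parameter w and the objectives f and g below are made up for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
f = (w ** 2).sum()   # df/dw = 2 * w
g = (3 * w).sum()    # dg/dw = 3 for every element

(grad_f,) = torch.autograd.grad(f, [w])   # returns a tuple, one entry per input
snapshot = grad_f.clone()

print(w.grad)                 # None: autograd.grad did not touch .grad
print(grad_f.requires_grad)   # False

g.backward()                  # this call does populate w.grad
print(w.grad)                 # gradient of g only
print(torch.equal(grad_f, snapshot))   # True: grad_f was not modified
```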