I have a tensor logZ that’s the result of a computation. I want to use this tensor to compute 2 different losses used to update the parameters of self.forward(input):

One loss is computed as a function of the gradient of logZ w.r.t. another tensor, say A.

Another loss is computed directly from logZ.

Would this be the right way to do this (especially regarding creating and retaining the differentiation graph correctly)?

A = self.forward(input)
logZ = some_differentiable_fn(A)
# compute input for first loss
logZ.sum().backward(retain_graph=True, create_graph=True)
marginals = A.grad
loss1 = some_other_differentiable_fn(marginals).mean()
# reset param gradients that got filled because of first .backward()
for p in self.parameters():
    p.grad = None
# second loss and backward on sum of losses, update params
loss2 = (-logZ).mean()
(loss1 + loss2).backward()
optimizer.step()

I tried computing marginals as marginals = autograd.grad(logZ, A, retain_graph=True, create_graph=True), but I wasn’t sure whether the newly created graph out of this operation would reach the network’s parameters (since A is an output of the network and not an input).

Yes, this is right (but I haven’t looked at your code in detail).

It’s probably a little more convenient to:

marginals = torch.autograd.grad(logZ.sum(), A, create_graph=True)[0]

because autograd.grad() doesn’t set the parameter gradients, so you
don’t need to then reset them.
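This can be checked directly in a minimal sketch. Here a small Linear layer stands in for self.forward, and a squared activation stands in for some_differentiable_fn (both are placeholders, not your actual code):

```python
import torch

# Stand-ins (hypothetical) for self.forward and some_differentiable_fn.
net = torch.nn.Linear(3, 1)
A = net(torch.randn(5, 3))          # A is an output of the network, not a leaf
logZ = A.pow(2)                     # any differentiable function of A

# autograd.grad returns the gradient as a tuple instead of writing into
# .grad attributes, so no parameter gradients need to be reset afterwards.
marginals = torch.autograd.grad(logZ.sum(), A, create_graph=True)[0]

assert all(p.grad is None for p in net.parameters())
assert torch.allclose(marginals, 2 * A)   # d/dA of sum(A**2) is 2 * A
```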

Yes, as noted above, this will work (provided you .sum() logZ to a
scalar and use [0] to extract the first (and only) element of the tuple
returned by autograd.grad()).

Yes, the fact that A is an output of the network is fine. A carries requires_grad = True, so the computation graph continues forward
from A to logZ. It doesn’t matter that A is an “intermediate variable”
and not a “leaf” of the computation graph.

As an aside, when create_graph = True, retain_graph will default
to True, so you don’t have to specify it explicitly (but it doesn’t hurt if
you do).

Thank you for your answer, K. Frank :-). Indeed it seems a little more practical to use autograd.grad instead of .backward.

I was a little confused, since I assumed marginals = autograd.grad(logZ, A, create_graph=True) backpropagates the gradient of logZ only as far as A and stops there, while I want self’s parameters to be updated. However, since we use create_graph=True, this differentiation is itself attached to the computation graph, which does include the network’s parameters, so we can later call .backward() on the computed gradient of logZ and this time the new gradients will reach the parameters. Pretty neat.
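That chain can be verified end to end in a small sketch (the Linear layer and the squared/negated losses are placeholders for the real network and loss functions): differentiate logZ w.r.t. the intermediate A with create_graph=True, build a loss on that gradient, and confirm that a single .backward() on the combined loss reaches the parameters.

```python
import torch

net = torch.nn.Linear(3, 1)         # stand-in for self.forward
A = net(torch.randn(5, 3))
logZ = A.pow(2)

# The first-order pass stops at A, but create_graph=True records it
# in the graph, so the result is itself differentiable.
marginals = torch.autograd.grad(logZ.sum(), A, create_graph=True)[0]
loss1 = marginals.pow(2).mean()     # stand-in for some_other_differentiable_fn
loss2 = (-logZ).mean()
(loss1 + loss2).backward()

# The second-order path through marginals did reach the parameters.
assert all(p.grad is not None for p in net.parameters())
```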

Am I right to conclude then that marginals = autograd.grad(logZ, A, create_graph=True) should also be more efficient in time and space than logZ.sum().backward(), since we only actually differentiate up until the variable that we care about, and not the entire graph?

This is correct. Furthermore, .backward() (unlike autograd.grad())
will populate the .grad properties of the leaf tensors (typically your
model parameters) of the computation graph. This also takes time and
space.