Gradient of intermediate variable as loss input

I have a tensor `logZ` that's the result of a computation. I want to use this tensor to compute two different losses, both used to update the parameters of `self.forward(input)`:

• One loss is computed as a function of the gradient of `logZ` w.r.t. another tensor, say `A`.
• Another loss is computed directly from `logZ`.

Would this be the right way to do this (especially regarding creating and retaining the differentiation graph correctly)?

```python
A = self.forward(input)
logZ = some_differentiable_fn(A)

# A is a non-leaf tensor, so tell autograd to keep its .grad
A.retain_grad()

# compute input for the first loss
logZ.sum().backward(retain_graph=True, create_graph=True)
marginals = A.grad
loss1 = some_other_differentiable_fn(marginals).mean()

# reset the parameter gradients that got filled by the first .backward()
for p in self.parameters():
    p.grad = None

# second loss; backward on the sum of losses, then update the parameters
loss2 = (-logZ).mean()
(loss1 + loss2).backward()
optimizer.step()
```

I tried computing `marginals` as `marginals = autograd.grad(logZ, A, retain_graph=True, create_graph=True)`, but I wasn't sure whether the graph created by this operation would reach the network's parameters (since `A` is an output of the network and not an input).

Hi Ruben!

Yes, this is right (but I haven’t looked at your code in detail).

It’s probably a little more convenient to:

```python
marginals = torch.autograd.grad(logZ.sum(), A, create_graph=True)[0]
```

because `autograd.grad()` doesn’t set the parameter gradients, so you
don’t need to then reset them.

Yes, as noted above, this will work (provided you `.sum()` `logZ` to a
scalar and use `[0]` to extract the first (and only) term of the tuple
returned by `autograd.grad()`).
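Putting those pieces together, the whole step might look like the sketch below. The model, the two functions, and the optimizer are placeholders standing in for the names in the question, not the actual code:

```python
import torch

# Hypothetical stand-ins for the names in the question.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
some_differentiable_fn = lambda a: torch.logsumexp(a, dim=-1)
some_other_differentiable_fn = lambda m: m ** 2
input = torch.randn(8, 4)

optimizer.zero_grad()
A = model(input)
logZ = some_differentiable_fn(A)

# grad() returns a tuple; [0] extracts the gradient w.r.t. A.
# create_graph=True makes `marginals` itself differentiable.
marginals = torch.autograd.grad(logZ.sum(), A, create_graph=True)[0]
loss1 = some_other_differentiable_fn(marginals).mean()

# No parameter .grad values were touched, so nothing needs resetting here.
loss2 = (-logZ).mean()
(loss1 + loss2).backward()
optimizer.step()
```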

Yes, the fact that `A` is an output of the network is fine. `A` carries
`requires_grad = True`, so the computation graph continues forward
from `A` to `logZ`. It doesn’t matter that `A` is an “intermediate variable”
and not a “leaf” of the computation graph.
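A toy example (with made-up shapes) shows both halves of this: differentiating with respect to a non-leaf tensor works, and the resulting gradient is itself connected back to the leaf parameters:

```python
import torch

w = torch.randn(3, 3, requires_grad=True)   # a "parameter" (leaf tensor)
x = torch.randn(3)
A = w @ x                                   # intermediate (non-leaf) tensor
logZ = (A ** 2).sum()

# Differentiate w.r.t. the non-leaf A; for this computation grad_A == 2 * A.
grad_A = torch.autograd.grad(logZ, A, create_graph=True)[0]
print(grad_A.requires_grad)                 # True: part of a new graph

# A loss built from grad_A backpropagates all the way to w.
grad_A.sum().backward()
print(w.grad is not None)                   # True
```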

As an aside, when `create_graph = True`, `retain_graph` will default
to `True`, so you don’t have to specify it explicitly (but it doesn’t hurt if
you do).
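A quick (toy) check of that default: after a `grad()` call with `create_graph=True` and no explicit `retain_graph`, a second backward pass through the same graph still succeeds:

```python
import torch

x = torch.randn(5, requires_grad=True)
y = (x ** 3).sum()

# No retain_graph specified; create_graph=True keeps the graph alive anyway.
g = torch.autograd.grad(y, x, create_graph=True)[0]

y.backward()  # would raise a "graph has been freed" error otherwise
print(torch.allclose(g, x.grad))   # True: both equal 3 * x**2
```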

Best.

K. Frank

Thank you for your answer, K. Frank :-). Indeed it seems a little more practical to use `autograd.grad` instead of `.backward`.

I was a little confused, since I guess `marginals = autograd.grad(logZ, A, create_graph=True)` backpropagates the gradient of `logZ` up until `A` and stops there, while I want `self`’s parameters to be updated. However, since we use `create_graph=True`, this differentiation gets attached to the computation graph, which does include the network’s parameters, so we can later do `.backward()` on the computed gradient of `logZ` and this time the new gradient will reach the parameters. Pretty neat.

Am I right to conclude then that `marginals = autograd.grad(logZ, A, create_graph=True)` should also be more efficient in time and space than `logZ.sum().backward()`, since we only actually differentiate up until the variable that we care about, and not the entire graph?

Hi Ruben!

This is correct. Furthermore, `.backward()` (unlike `autograd.grad()`)
will populate the `.grad` properties of the leaf tensors (typically your
model parameters) of the computation graph. This also takes time and
space.
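That difference is easy to see in a minimal sketch: `autograd.grad()` leaves the leaf `.grad` fields untouched, while `.backward()` populates them:

```python
import torch

w = torch.randn(3, requires_grad=True)
y = (w ** 2).sum()

g = torch.autograd.grad(y, w, retain_graph=True)[0]
print(w.grad)                # None: autograd.grad() left the leaves alone

y.backward()
print(w.grad is not None)    # True: .backward() populated it
```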

Best.

K. Frank
