Backpropagation for each dimension of output

nil_tdr · July 30, 2023, 5:52pm

Hello!

In PyTorch, calling loss.backward() performs backpropagation for the sample (in the stochastic case). Suppose my output is 50-dimensional and I have two loss components. The first is an array of dimension 50. How can I run loss.backward() for each dimension separately? The second is a scalar, and I want to do normal backpropagation for it on the whole output tensor, together with the first loss on each dimension of the output. Do I have to use a custom loss function?

Thanks a lot for your kind help!
Nil

KFrank · July 31, 2023, 2:45am

Hi Nil!

I’m not sure that I understand your use case. In particular, what do you
mean by “run loss.backward() for each dimension separately?”

But, taking your question at face value, note that loss.backward()
accumulates the gradient of loss into the various Parameter.grads.

So, for example:

loss_array[0].backward()   # possibly with retain_graph = True
loss_array[1].backward()
loss_array[2].backward()

gives the same final result for the relevant .grad values as does:

(loss_array[0] + loss_array[1] + loss_array[2]).backward()

and the latter will be more efficient.
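
As a quick check of that equivalence, here is a minimal sketch. The parameter w, the inputs x, and the squared per-element losses are placeholders invented for illustration, not anything from the original question.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)   # hypothetical parameter
x = torch.randn(3, 4)                    # hypothetical inputs

def make_loss_array():
    # Three scalar losses that all depend on the same parameter w.
    return (x @ w) ** 2

# Separate backward calls: each one adds its gradient into w.grad.
loss_array = make_loss_array()
loss_array[0].backward(retain_graph=True)
loss_array[1].backward(retain_graph=True)
loss_array[2].backward()
grad_separate = w.grad.clone()

# One backward call on the sum, starting from a cleared .grad.
w.grad = None
loss_array = make_loss_array()
(loss_array[0] + loss_array[1] + loss_array[2]).backward()
grad_summed = w.grad.clone()

# Compare the two results up to round-off.
same = torch.allclose(grad_separate, grad_summed)
print(same)
```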

Best.

K. Frank

nil_tdr · August 4, 2023, 2:17pm

Hi K. Frank,

Thanks for your response and sorry for my late reply. My loss is an array of dimension 50. I know that in PyTorch the loss has to be a scalar value. Since my loss is an array, I can’t use the default backward() function. This is what I tried.

optim.zero_grad()
for grad in gradient:   # gradient is a 50-dimensional array
    loss = grad
    loss.backward()
optim.step()

This is the error I got:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

I would really appreciate it if you could point me in the right direction to make this special case work.

Thanks,
Nil

KFrank · August 5, 2023, 1:51am

Hi Nil!

Presumably elements of your gradient array share parts of the same
computation graph. That is, the computation of gradient[0] and, for
example, gradient[1] partially overlap. Calling gradient[0].backward()
deletes gradient[0]'s computation graph, including any parts of it that
are shared by gradient[1]'s computation graph. So when you then call
gradient[1].backward(), parts of gradient[1]'s computation graph
have been deleted, leading to the error you report.

retain_graph = True tells autograd not to delete the computation graph,
so you could do something like:

optim.zero_grad()
for  i, grad in enumerate (gradient):   # gradient is a 50 dimensional array
    loss = grad
    if  i < 49:
        loss.backward (retain_graph = True)
    else:
        loss.backward()
optim.step()

(The final call to loss.backward() does not have retain_graph = True
because you do need to delete the computation graph at some point, typically
before calling optim.step() and / or performing the next forward pass.)
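
To see both behaviors in isolation, here is a small sketch. The tensor w and the exp(w * w) computation are made up for illustration; they are chosen only because exp() saves a tensor for its backward pass, so the elements of the result share a graph that gets freed.

```python
import torch

w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)   # hypothetical parameter

def make_gradient():
    # exp() saves its result for the backward pass, and every element
    # of the returned tensor shares that one exp node.
    return torch.exp(w * w)

# Without retain_graph: the second call finds the saved tensors freed.
gradient = make_gradient()
gradient[0].backward()
try:
    gradient[1].backward()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

# With retain_graph=True on all but the last call, the loop completes.
w.grad = None
gradient = make_gradient()
last = len(gradient) - 1
for i, loss in enumerate(gradient):
    loss.backward(retain_graph=(i < last))

print(second_call_failed, w.grad)
```

Computing the last index from len(gradient) avoids hard-coding the array length.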

This is a perfectly reasonable way to use autograd and .backward().
However, it’s likely to be inefficient, because you repeat (the shared part
of) the backward pass fifty times.

loss.backward() computes the gradient of loss with respect to the
parameters on which loss depends and accumulates that gradient into
those parameters’ .grad properties. But computing the gradient is a linear
operation (so that grad_of (a + b) = grad_of (a) + grad_of (b)).

So you are likely better off with:

optim.zero_grad()
loss_total = 0
for  grad in gradient:
    loss_total = loss_total + grad
loss_total.backward()
optim.step()

This only performs a single backward pass (rather than fifty) and, up to
numerical round-off error, computes the same final gradient (as stored in
the various parameters’ .grad properties) as does the version that called
.backward() fifty times.

As an aside, you will probably also achieve additional efficiency (and code
cleanliness) if you can arrange your computation so that gradient is a
single one-dimensional pytorch tensor of length fifty that is computed all at
once with pytorch tensor operations rather than an array of fifty length-one
pytorch tensors that is computed entry by entry.
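
A minimal sketch of that vectorized arrangement, which also folds in a scalar loss component like the one in the original question. The Linear model, the squared-error per-dimension loss, and the mean-absolute-value scalar loss are all assumptions made up for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 50)            # hypothetical model with a 50-dim output
optim = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8)                        # hypothetical single sample
target = torch.randn(50)                  # hypothetical target

optim.zero_grad()
output = model(x)
loss_vector = (output - target) ** 2      # shape (50,), one loss per output dimension
loss_scalar = output.abs().mean()         # hypothetical scalar loss on the whole output

# Summing reduces everything to one scalar, so a single backward pass suffices.
loss_total = loss_vector.sum() + loss_scalar
loss_total.backward()
optim.step()
```

Because the gradient is linear, this accumulates the same .grad values as calling .backward() on each of the fifty entries and on the scalar loss separately.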

Best.

K. Frank

nil_tdr · August 6, 2023, 6:49pm

Thank you so much Mr. Frank. This is really helpful. I am working on it and will update what I find.
I found a mistake in my problem formulation. The gradient I mentioned is the gradient of the loss, not the loss itself. Is there a way to use this gradient (which is an array of dimension 50) directly in PyTorch?

KFrank · August 7, 2023, 12:53am

Hi Nil!

I don’t know what your use case is and I don’t follow what you are asking.

If you want to compute the gradient of a gradient (so, a second derivative),
you want, more or less, the so-called Hessian.

Autograd supports computing gradients of gradients. You might start by
taking a look at torch.autograd.functional.hessian().
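
As a rough sketch of what that looks like, assuming a made-up scalar function f (the sum of cubes below is purely illustrative), here is the Hessian computed with torch.autograd.functional.hessian(), followed by a by-hand gradient of a gradient using create_graph=True:

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    # Hypothetical scalar function: f(x) = sum(x_i ** 3)
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0, 3.0])
H = hessian(f, x)          # shape (3, 3); analytically diag(6 * x)

# Same idea by hand: build the gradient with create_graph=True so that
# it is itself differentiable, then backpropagate through it.
x2 = x.clone().requires_grad_(True)
(g,) = torch.autograd.grad(f(x2), x2, create_graph=True)
g.sum().backward()         # x2.grad holds d(sum of gradient)/dx = 6 * x

print(H)
print(x2.grad)
```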

Best.

K. Frank