# Backpropagation for each dimension of output

Hello!

In PyTorch, when we call loss.backward() it performs backpropagation for the sample (in the stochastic case). Let’s say my output is 50-dimensional, and I have two loss components. The first one is an array of dimension 50. How can I run loss.backward() for each dimension separately? I also have another loss component which is a scalar, and I want to do normal backpropagation for this one on the whole output tensor, together with my first loss on each dimension of the output. Do I have to use a custom loss function?

Thanks a lot for your kind help!
Nil

Hi Nil!

I’m not sure that I understand your use case. In particular, what do you
mean by “run loss.backward() for each dimension separately?”

But, taking your question at face value, note that `loss.backward()`
accumulates the gradient of `loss` into the various `Parameter.grad`s.

So, for example:

```
loss_array[0].backward()   # possibly with retain_graph = True
loss_array[1].backward()
loss_array[2].backward()
```

gives the same final result for the relevant `.grad` values as does:

```
(loss_array[0] + loss_array[1] + loss_array[2]).backward()
```

and the latter will be more efficient.
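A tiny runnable sketch of this accumulation behavior (the one-parameter model and the `loss_array` values here are made up purely for illustration):

```python
import torch

# A single scalar parameter and three losses that share its graph.
w = torch.tensor(2.0, requires_grad=True)
loss_array = torch.stack([w * 1.0, w * 2.0, w * 3.0])

# Backward through each element; retain_graph keeps the shared graph alive.
loss_array[0].backward(retain_graph=True)
loss_array[1].backward(retain_graph=True)
loss_array[2].backward()
grad_three_calls = w.grad.clone()

# Same final .grad with a single backward over the sum.
w.grad = None
loss_array = torch.stack([w * 1.0, w * 2.0, w * 3.0])
(loss_array[0] + loss_array[1] + loss_array[2]).backward()
grad_one_call = w.grad.clone()

print(torch.allclose(grad_three_calls, grad_one_call))  # True
```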

Best.

K. Frank

Hi K. Frank,

Thanks for your response, and sorry for my late reply. My loss is an array of dimension 50. I know that in PyTorch the loss has to be a scalar value. Since my loss is an array, I can’t use the default backward() function. This is what I tried:

```
loss.backward()
optim.step()
```

This is the error I got:

```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```

I would really appreciate it if you could give me a direction on where I should look to make this special case work.

Thanks,
Nil

Hi Nil!

Presumably elements of your `loss` array share parts of the same
computation graph. That is, the computations of, for example, `loss[0]`
and `loss[1]` partially overlap. Calling `loss[0].backward()` deletes
`loss[0]`'s computation graph, including any parts of it that are shared
by `loss[1]`'s computation graph. So when you then call
`loss[1].backward()`, parts of `loss[1]`'s computation graph have
already been deleted, leading to the error you report.
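A minimal reproduction of this failure mode (the tensors here are made up; `shared` stands in for whatever part of the graph the loss elements have in common):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
shared = w * w                        # part of the graph shared by both losses
loss = torch.stack([shared + 1.0, shared + 2.0])

loss[0].backward()                    # frees the shared graph's saved tensors
msg = None
try:
    loss[1].backward()                # walks the already-freed shared graph
except RuntimeError as err:
    msg = str(err)

print(msg is not None)  # True -- the second backward raises
```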

`retain_graph = True` tells autograd not to delete the computation graph,
so you could do something like:

```
optim.zero_grad()
for i in range (50):
    if i < 49:
        loss[i].backward (retain_graph = True)
    else:
        loss[i].backward()
optim.step()
```

(The final call to `loss[i].backward()` does not have `retain_graph = True`
because you do need to delete the computation graph at some point, typically
before calling `optim.step()` and / or performing the next forward pass.)

This is a perfectly reasonable way to use autograd and `.backward()`.
However, it’s likely to be inefficient, because you repeat (the shared part
of) the backward pass fifty times.

`loss.backward()` computes the gradient of `loss` with respect to the
parameters on which `loss` depends and accumulates that gradient into
those parameters’ `.grad` properties. But computing the gradient is a linear
operation (so that `grad_of (a + b) = grad_of (a) + grad_of (b)`).

So you are likely better off with:

```
optim.zero_grad()
loss_total = 0
for i in range (50):
    loss_total = loss_total + loss[i]
loss_total.backward()
optim.step()
```

This only performs a single backward pass (rather than fifty) and, up to
numerical round-off error, computes the same final gradient (as stored in
the various parameters’ `.grad` properties) as does the version that called
`.backward()` fifty times.
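As a sanity check of this equivalence, here is a sketch with a made-up `nn.Linear` model producing a length-50 loss array (the model, input, and squared-error losses are illustrative, not Nil's actual setup):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 50)
x = torch.randn(10)

# Version 1: fifty separate backward passes over the per-dimension losses.
loss = model(x) ** 2                           # length-50 loss "array"
for i in range(50):
    loss[i].backward(retain_graph=(i < 49))    # keep graph until the last pass
grads_loop = [p.grad.clone() for p in model.parameters()]

# Version 2: a single backward pass over the summed loss.
model.zero_grad()
loss_total = (model(x) ** 2).sum()
loss_total.backward()
grads_sum = [p.grad.clone() for p in model.parameters()]

for g_loop, g_sum in zip(grads_loop, grads_sum):
    print(torch.allclose(g_loop, g_sum, atol=1e-6))  # True for each parameter
```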

As an aside, you will probably also achieve additional efficiency (and code
cleanliness) if you can arrange your computation so that `gradient` is a
single one-dimensional pytorch tensor of length fifty that is computed all at
once with pytorch tensor operations rather than an array of fifty length-one
pytorch tensors that is computed entry by entry.
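For instance, assuming (purely for illustration) that each per-dimension loss is a squared error against some target:

```python
import torch

pred = torch.randn(50, requires_grad=True)
target = torch.randn(50)

# Entry by entry: a Python list of fifty scalar tensors (slow and clunky).
loss_list = [(pred[i] - target[i]) ** 2 for i in range(50)]

# All at once: a single length-50 tensor from one vectorized expression.
loss_vec = (pred - target) ** 2

print(torch.allclose(torch.stack(loss_list), loss_vec))  # True
```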

Best.

K. Frank

Thank you so much, Mr. Frank. This is really helpful. I am working on it and will update with what I find.
I found a mistake in my problem formulation. The gradient I mentioned is the gradient of the loss, not the loss itself. Is there a way to directly use the gradient (which is an array of dimension 50) in PyTorch?

Hi Nil!