 # What does the backward() function do?

I have two networks, “net1” and “net2”.
Let us say “loss1” and “loss2” are the classifier losses of “net1” and “net2”,
and “optimizer1” and “optimizer2” are the optimizers of the two networks.

“net2” is a pretrained network, and I want to backprop the (gradients of the) loss of “net2” into “net1”.
loss1 = …some loss defined
So, loss1 = loss1 + loss2 (let's say that loss2 was defined earlier)

So I do
loss1.backward(retain_graph=True)  # what happens when I write this?
loss2.backward()
optimizer2.step()

What is the difference between backward() and step()?

If I do not write loss1.backward(), what will happen?


`loss.backward()` computes `dloss/dx` for every parameter `x` which has `requires_grad=True`. These are accumulated into `x.grad` for every parameter `x`. In pseudo-code:

```
x.grad += dloss/dx
```

`optimizer.step()` updates the value of `x` using the gradient `x.grad`. For example, the SGD optimizer performs:

```
x += -lr * x.grad
```

`optimizer.zero_grad()` clears `x.grad` for every parameter `x` in the optimizer. It’s important to call this before `loss.backward()`, otherwise you’ll accumulate the gradients from multiple passes.
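
To make the three calls concrete, here is a minimal sketch with a single hypothetical parameter `x` and a toy loss `sum(x**2)`; the values and the learning rate are assumptions chosen only so the arithmetic is easy to follow:

```python
import torch

# One hypothetical parameter; any tensor with requires_grad=True behaves the same.
x = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)

# First backward pass: d(sum(x^2))/dx = 2x is written into x.grad.
(x ** 2).sum().backward()
grad_after_one = x.grad.clone()

# Second backward pass without zero_grad(): the new gradient is added on top.
(x ** 2).sum().backward()
grad_after_two = x.grad.clone()

# zero_grad() clears x.grad (set to None or to zeros, depending on the PyTorch version).
optimizer.zero_grad()
grad_is_cleared = x.grad is None or bool((x.grad == 0).all())

# Fresh pass, then step(): plain SGD applies x <- x - lr * x.grad.
(x ** 2).sum().backward()
optimizer.step()
x_after_step = x.detach().clone()
```

Here `grad_after_two` is twice `grad_after_one`, which is the accumulation that `zero_grad()` is meant to prevent.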

If you have multiple losses (loss1, loss2) you can sum them and then call `backward()` once:

```
loss3 = loss1 + loss2
loss3.backward()
```
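
Because gradients accumulate, summing the losses and calling `backward()` once should give the same `.grad` as calling `backward()` on each loss in turn. A small sketch, assuming a toy parameter `w` and two made-up losses that do not share any intermediate results:

```python
import torch

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

def two_losses(w):
    # Hypothetical losses used only for illustration.
    loss1 = (w ** 2).sum()   # gradient: 2w
    loss2 = (3 * w).sum()    # gradient: 3
    return loss1, loss2

# Option A: sum first, one backward call.
loss1, loss2 = two_losses(w)
(loss1 + loss2).backward()
grad_summed = w.grad.clone()

# Option B: two backward calls, gradients accumulate in w.grad.
w.grad = None
loss1, loss2 = two_losses(w)
loss1.backward()
loss2.backward()
grad_separate = w.grad.clone()
```

If the two losses share part of the forward computation, the first `backward()` in option B needs `retain_graph=True`, as in the original question; the single summed call avoids that.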

Hi @colesbury, thanks for your illustration.

I have one more question. Let's say I want to backprop “loss3” into “net1” and do not want to backprop “loss2” into “net2”. In that case I should not write

loss2.backward()
optimizer2.step()

I should only write

loss3 = loss1 + loss2
loss3.backward()

Is that right?

If I have called loss2.backward() but not optimizer2.step(), will that affect my gradients when I call loss3.backward()?

Does backward update the weights, if we do not use optimizer?


Very clear explanation! Suppose I have two losses: loss1 gets its data from dataloader1 with a batch size of 4, while loss2 gets its data from loader2, also with a batch size of 4. Then what is the batch size of loss = loss1 + loss2? Is it 4 or 8? The code looks like this:

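```
criterion = nn.CrossEntropyLoss()  # assumes `import torch.nn as nn`

optimizer.zero_grad()
loss1 = criterion(output1, target1)  # batch of 4 from dataloader1
loss2 = criterion(output2, target2)  # batch of 4 from loader2
loss = loss1 + loss2
loss.backward()
optimizer.step()
```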

No, `backward()` does not update the weights. The update happens only when you call `optimizer.step()`.
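
This can be checked directly. The sketch below uses a small `nn.Linear` layer and an all-ones input as stand-ins (both are assumptions for illustration) and compares the weights before and after each call:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(3, 1)  # hypothetical tiny network
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

before = net.weight.detach().clone()

loss = net(torch.ones(4, 3)).sum()
loss.backward()
# backward() only fills net.weight.grad; the weights themselves are untouched.
unchanged_after_backward = torch.equal(net.weight.detach(), before)

optimizer.step()
# step() is what actually modifies the weights.
changed_after_step = not torch.equal(net.weight.detach(), before)
```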


In the next iteration, a fresh new graph is created and ready for back-propagation.

I am wondering. When exactly is the fresh new graph created? Is it when we call:

• `optimizer.zero_grad()`
• `output.backward()`
• `optimizer.step()`

or at some other time?

I am wondering. When exactly is the fresh new graph created?

It’s created during the forward pass, i.e. when you write something like:

`loss = criterion(model(input), target)`

The graph is accessible through `loss.grad_fn` and the chain of autograd `Function` objects.

The graph is used by `loss.backward()` to compute gradients.

`optimizer.zero_grad()` and `optimizer.step()` do not affect the graph of autograd objects. They only touch the model’s parameters and the parameters’ `grad` attributes.
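
A rough sketch of this, assuming a toy `nn.Linear` model, an `MSELoss` criterion, and random data (none of which come from the thread): each forward pass produces a loss with its own `grad_fn`, and no graph is recorded when gradient tracking is disabled.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)          # hypothetical model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(4, 2)
target = torch.randn(4, 1)

# Iteration 1: the forward pass records the graph; loss_a.grad_fn is its root.
loss_a = criterion(model(inputs), target)
has_graph = loss_a.grad_fn is not None
optimizer.zero_grad()
loss_a.backward()
optimizer.step()

# Iteration 2: a new forward pass records a new graph with a different root node.
loss_b = criterion(model(inputs), target)
is_new_graph = loss_b.grad_fn is not loss_a.grad_fn

# With gradient tracking disabled, the forward pass records no graph at all.
with torch.no_grad():
    loss_c = criterion(model(inputs), target)
no_graph = loss_c.grad_fn is None
```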


If there are several branches / subgraphs - would it be beneficial or even possible to do loss.backward() on the subgraphs? I’m hoping that when one branch finishes early it might free up memory this way.

I know that you can add the losses together and do one losses.backward() btw

Here, x only represents the parameters that contribute to the ‘loss’, right? (i.e., loss.backward() doesn’t affect other variables that do not contribute to ‘loss’.)

Hi,

I want to implement the backward graph separately, which means dropping `loss.backward()` and substituting it with a network that accepts the error as input and gives the gradients in each layer. For example, for MSE loss it is intuitive to use `error = target-output` as the input to the backward graph (which, in a fully connected network, is the transpose of the forward graph).
PyTorch loss functions give the loss and not the tensor that is given as input to the backward graph. Is there any easy way to access the input to the backward graph after computing the loss? (e.g. `loss = nn.CrossEntropyLoss()(outputs, targets)`)

Thanks

I’m not sure which input you are looking for, but you can pass the `gradient` directly to the `backward` function.
The default would be a scalar value of 1. If you need to provide a specific gradient, you could use `loss.backward(gradient=...)`.
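
As an illustration of the `gradient` argument, the sketch below calls `backward` on a non-scalar tensor with a made-up upstream gradient and then compares the default call on a scalar loss with an explicit gradient of 1. The tensor values are assumptions chosen for readability:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Non-scalar output: backward() needs a gradient tensor of the same shape as y.
y = x ** 2
upstream = torch.tensor([1.0, 0.1, 0.01])  # hypothetical upstream gradient
y.backward(gradient=upstream)
grad_custom = x.grad.clone()               # upstream * dy/dx = upstream * 2x

# Scalar loss with the default gradient.
x.grad = None
(x ** 2).sum().backward()
grad_default = x.grad.clone()

# Scalar loss with the gradient of 1 passed explicitly.
x.grad = None
(x ** 2).sum().backward(gradient=torch.tensor(1.0))
grad_explicit = x.grad.clone()
```

Passing a custom `gradient` in this way lets you inject your own error signal at the output instead of the one derived from a built-in loss.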
