How are optimizer.step() and loss.backward() related?


I am pretty new to PyTorch and keep being surprised by its performance :slight_smile:

I have followed tutorials and there’s one thing that is not clear.

How are optimizer.step() and loss.backward() related?

Does optimizer.step() optimize based on the closest preceding loss.backward() call?

When I check the loss calculated by the loss function, it is just a Tensor and doesn’t seem to be related to the optimizer at all.

Here’s my questions:

(1) Does optimizer.step() optimize based on the closest preceding loss.backward() call?

(2) What happens if I call backward() on several different losses and then call optimizer.step()?
Does the optimizer optimize based on all of the previously called losses?

Thank you!

  1. optimizer.step() performs a parameter update based on the current gradient (stored in the .grad attribute of each parameter) and the update rule. As an example, the update rule for SGD is defined here:

  2. Calling .backward() multiple times accumulates the gradient (by addition) into each parameter’s .grad. This is why you should call optimizer.zero_grad() after each .step() call. Note that after the first .backward() call, a second call is only possible after you have performed another forward pass.
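A minimal sketch of point 2 (the setup here is illustrative, not from the thread): two backward() calls on fresh forward passes add into the same .grad tensor.

```python
import torch

w = torch.ones(1, requires_grad=True)

loss1 = (w * 2).sum()   # d(loss1)/dw = 2
loss1.backward()
print(w.grad)           # tensor([2.])

loss2 = (w * 3).sum()   # a second forward pass; d(loss2)/dw = 3
loss2.backward()
print(w.grad)           # tensor([5.])  <- 2 + 3: gradients were accumulated
```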

So for your first question: the update is not based on the “closest” call but on the .grad attribute. How you calculate the gradient is up to you.
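Putting the two points together, the usual pattern can be sketched like this (model, data, and learning rate are placeholders): zero_grad() clears the old gradients, backward() fills .grad, and step() reads .grad to update the parameters.

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad()                             # clear gradients from the last step
loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass
loss.backward()                                   # populates p.grad for every parameter
optimizer.step()                                  # p <- p - lr * p.grad (for plain SGD)
```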


This is certainly not true if you specify retain_graph=True, and in some simple cases it seems to be possible to backpropagate multiple times even without specifying retain_graph=True (but I don’t understand why). Also, the docs for backward() say about retain_graph,

But I am not sure this is really true. In the architectures I have worked with, I have often had to specify retain_graph=True, and if there are more efficient ways of doing what I needed to do, I couldn’t find them. (Is there an explanation somewhere of what these more efficient workarounds are, in what cases they work, and in what apparently rare cases they fail?)

For instance, two cases I have encountered are: when you have two different loss functions, used to update different parameters, but calculated using some of the same graph; and when you have an RNN and want to do backpropagation through time with overlapping backprop regions (like backprop 512 steps, and then 256 steps later backprop another 512 steps).
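The first case (two losses sharing part of the graph) can be sketched roughly like this; the module names are made up for illustration. The first backward() keeps the shared graph alive with retain_graph=True so the second backward() can traverse it again.

```python
import torch

x = torch.randn(5, 3)
shared = torch.nn.Linear(3, 3)   # the part of the graph both losses use
head1 = torch.nn.Linear(3, 1)
head2 = torch.nn.Linear(3, 1)

h = shared(x)                    # shared intermediate activations
loss1 = head1(h).mean()
loss2 = head2(h).mean()

loss1.backward(retain_graph=True)  # the graph through `shared` must survive
loss2.backward()                   # second backward through the same graph
```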


@greaber, if you have two different loss functions, finish the forward passes for both of them separately, and then finally you can do (loss1 + loss2).backward(). It’s a bit more efficient and skips quite some computation, I believe.
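The suggestion above, sketched with the same illustrative modules as before: do both forwards first, then a single backward on the summed loss, so the shared graph is traversed only once and retain_graph=True is not needed.

```python
import torch

x = torch.randn(5, 3)
shared = torch.nn.Linear(3, 3)
head1 = torch.nn.Linear(3, 1)
head2 = torch.nn.Linear(3, 1)

h = shared(x)                  # finish both forwards first
loss1 = head1(h).mean()
loss2 = head2(h).mean()

(loss1 + loss2).backward()     # one backward pass over the whole graph
```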


But this assumes the different loss functions are used to compute grads for the same parameters, right? It doesn’t work in a GAN-like situation (although if the two loss functions are literally the negatives of each other, there might be some shortcut).

Yes, that’s right: it doesn’t work in a GAN-like situation.


You are right, I probably should have mentioned this. I left it out because it is a bit more of an “advanced” use case.
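For what it’s worth, the GAN-like case mentioned above is usually handled with separate optimizers and separate backward()/step() calls, since the two losses update disjoint parameter sets. A hedged sketch (the names G, D are illustrative, not from the thread):

```python
import torch

G = torch.nn.Linear(2, 2)      # "generator" stand-in
D = torch.nn.Linear(2, 1)      # "discriminator" stand-in
opt_g = torch.optim.SGD(G.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)

z = torch.randn(4, 2)
fake = G(z)

# Update D only; detach() so no gradients flow back into G here.
opt_d.zero_grad()
d_loss = D(fake.detach()).mean()
d_loss.backward()
opt_d.step()

# Update G only; this backward goes through D, but opt_g steps only G.
opt_g.zero_grad()
g_loss = -D(fake).mean()
g_loss.backward()
opt_g.step()
```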


Thank you! I found it really helpful :slight_smile:

I’ve also been trying to understand how things happen at the lower level of the cuDNN API calls. It seems that the loss and the weight updates are the responsibility of the optimizer; the CUDA/cuDNN side just handles the output and gradient computation.

I wonder about this too. There is no explicit connection between the optimizer and loss objects in a program. Are they related implicitly via global variables, i.e. is the loss.backward() data recorded somewhere?

I suppose there should be a more obvious call, like this:


Remember we defined optimizer = optim.SGD(model.parameters())?


Yes, but the loss function does not deal with parameters, only with predictions.
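To make the implicit link concrete, here is a minimal sketch (toy values chosen for illustration): backward() writes into each parameter’s .grad, and the optimizer holds references to the same parameter tensors, so step() simply reads .grad. No global state connects the loss object to the optimizer.

```python
import torch

w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=1.0)  # optimizer holds a reference to w

loss = (w - 3.0).pow(2).sum()   # the loss only sees the prediction graph...
loss.backward()                 # ...but writes d(loss)/dw into w.grad
print(w.grad)                   # tensor([-6.])
optimizer.step()                # w <- w - lr * w.grad = 0 - 1.0 * (-6) = 6
print(w)                        # tensor([6.], requires_grad=True)
```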