How are optimizer.step() and loss.backward() related?


I am pretty new to PyTorch and keep being surprised by its performance :slight_smile:

I have followed tutorials and there’s one thing that is not clear.

How are optimizer.step() and loss.backward() related?

Does optimizer.step() optimize based on the most recent loss.backward() call?

When I check the loss calculated by the loss function, it is just a Tensor and seems unrelated to the optimizer.

Here are my questions:

(1) Does optimizer.step() optimize based on the most recent loss.backward() call?

(2) What happens if I call backward() on several different losses and then call optimizer.step()?
Does the optimizer optimize based on all the previously called losses?

Thank you!

  1. optimizer.step() performs a parameter update based on the current gradient (stored in the .grad attribute of each parameter) and the update rule. As an example, the update rule for SGD is defined here:

  2. Calling .backward() multiple times accumulates the gradient (by addition) for each parameter. This is why you should call optimizer.zero_grad() after each .step() call. Note that after the first .backward() call, a second call is only possible after you have performed another forward pass.
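A minimal sketch of point 2, the accumulation behavior (toy tensor, not from the thread):

```python
import torch

x = torch.tensor([3.0], requires_grad=True)

y = (x ** 2).sum()
y.backward()
first = x.grad.clone()     # tensor([6.])

z = (x ** 2).sum()         # a fresh forward pass builds a fresh graph
z.backward()
second = x.grad.clone()    # tensor([12.]): the second backward() ADDED to .grad

x.grad.zero_()             # what optimizer.zero_grad() does for each parameter
```

If you forget to zero the gradients, every step effectively uses the sum of all gradients computed since the last reset.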

So for your first question, the update is not based on the “closest” call but on the .grad attribute. How you calculate the gradient is up to you.
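You can see that step() only reads the .grad field, not any particular backward() call, by setting the gradient by hand (a hypothetical toy parameter, just for illustration):

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.SGD([w], lr=0.1)

w.grad = torch.tensor([2.0])   # pretend some backward() call wrote this
opt.step()                     # SGD rule: w <- w - lr * w.grad
print(w.data)                  # tensor([0.8000])
```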


This is certainly not true if you specify retain_graph=True, and in some simple cases it seems to be possible to backpropagate multiple times even without specifying retain_graph=True (but I don’t understand why). Also, the docs for backward say about retain_graph,

But I am not sure if this is really true. In the architectures I have worked with, I have often had to specify retain_graph=True, and if there are more efficient ways of doing what I needed to do, I couldn’t find them. (Is there some explanation somewhere of what these more efficient workarounds are, in what cases they work, and in what apparently rare cases they fail?)

For instance, two cases I have encountered are (a) when you have two different loss functions, used to update different parameters but calculated using some of the same graph, and (b) when you have an RNN and want to do backpropagation through time with overlapping backprop regions (like backprop 512 steps, and then 256 steps later backprop another 512 steps).


@greaber, if you have two different loss functions, finish the forward passes for both of them separately, and then finally you can do (loss1 + loss2).backward(). It’s a bit more efficient and skips quite some computation, I believe.
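For the gradients themselves, the two styles agree: summing the losses before one backward() gives the same accumulated .grad as two backward() calls (with retain_graph=True on the first when the losses share part of the graph). A small check with made-up losses:

```python
import torch

def grads(two_calls):
    x = torch.tensor([2.0], requires_grad=True)
    shared = x ** 2                      # part of the graph used by BOTH losses
    loss1 = shared * 3
    loss2 = shared + x
    if two_calls:
        loss1.backward(retain_graph=True)  # keep the shared graph alive
        loss2.backward()
    else:
        (loss1 + loss2).backward()
    return x.grad

print(grads(True), grads(False))   # tensor([17.]) tensor([17.])
```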


But this assumes the different loss functions are used to compute grads for the same parameters, right? It doesn’t work in a GAN-like situation (although if the two loss functions are literally the negative of each other there might be some shortcut).


Yes, you’re right, it doesn’t work in a GAN-like situation.


You are right, I probably should have mentioned this. I left it out because it is a bit more of an “advanced” use case.


Thank you! I found it really helpful :slight_smile:

I’ve also been trying to understand how things happen at the lower level of the cuDNN API calls. It seems that the loss and the weight update are the responsibility of the optimizer, while the CUDA side just handles the output and gradient computation.

I wonder about this topic too. There is no explicit connection between the optimizer and loss objects in a program. Are they related implicitly via global variables, i.e. is the data from loss.backward() recorded somewhere?

I suppose there should be a more obvious call, like this:


Remember we defined optimizer = optim.SGD(model.parameters())?
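That definition is the whole link: the optimizer and autograd both hold references to the same Parameter objects, so nothing global is needed. backward() writes into each param.grad; step() reads it. A minimal sketch (hypothetical toy model, not from the thread):

```python
import torch

model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 2)
loss = model(x).pow(2).mean()   # the loss only sees predictions...
loss.backward()                 # ...but autograd traces back to the params

p = next(model.parameters())
print(p.grad is not None)       # True: the gradient landed on the parameter
opt.step()                      # step() consumes that same .grad tensor
```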


Yes, but the loss function does not deal with parameters, only with predictions.


What do you mean by the following sentence: “this assumes the different loss functions are used to compute grads for the same parameters”?

And what does a GAN-like situation refer to?

I am a little confused and not sure when to use (loss1 + loss2).backward() versus loss1.backward(retain_graph=True); loss2.backward(). Why is one more efficient than the other? Are the two methods mathematically equivalent?

According to Sam Bobel in the Stack Overflow question “What does the parameter retain_graph mean in the Variable’s backward() method?”, these two methods are not mathematically equivalent if one uses an adaptive-gradient optimizer like Adam, as shown below:

Do you agree with Sam Bobel?
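One hedged observation that may help reconcile this: if you call a single optimizer.step() after all the backward() calls, the two styles give identical results with any optimizer, including Adam, because the optimizer only ever sees the final accumulated .grad. The non-equivalence Sam Bobel describes arises when step() runs between the backward() calls, since Adam’s moment estimates then update on partial gradients. A sketch of the single-step case (toy tensor, independent graphs so no retain_graph is needed):

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)

loss1 = x ** 2       # two independent graphs
loss2 = 2 * x
loss1.backward()
loss2.backward()
print(x.grad)        # tensor([4.]), identical to (loss1 + loss2).backward()
opt.step()           # Adam sees only this accumulated .grad
```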

I add the second loss to the first loss and expect the gradients, weights, and results to change. But there is no change compared to when I use just one loss function. The first one is BCELoss and the second one is L1.
I checked the gradients in both cases, with loss1 and with loss1 + loss2, but the gradients were exactly the same. Adding the second loss has no effect on the gradient, even with loss1 + 10*loss2.
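If the gradients are exactly identical, that usually means loss2’s graph is not connected to the generator’s parameters at all; one common cause here is that torch.histc is not differentiable (and .cpu() on the fake further detaches it from the GPU graph). A quick sanity check with a made-up differentiable loss2 (hypothetical names, not your code) shows that a connected second loss must change the gradients:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
out = w * 2
loss1 = (out - 1).pow(2)
loss1.backward(retain_graph=True)
g1 = w.grad.clone()            # tensor([4.]) from loss1 alone
w.grad.zero_()

loss2 = out.abs()              # differentiable surrogate; histc() would NOT be
(loss1 + 10 * loss2).backward()
print(g1, w.grad)              # tensor([4.]) tensor([24.]): loss2 now contributes
```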

netG = Generator994(ngpu, nz, ngf).to(device)

optimizerG = optim.Adam(netG.parameters(), lr=lr2, betas=(beta1, 0.999))


output = netD(fake).view(-1)
# Calculate G's loss based on this output
loss1 = criterion(output, label)

xxx = torch.histc(Gaussy.squeeze(1).view(-1).cpu(), 100, min=0, max=1, out=None)

xxx1 = torch.histc(fake.squeeze(1).view(-1).cpu(), 100, min=0, max=1, out=None)


# Calculate gradients for G with the 2 losses

# Check the gradients
for param in netG.parameters():
    print(param.grad)

# Update G
optimizerG.step()
## ------------------
class Generator994(nn.Module):
    def __init__(self, ngpu, nz, ngf):
        super(Generator994, self).__init__()
        self.ngpu = ngpu
        self.nz = nz
        self.ngf = ngf
        self.l1 = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(self.nz, self.ngf * 8, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 8))
        # state size. (ngf*8) x 4 x 4
        self.l2 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 4))
        # state size. (ngf*4) x 8 x 8
        self.l3 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 2))
        # state size. (ngf*2) x 16 x 16
        self.l4 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 2, 1, 3, 1, 0, bias=False),
            nn.Sigmoid())
#            nn.Tanh()
        # state size. (nc) x 64 x 64

    def forward(self, input):
        out = self.l1(input)
        out = self.l2(out)
        out = self.l3(out)
        out = self.l4(out)
        return out
