What does the backward() function do?

I have two networks, net1 and net2.
Let's say loss1 and loss2 are the losses of the net1 and net2 classifiers,
and optimizer1 and optimizer2 are the optimizers of the two networks.

net2 is a pretrained network and I want to backprop the (gradients of the) loss of net2 into net1.
loss1 = …some loss defined
So, loss1 = loss1 + loss2 (let's say loss2 was defined earlier)

So I do
loss1.backward(retain_graph=True)  # what happens when I write this?

What is the difference between backward() and step()?

If I do not write loss1.backward(), what will happen?


loss.backward() computes dloss/dx for every parameter x which has requires_grad=True. These are accumulated into x.grad for every parameter x. In pseudo-code:

x.grad += dloss/dx

optimizer.step() updates the value of x using the gradient x.grad. For example, the SGD optimizer performs:

x += -lr * x.grad

optimizer.zero_grad() clears x.grad for every parameter x in the optimizer. It’s important to call this before loss.backward(), otherwise you’ll accumulate the gradients from multiple passes.
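Putting the three calls together, a minimal sketch of one training step (the model, data, and loss here are just placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3)
target = torch.randn(8, 1)

optimizer.zero_grad()                            # clear old gradients
loss = nn.functional.mse_loss(model(x), target)  # forward pass
loss.backward()                                  # accumulate dloss/dx into .grad
optimizer.step()                                 # x += -lr * x.grad
```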

If you have multiple losses (loss1, loss2) you can sum them and then call backward() once:

loss3 = loss1 + loss2
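For example (a toy sketch with two losses defined on the same parameter w):

```python
import torch

w = torch.randn(3, requires_grad=True)

loss1 = (w ** 2).sum()          # some loss
loss2 = (w - 1.0).abs().sum()   # another loss

loss3 = loss1 + loss2
loss3.backward()                # one pass; w.grad holds d(loss1 + loss2)/dw
```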

Hi @colesbury, thanks for your illustration.

I have one more question. Let's say I want to backprop loss3 into net1 and do not want to backprop loss2 into net2. In that case I should not call backward() on loss1 and loss2 separately.

I should only write

loss3 = loss1 + loss2
loss3.backward()

Right?

In case I have written loss2.backward() but have not written optimizer2.step(), will that affect my gradients when I compute loss3.backward()?

Does backward update the weights, if we do not use optimizer?


No, it just computes gradients.


Very clear explanation! If I have two losses: loss1 gets data from dataloader1 with a batch size of 4, while loss2 gets data from dataloader2 with a batch size of 4. Then what is the batch size of loss = loss1 + loss2? Is it 4 or 8?



No, it does not. Update happens only when you call step().
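A quick way to convince yourself, with a toy model (names here are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)
before = model.weight.clone()

loss = model(torch.randn(4, 2)).pow(2).mean()
loss.backward()                                # fills .grad, does NOT touch the weights
assert torch.equal(model.weight, before)

torch.optim.SGD(model.parameters(), lr=0.1).step()
assert not torch.equal(model.weight, before)   # now the weights moved
```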

I was reading here that:

In the next iteration, a fresh new graph is created and ready for back-propagation.

I am wondering. When exactly is the fresh new graph created? Is it when we call:

  • optimizer.zero_grad()
  • output.backward()
  • optimizer.step()

or at some other time?

I am wondering. When exactly is the fresh new graph created?

It’s created during the forward pass, i.e. when you write something like:

loss = criterion(model(input), target)

The graph is accessible through loss.grad_fn and the chain of autograd Function objects.

The graph is used by loss.backward() to compute gradients.

optimizer.zero_grad() and optimizer.step() do not affect the graph of autograd objects. They only touch the model’s parameters and the parameter’s grad attributes.
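You can see this by inspecting grad_fn right after the forward pass (a small sketch; the model and criterion are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = criterion(model(torch.randn(3, 4)), torch.randn(3, 2))  # graph built here

print(loss.grad_fn)                 # root node of the graph (an MSE backward op)
print(loss.grad_fn.next_functions)  # its upstream autograd nodes

optimizer.zero_grad()               # does not touch the graph
loss.backward()                     # uses (and by default frees) the graph
optimizer.step()                    # only touches parameters and their .grad
```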


If there are several branches / subgraphs, would it be beneficial or even possible to call loss.backward() on the subgraphs? I’m hoping that when one branch finishes early it might free up memory this way.

I know that you can add the losses together and do one losses.backward() btw

Here, x only represents the parameters that contribute to the loss, right? (i.e. loss.backward() doesn’t affect other variables that do not contribute to the loss.)


I want to implement the backward graph separately, which means dropping loss.backward() and substituting it with a network that accepts the error as input and gives the gradients in each layer. For example, for MSE loss it is intuitive to use error = target - output as the input to the backward graph (which, in a fully connected network, is the transpose of the forward graph).
PyTorch loss functions give the loss, not the tensor that is fed into the backward graph. Is there an easy way to access the input to the backward graph after computing the loss? (e.g. loss = nn.CrossEntropyLoss()(outputs, targets))


I’m not sure which input you are looking for, but you can pass the gradient directly to the backward function.
The default would be a scalar value of 1. If you need to provide a specific gradient, you could use loss.backward(gradient=...).
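For instance, with a non-scalar output (the weights tensor below is an arbitrary example):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2                       # non-scalar: y.backward() alone would raise an error

# Passing the gradient is equivalent to (y * weights).sum().backward()
weights = torch.tensor([1.0, 0.5, 0.25])
y.backward(gradient=weights)

print(x.grad)                   # dy/dx scaled by weights, i.e. 2 * weights
```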


Thank you for your response.
I explained my confusion with loss.backward() in another topic.

I have a very similar implementation; however, it asks for retain_graph=True, and that slows down the code so much that it’s impractical to train. Any thoughts?

Why is x.grad added with dloss/dx? Instead, shouldn’t it be multiplied by the learning rate and added to the old weight values?
Because new_weight = old_weight - (learning_rate) * x.grad

  1. We pass the 1st batch through the forward pass and compute the loss for the 1st batch.
  2. We run backpropagation to compute d_loss/dx for all layers.
  3. Then the optimizer updates the weights for the 1st batch via optimizer.step().
    For the second batch, will the updated weights from the 1st batch be used or what? And before calling backward() for the second batch, should we call optimizer.zero_grad() or what?

Hi @colesbury

I used two loss functions, loss = loss1 + loss2, and I expect to get different gradients than when I use just loss = loss1. But the gradient flow and the numbers were the same; indeed, adding the second loss does not have any effect. Would you please help me with that? I tried a different second loss, but the result did not change. The first loss is BCELoss and the second one is L1. I changed the sigmoid function to ReLU, but again the gradients from backward() with loss2 and without loss2 were the same!

netG = Generator(ngpu,nz,ngf).to(device)

optimizerG = optim.Adam(netG.parameters(), lr=lr2, betas=(beta1, 0.999))


output = netD(fake).view(-1)
# Calculate G's loss based on this output
loss1 = criterion(output, label)

xxx=torch.histc(Gaussy.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)

xxx1=torch.histc(fake.squeeze(1).view(-1).cpu(),100, min=0, max=1, out=None)

# Calculate gradients for G with 2 loss


for param in netG.parameters():

# Update G

## ------------------
class Generator(nn.Module):
    def __init__(self, ngpu, nz, ngf):
        super(Generator, self).__init__()
        self.ngpu = ngpu
        self.nz = nz
        self.ngf = ngf
        self.l1 = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(self.nz, self.ngf * 8, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 8))
            # state size. (ngf*8) x 4 x 4
        self.l2 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 4))
            # state size. (ngf*4) x 8 x 8
        self.l3 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 2))
            # state size. (ngf*2) x 16 x 16
        self.l4 = nn.Sequential(
            nn.ConvTranspose2d(self.ngf * 2, 1, 3, 1, 0, bias=False),
            nn.Sigmoid())
#            nn.Tanh()
            # state size. (nc) x 64 x 64

    def forward(self, input):
        out = self.l1(input)
        out = self.l2(out)
        out = self.l3(out)
        out = self.l4(out)
        return out

Double post with answer from here.


hi @colesbury, I am trying to do a similar thing where I have a reconstruction loss and a kernel alignment loss. They are calculated as below:

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        self.We1 = torch.nn.Parameter(torch.Tensor(input_length, args.hidden_size).uniform_(-1.0 / math.sqrt(input_length), 1.0 / math.sqrt(input_length)))
        self.We2 = torch.nn.Parameter(torch.Tensor(args.hidden_size, args.code_size).uniform_(-1.0 / math.sqrt(args.hidden_size), 1.0 / math.sqrt(args.hidden_size)))

        self.be1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        self.be2 = torch.nn.Parameter(torch.zeros([args.code_size]))

    def encoder(self, encoder_inputs):
        hidden_1 = torch.tanh(torch.matmul(encoder_inputs.float(), self.We1) + self.be1)
        code = torch.tanh(torch.matmul(hidden_1, self.We2) + self.be2)
        return code

    def decoder(self,encoder_inputs):
        code = self.encoder(encoder_inputs)

        # ----- DECODER -----
        if tied_weights:

            Wd1 = torch.transpose(self.We2, 0, 1)
            Wd2 = torch.transpose(self.We1, 0, 1)

        else:

            Wd1 = torch.nn.Parameter(
                torch.Tensor(args.code_size, args.hidden_size).uniform_(-1.0 / math.sqrt(args.code_size),
                                                                        1.0 / math.sqrt(args.code_size)))
            Wd2 = torch.nn.Parameter(
                torch.Tensor(args.hidden_size, input_length).uniform_(-1.0 / math.sqrt(args.hidden_size),
                                                                      1.0 / math.sqrt(args.hidden_size)))

        bd1 = torch.nn.Parameter(torch.zeros([args.hidden_size]))
        bd2 = torch.nn.Parameter(torch.zeros([input_length]))

        if lin_dec:
            hidden_2 = torch.matmul(code, Wd1) + bd1
        else:
            hidden_2 = torch.tanh(torch.matmul(code, Wd1) + bd1)

        dec_out = torch.matmul(hidden_2, Wd2) + bd2

        return dec_out

    def kernel_loss(self,code, prior_K):
        # kernel on codes
        code_K = torch.mm(code, torch.t(code))

        # ----- LOSS -----
        # kernel alignment loss with normalized Frobenius norm
        code_K_norm = code_K / torch.linalg.matrix_norm(code_K, ord='fro', dim=(- 2, - 1))
        prior_K_norm = prior_K / torch.linalg.matrix_norm(prior_K, ord='fro', dim=(- 2, - 1))
        k_loss = torch.linalg.matrix_norm(torch.sub(code_K_norm,prior_K_norm), ord='fro', dim=(- 2, - 1))
        return k_loss

# Initialize model
model = Model()

Now, during training I pass my training data as inputs to the encoder and decoder.

for ep in range(args.num_epochs):
    for batch in range(max_batches):
        # get input data
        dec_out = model.decoder(encoder_inputs)
        reconstruct_loss = torch.mean((dec_out - encoder_inputs) ** 2)
        enc_out = model.encoder(encoder_inputs)
        k_loss = model.kernel_loss(enc_out, prior_K)

        tot_loss = reconstruct_loss + args.w_reg * reg_loss + args.a_reg * k_loss
        tot_loss = tot_loss.float()

        # Backpropagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_gradient_norm)

This always gives me an error saying “RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).
Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need
to backward through the graph a second time”

It works only when I activate the retain_graph flag, but then training takes a huge amount of time. Can you please let me know what I am doing wrong here?

Thank you!