Backward pass with multiple forward passes

Hello,

I perform two forward passes before a single backward pass as follows:

y_a = net.forward(image_a)
y_b = net.forward(image_b)
loss.backward()
optimizer.step()

I would like to know how, concretely, the backward pass takes both y_a and y_b into account in the weight update.

I have already checked similar posts on the forum, but none of them clearly explains the math behind a backward pass over multiple forward passes.

Thank you,

In your current code snippet you don’t show how the loss is calculated, so it’s unclear how the loss.backward() pass would work.
In case you are calculating the loss using both outputs, Autograd will use these two computation graphs to calculate the gradients for all parameters. The actual backward operations depend on the operations used during the forward pass.
E.g. if you are adding both outputs together, the backward pass would call the backward of torch.add.
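For illustration, here is a minimal sketch of what such a combined loss could look like. The real loss isn't shown in the snippet, so the stand-in modules, shapes, and the addition are just placeholder assumptions:

import torch
import torch.nn as nn

# tiny stand-ins for the undefined names in the snippet above
net = nn.Linear(10, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
image_a = torch.randn(1, 10)
image_b = torch.randn(1, 10)

y_a = net(image_a)
y_b = net(image_b)

# assumed loss: both outputs are combined, so both graphs feed into one loss tensor
loss = (y_a + y_b).mean()
loss.backward()   # backpropagates through both forward passes
optimizer.step()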

Thank you for your answer.

The loss function is a little bit long. In brief, I sum up the difference between my two predictions.

But, since the two forward passes are completely independent, how are the two computation graphs combined to calculate the gradients for all parameters?

Thanks again.

Based on your description I don’t think they are completely independent, since you are summing them at one point. Autograd will treat the sum as a standard addition and will backpropagate through it as seen here:

import torch

# two independent branches, each with its own leaf tensor
x = torch.randn(1, requires_grad=True)
lossX = x * 2

y = torch.randn(1, requires_grad=True)
lossY = y * 3

# the addition joins both graphs into a single output
out = lossX + lossY
out.backward()
print(x.grad)  # tensor([2.])
print(y.grad)  # tensor([3.])

Note that the loss calculation could of course be more complicated than in this simple example.

Yes, these are completely independent forward passes, but at some point in the loss calculation you use the outputs of both passes (directly or indirectly).
So backpropagation will follow the formula of the calculated loss.
In the case of a sum, the gradient is distributed to both forward branches; the ratio can depend on the outputs of the forward passes.
So how the loss gradient is passed back to the individual branches depends on the loss formula and on the forward-pass outputs.
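For illustration, here is a toy example showing how the gradient reaching each branch depends on the formula that combines the two outputs (the tensors a and b stand in for the two branch outputs):

import torch

a = torch.tensor(2.0, requires_grad=True)  # output of branch A
b = torch.tensor(3.0, requires_grad=True)  # output of branch B

# sum: each branch receives the same gradient of 1
(a + b).backward()
print(a.grad, b.grad)  # tensor(1.) tensor(1.)

a.grad = None
b.grad = None

# product: each branch's gradient depends on the other branch's output
(a * b).backward()
print(a.grad, b.grad)  # tensor(3.) tensor(2.)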

One more thing:
If I understand correctly, you are using the same network (net.forward) for the two inputs image_a and image_b. In that case only the last calculated loss (with y_b) would be considered.
What you can do is compute the loss for y_a first and then for y_b, add them both, and then call loss.backward().
You can think about how the loss is calculated in the case of batches versus in the case of a single input image.

Thank you for your answers.

A simplified version of my loss function is loss = (y_a - y_b)

So yes, I am using both predictions in the loss computation. But since I call zero_grad only once, I think this leads to gradient accumulation rather than only taking into account the gradient from the last prediction at each step.
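A minimal sketch of this setup, using a tiny linear layer as a stand-in for the real net: both forward passes share the same parameters, and a single backward call through the combined loss writes the contributions of both passes into the same .grad buffers.

import torch
import torch.nn as nn

net = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

optimizer.zero_grad()
y_a = net(torch.randn(1, 4))
y_b = net(torch.randn(1, 4))

# simplified loss from above
loss = (y_a - y_b).mean()
loss.backward()
print(net.weight.grad)  # contains contributions from both forward passes
optimizer.step()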

Hi!
I want to ask a maybe trivial question regarding the issue with two forward passes and one backward pass. Is doing:

loss1 = loss_fun1(model(x))
loss2 = loss_fun2(model(y))
(loss1+loss2).backward()
optimizer.step()

the same as doing:

loss1 = loss_fun1(model(x))
loss1.backward()
loss2 = loss_fun2(model(y))
loss2.backward()
optimizer.step()

?
I’ve tried both cases, but neither seems to work; both throw the popular error: Trying to backward through the graph a second time etc.
Also, setting retain_graph=True doesn’t make it work either.

In my understanding, shouldn’t the gradients of both computational graphs created from loss1 and loss2 be accumulated? What am I missing?


Yes, if no dependency between the iterations is introduced, both approaches would work as seen here:

import torch
from torchvision import models

# setup
model = models.resnet18()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(1, 3, 224, 224)
y = torch.randn(1, 3, 224, 224)

# 1st approach
loss1 = model(x).mean()
loss2 = model(y).mean()
(loss1+loss2).backward()
optimizer.step()

# 2nd approach
loss1 = model(x).mean()
loss1.backward()
loss2 = model(y).mean()
loss2.backward()
optimizer.step()
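
For contrast, a minimal sketch of a dependency between the iterations that does trigger the "Trying to backward through the graph a second time" error: reusing an output whose graph was already freed by a previous backward call (using the same model, x, and y from the setup above).

out_x = model(x).mean()
out_x.backward()          # frees the graph of out_x

# reusing out_x creates a dependency on the already freed graph,
# so this backward raises the error
loss2 = model(y).mean() + out_x
loss2.backward()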

Hmm, your code works great, but when using approach #1 I still get the error:
Trying to backward through the graph a second time (or directly access saved variables after they have already been freed)…
except if I change the second loss function according to the comment.

Maybe it’s triggered by the loss functions I’m using?

    net.train()
    optimizer.zero_grad()

    pred = net(x)
    loss_nat = nn.CrossEntropyLoss()(pred, y).mean()
    true_prob = nn.Softmax(dim=1)(pred)

    pred = net(adv_x)
    adv_prob = nn.LogSoftmax(dim=1)(adv_prob)
    loss_rob = nn.KLDivLoss()(adv_prob, true_prob).mean() # this doesn't work
    # loss_nat = nn.CrossEntropyLoss()(pred, y).mean() # this works, although not desired

    loss = loss_nat + loss_rob/_lambda
    loss.backward()

In your code snippet you are using:

adv_prob = nn.LogSoftmax(dim=1)(adv_prob)

which is undefined in the first execution so I assume it’s a typo?

I’m not able to see the issue in this code snippet so could you check what the difference might be?

import torch
import torch.nn as nn
from torchvision import models

net = models.resnet18()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
x = torch.randn(1, 3, 224, 224)
adv_x = torch.randn(1, 3, 224, 224)
y = torch.randint(0, 1000, (1,))

optimizer.zero_grad()

pred = net(x)
loss_nat = nn.CrossEntropyLoss()(pred, y).mean()
true_prob = nn.Softmax(dim=1)(pred)

adv_prob = net(adv_x)
adv_prob = nn.LogSoftmax(dim=1)(adv_prob)
loss_rob = nn.KLDivLoss()(adv_prob, true_prob).mean() # this doesn't work
# loss_nat = nn.CrossEntropyLoss()(pred, y).mean() # this works, although not desired

loss = loss_nat + loss_rob
loss.backward()

Oh no! This typo was the problem, since adv_prob was already calculated somewhere above. Thanks @ptrblck!