Backward pass with multiple forward passes

Hello,

I perform two forward passes before a single backward pass as follows:

y_a = net.forward(image_a)
y_b = net.forward(image_b)
loss.backward()
optimizer.step()

I would like to know how, concretely, the backward pass takes both y_a and y_b into account in the weight update.

I have already checked similar posts on the forum, but none of them clearly explains the math behind a backward pass over multiple forward passes.

Thank you,

In your current code snippet you don’t show how the loss is calculated, so it’s unclear how the loss.backward() pass would work.
In case you are calculating the loss using both outputs, Autograd will use these two computation graphs to calculate the gradients for all parameters. The actual backward operations depend on the operations used during the forward pass.
E.g. if you are adding both outputs together, the backward pass would call the backward of torch.add.
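For illustration, here is a minimal sketch of what such a combined loss could look like. The real loss isn't shown in the snippet, so the stand-in modules, shapes, and the addition are just placeholder assumptions:

import torch
import torch.nn as nn

# tiny stand-ins for the undefined names in the snippet above
net = nn.Linear(10, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
image_a = torch.randn(1, 10)
image_b = torch.randn(1, 10)

y_a = net(image_a)
y_b = net(image_b)

# assumed loss: both outputs are combined, so both graphs feed into one loss tensor
loss = (y_a + y_b).mean()
loss.backward()   # backpropagates through both forward passes
optimizer.step()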

Thank you for your answer.

The loss function is a little bit long. In brief, I sum up the difference between my two predictions.

But, since the two forward passes are completely independent, how are the two computation graphs combined to calculate the gradients for all parameters?

Thanks again.

Based on your description I don’t think they are completely independent, since you are summing them at one point. Autograd will treat the sum as a standard addition and will backpropagate through it as seen here:

import torch

# two independent branches, each with its own leaf tensor
x = torch.randn(1, requires_grad=True)
lossX = x * 2

y = torch.randn(1, requires_grad=True)
lossY = y * 3

# the addition joins both graphs into a single output
out = lossX + lossY
out.backward()
print(x.grad)  # tensor([2.])
print(y.grad)  # tensor([3.])

Note that the loss calculation could of course be more complicated than in this simple example.

Yes, these are completely independent forward passes, but at some point in the loss calculation you use the outputs of both passes (directly or indirectly).
So backpropagation will follow the formula of the calculated loss.
In the case of a sum, the gradient is distributed to both forward branches; the ratio can depend on the outputs of the forward passes.
So how the loss gradient is passed back to the individual branches depends on the loss formula and on the forward-pass outputs.
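For illustration, here is a toy example showing how the gradient reaching each branch depends on the formula that combines the two outputs (the tensors a and b stand in for the two branch outputs):

import torch

a = torch.tensor(2.0, requires_grad=True)  # output of branch A
b = torch.tensor(3.0, requires_grad=True)  # output of branch B

# sum: each branch receives the same gradient of 1
(a + b).backward()
print(a.grad, b.grad)  # tensor(1.) tensor(1.)

a.grad = None
b.grad = None

# product: each branch's gradient depends on the other branch's output
(a * b).backward()
print(a.grad, b.grad)  # tensor(3.) tensor(2.)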

One more thing:
If I understand correctly, you are using the same network (net.forward) for the two inputs image_a and image_b. In that case only the last calculated loss (with y_b) would be considered.
What you can do is compute the loss for y_a first and then for y_b, add them both, and then call loss.backward().
You can think about how the loss is calculated in the case of batches versus in the case of a single input image.

Thank you for your answers.

A simplified version of my loss function is loss = (y_a - y_b)

So yes, I am using both predictions in the loss computation. But since I call zero_grad only once, I think this leads to gradient accumulation rather than only taking into account the gradient from the last prediction at each step.
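A minimal sketch of this setup, using a tiny linear layer as a stand-in for the real net: both forward passes share the same parameters, and a single backward call through the combined loss writes the contributions of both passes into the same .grad buffers.

import torch
import torch.nn as nn

net = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

optimizer.zero_grad()
y_a = net(torch.randn(1, 4))
y_b = net(torch.randn(1, 4))

# simplified loss from above
loss = (y_a - y_b).mean()
loss.backward()
print(net.weight.grad)  # contains contributions from both forward passes
optimizer.step()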

Hi!
I want to ask a maybe trivial question regarding the issue with two forward passes and one backward pass. Is doing:

loss1 = loss_fun1(model(x))
loss2 = loss_fun2(model(y))
(loss1+loss2).backward()
optimizer.step()

the same as doing:

loss1 = loss_fun1(model(x))
loss1.backward()
loss2 = loss_fun2(model(y))
loss2.backward()
optimizer.step()

?
I’ve tried both cases, but neither seems to work; both throw the popular error: Trying to backward through the graph a second time etc.
Also, setting retain_graph=True doesn’t make it work either.

In my understanding, shouldn’t the gradients of both computational graphs created from loss1 and loss2 be accumulated? What am I missing?


Yes, if no dependency between the iterations is introduced, both approaches would work as seen here:

import torch
from torchvision import models

# setup
model = models.resnet18()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(1, 3, 224, 224)
y = torch.randn(1, 3, 224, 224)

# 1st approach
loss1 = model(x).mean()
loss2 = model(y).mean()
(loss1+loss2).backward()
optimizer.step()

# 2nd approach
loss1 = model(x).mean()
loss1.backward()
loss2 = model(y).mean()
loss2.backward()
optimizer.step()
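
For contrast, a minimal sketch of a dependency between the iterations that does trigger the "Trying to backward through the graph a second time" error: reusing an output whose graph was already freed by a previous backward call (using the same model, x, and y from the setup above).

out_x = model(x).mean()
out_x.backward()          # frees the graph of out_x

# reusing out_x creates a dependency on the already freed graph,
# so this backward raises the error
loss2 = model(y).mean() + out_x
loss2.backward()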

Hmm, your code works great, but when using approach #1 I still get the error:
Trying to backward through the graph a second time (or directly access saved variables after they have already been freed)…
except if I change the second loss function according to the comment.

Maybe it’s triggered by the loss functions I’m using?

    net.train()
    optimizer.zero_grad()

    pred = net(x)
    loss_nat = nn.CrossEntropyLoss()(pred, y).mean()
    true_prob = nn.Softmax(dim=1)(pred)

    pred = net(adv_x)
    adv_prob = nn.LogSoftmax(dim=1)(adv_prob)
    loss_rob = nn.KLDivLoss()(adv_prob, true_prob).mean() # this doesn't work
    # loss_nat = nn.CrossEntropyLoss()(pred, y).mean() # this works, although not desired

    loss = loss_nat + loss_rob/_lambda
    loss.backward()

In your code snippet you are using:

adv_prob = nn.LogSoftmax(dim=1)(adv_prob)

which is undefined in the first execution so I assume it’s a typo?

I’m not able to see the issue in this code snippet so could you check what the difference might be?

import torch
import torch.nn as nn
from torchvision import models

net = models.resnet18()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
x = torch.randn(1, 3, 224, 224)
adv_x = torch.randn(1, 3, 224, 224)
y = torch.randint(0, 1000, (1,))

optimizer.zero_grad()

pred = net(x)
loss_nat = nn.CrossEntropyLoss()(pred, y).mean()
true_prob = nn.Softmax(dim=1)(pred)

adv_prob = net(adv_x)
adv_prob = nn.LogSoftmax(dim=1)(adv_prob)
loss_rob = nn.KLDivLoss()(adv_prob, true_prob).mean() # this doesn't work
# loss_nat = nn.CrossEntropyLoss()(pred, y).mean() # this works, although not desired

loss = loss_nat + loss_rob
loss.backward()

Oh no! This typo was the problem, since adv_prob was already calculated somewhere above. Thanks @ptrblck!