Backward pass with multiple forward passes

Hello,

I perform two forward passes before a single backward pass as follows:

y_a = net.forward(image_a)
y_b = net.forward(image_b)
loss.backward()
optimizer.step()

I would like to know how, concretely, the backward pass takes both y_a and y_b into account in the weight update.

I have already checked similar posts on the forum, but none of them clearly explains the mathematics of backpropagating through multiple forward passes.

Thank you,

Your current code snippet doesn't show how the loss is calculated, so it's unclear how the loss.backward() call would work.
In case you are calculating the loss using both outputs, Autograd will use both computation graphs to calculate the gradients for all parameters. The actual backward operations depend on the operations used during the forward pass.
E.g. if you add both outputs together, the backward pass will call the backward of torch.add.

Thank you for your answer.

The loss function is a little bit long. In brief, I sum up the difference between my two predictions.

But since the two forward passes are completely independent, how are the two computation graphs combined to calculate the gradients for all parameters?

Thanks again.

Based on your description I don't think they are completely independent, since you are summing them at one point. Autograd treats the sum as a standard addition and backpropagates through it, as seen here:

x = torch.randn(1, requires_grad=True)
lossX = x * 2  # d(lossX)/dx = 2

y = torch.randn(1, requires_grad=True)
lossY = y * 3  # d(lossY)/dy = 3

out = lossX + lossY  # the addition joins the two graphs
out.backward()       # propagates a gradient of 1 into each summand
print(x.grad)  # tensor([2.])
print(y.grad)  # tensor([3.])

Note that the loss calculation could of course be more complicated than in this simple example.

Yes, these are completely independent forward passes, but at some point in the loss calculation you use the output of both passes (directly or indirectly).
Backpropagation then follows the chain rule through the loss formula you computed.
In the case of a plain sum, the gradient flows equally into both forward branches; for other loss formulas, the share each branch receives can depend on the forward-pass outputs.
So how the gradient is passed to the individual branches depends on the loss calculation and on the forward-pass outputs.
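To illustrate that last point with a minimal sketch (the concrete values here are made up for illustration): for a sum, each branch receives a gradient of 1, while for a product, each branch's gradient is the other branch's forward output, so the split depends on the forward values.

```python
import torch

y_a = torch.tensor(2.0, requires_grad=True)
y_b = torch.tensor(3.0, requires_grad=True)

# Product loss: d(loss)/d(y_a) = y_b and d(loss)/d(y_b) = y_a,
# so each branch's gradient depends on the other branch's output.
loss = y_a * y_b
loss.backward()
print(y_a.grad)  # tensor(3.)
print(y_b.grad)  # tensor(2.)
```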

One more thing:
If I guess right, you are using the same branch (net.forward) for the two inputs image_a and image_b, so I believe it will consider only the last loss calculated (with y_b).
What you can do is compute the loss for y_a first, then the loss for y_b, add them both, and then call loss.backward().
You can think about how the loss is calculated in the case of batches vs. in the case of a single input image.
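A minimal sketch of that suggestion — note the MSE criterion, layer sizes, and zero target are assumptions for illustration, not from the original posts. It also shows that summing the two losses before one backward() gives the same parameter gradients as calling backward() on each loss separately, since gradients accumulate in .grad:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(3, 1)           # assumed toy model
criterion = nn.MSELoss()        # assumed loss criterion
target = torch.zeros(1, 1)

image_a = torch.randn(1, 3)
image_b = torch.randn(1, 3)

# One loss per forward pass, summed before a single backward call.
loss_a = criterion(net(image_a), target)
loss_b = criterion(net(image_b), target)
(loss_a + loss_b).backward()
grad_sum = net.weight.grad.clone()

# Equivalent: backward on each loss separately; the gradients
# accumulate in .grad because zero_grad() is not called in between.
net.zero_grad()
criterion(net(image_a), target).backward()
criterion(net(image_b), target).backward()
grad_acc = net.weight.grad.clone()

print(torch.allclose(grad_sum, grad_acc))  # True
```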

Thank you for your answers.

A simplified version of my loss function is loss = (y_a - y_b)

So yes, I am using both predictions in the loss computation. But since I call zero_grad only once, I think this leads to gradient accumulation, and not to only taking into account the last gradient from the last prediction at each step.
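For what it's worth, this can be checked on a toy model (the nn.Linear layer and input shapes below are assumptions for illustration): both forward passes contribute to the parameter gradients, not just the last one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 1)  # assumed toy model

image_a = torch.randn(1, 4)
image_b = torch.randn(1, 4)

# Two forward passes; each builds its own graph over the shared weights.
y_a = net(image_a)
y_b = net(image_b)

loss = (y_a - y_b).sum()
loss.backward()

# Since y = x @ W.T + b, the weight gradient is image_a - image_b,
# and the bias gradients (+1 and -1) cancel: both passes contribute.
print(net.weight.grad)
print(net.bias.grad)
```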