Run a network twice before backpropagating

I am trying to better understand how autograd works. Let’s say that I have 2 different batches of images, each of size b: x1 and x2, and a general model m that outputs a tensor of size b x 64. Let’s say that I also have another general model r which outputs a tensor of size b x 1. Let’s say that I do something like

o1 = m(x1)
o2 = m(x2)
output = r(o2 - o1)
loss = loss_fn(output, labels)
loss.backward()

Notice how m is used twice, sequentially, on two different batches. My question is: when the gradient is backpropagated, is the derivative going to be computed with respect to o1, o2, or both? In the latter case, how would it work?

The derivative will be computed with respect to both o1 and o2.
Let's call d = o2 - o1.
The gradient of the loss with respect to o1 is minus the gradient of the loss with respect to d,
and the gradient of the loss with respect to o2 is the gradient of the loss with respect to d itself.
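
If you want to check this yourself, here is a minimal sketch. The nn.Linear layers, the input size of 10, and mse_loss are only stand-ins I picked for m, r, and loss_fn; they are not from your code. retain_grad() is needed because o1, o2, and d are not leaf tensors, so their .grad would otherwise not be kept.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
b = 4

# Stand-ins (assumptions): any modules with these output shapes would do.
m = nn.Linear(10, 64)   # plays the role of m: outputs b x 64
r = nn.Linear(64, 1)    # plays the role of r: outputs b x 1
x1, x2 = torch.randn(b, 10), torch.randn(b, 10)
labels = torch.randn(b, 1)

o1 = m(x1)
o2 = m(x2)
d = o2 - o1

# Keep the gradients of these intermediate (non-leaf) tensors.
for t in (o1, o2, d):
    t.retain_grad()

loss = nn.functional.mse_loss(r(d), labels)
loss.backward()

# d = o2 - o1, so dL/do2 = dL/dd and dL/do1 = -dL/dd.
o2_matches = torch.allclose(o2.grad, d.grad)
o1_matches = torch.allclose(o1.grad, -d.grad)
print(o2_matches, o1_matches)
```

Both checks should hold for any choice of m, r, and loss function, since they only depend on the subtraction.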

Are these gradients summed up when performing the update step?

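Yes. The parameters of m receive one gradient contribution through o1 and one through o2, and autograd sums them into each parameter's .grad during loss.backward(), so the update step uses the sum.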
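
A rough way to see the summation, again with nn.Linear and mse_loss as assumed stand-ins for m, r, and loss_fn: compute the gradient of m.weight with both paths active, then with one path cut by detach(), and compare. The helper weight_grad is just something I made up for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
b = 4

# Stand-ins (assumptions) for the models and data in the question.
m = nn.Linear(10, 64)
r = nn.Linear(64, 1)
x1, x2 = torch.randn(b, 10), torch.randn(b, 10)
labels = torch.randn(b, 1)


def weight_grad(detach_o1=False, detach_o2=False):
    """Gradient of the loss w.r.t. m.weight, optionally cutting one path."""
    o1 = m(x1)
    o2 = m(x2)
    if detach_o1:
        o1 = o1.detach()  # no gradient flows back to m through o1
    if detach_o2:
        o2 = o2.detach()  # no gradient flows back to m through o2
    loss = nn.functional.mse_loss(r(o2 - o1), labels)
    (g,) = torch.autograd.grad(loss, m.weight)
    return g


g_both = weight_grad()                    # both uses of m contribute
g_via_o1 = weight_grad(detach_o2=True)    # only the o1 path
g_via_o2 = weight_grad(detach_o1=True)    # only the o2 path

# The full gradient is the sum of the two per-path contributions.
is_sum = torch.allclose(g_both, g_via_o1 + g_via_o2, atol=1e-6)
print(is_sum)
```

Detaching does not change the loss value, only which path the gradient can take, so the two partial gradients add up to the one you get from the normal backward pass.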
