I am trying to understand better how autograd works. Let's say that I have 2 different batches of images, x1 and x2, each with batch size b, and a general model m that outputs a tensor of size b x 64. Let's say that I also have another general model r which outputs a tensor of size b x 1. Let's say that I do something like
```python
o1 = m(x1)
o2 = m(x2)
output = r(o2 - o1)
loss = loss_fn(output, labels)
loss.backward()
```
Notice how m is used twice, sequentially, on two different batches. My question is: when the gradient is backpropagated, is the derivative computed with respect to o1, with respect to o2, or with respect to both? And in the latter case, how does that work?
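For concreteness, here is a minimal self-contained version of the setup above, with small hypothetical `Linear` layers standing in for the general models m and r (the 2-feature input size and the loss are assumptions for illustration). It checks empirically that m's parameter gradient is the sum of the contributions flowing back through o1 and through o2, by detaching one branch at a time:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b = 4

# Hypothetical stand-ins: m maps each input to 64 features, r maps 64 -> 1.
m = torch.nn.Linear(2, 64)
r = torch.nn.Linear(64, 1)

x1 = torch.randn(b, 2)
x2 = torch.randn(b, 2)
labels = torch.randn(b, 1)

# Combined graph: m is used twice, so its parameters receive
# gradient contributions from BOTH the o1 path and the o2 path.
o1 = m(x1)
o2 = m(x2)
loss = F.mse_loss(r(o2 - o1), labels)
loss.backward()
g_both = m.weight.grad.clone()

# Isolate each path by detaching the other branch. detach() leaves the
# forward values unchanged, so the two partial gradients should sum
# to the combined gradient.
m.zero_grad()
F.mse_loss(r(m(x2) - m(x1).detach()), labels).backward()
g_via_o2 = m.weight.grad.clone()

m.zero_grad()
F.mse_loss(r(m(x2).detach() - m(x1)), labels).backward()
g_via_o1 = m.weight.grad.clone()

print(torch.allclose(g_both, g_via_o1 + g_via_o2, atol=1e-6))  # True
```

In other words, autograd records each call to m as a separate node in the graph, and the gradients from both uses are accumulated into the same shared parameters during `backward()`.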