I am trying to understand better how autograd works. Let's say that I have 2 different batches of images, x1 and x2, each with batch size b, and a general model m that outputs a tensor of size b x 64. Let's say that I also have another general model r which outputs a tensor of size b x 1. Let's say that I do something like
```python
o1 = m(x1)
o2 = m(x2)
output = r(o2 - o1)
loss = loss_fn(output, labels)
loss.backward()
```
Notice how m is used twice, sequentially, on two different batches. My question is: when the gradient is backpropagated, is the derivative computed with respect to o1, with respect to o2, or with respect to both? And in the latter case, how does that work?
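For concreteness, here is a minimal self-contained version of the setup above, with small hypothetical `Linear` layers standing in for the general models m and r (the 2-feature input size and the loss are assumptions for illustration). It checks empirically that m's parameter gradient is the sum of the contributions flowing back through o1 and through o2, by detaching one branch at a time:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b = 4

# Hypothetical stand-ins: m maps each input to 64 features, r maps 64 -> 1.
m = torch.nn.Linear(2, 64)
r = torch.nn.Linear(64, 1)

x1 = torch.randn(b, 2)
x2 = torch.randn(b, 2)
labels = torch.randn(b, 1)

# Combined graph: m is used twice, so its parameters receive
# gradient contributions from BOTH the o1 path and the o2 path.
o1 = m(x1)
o2 = m(x2)
loss = F.mse_loss(r(o2 - o1), labels)
loss.backward()
g_both = m.weight.grad.clone()

# Isolate each path by detaching the other branch. detach() leaves the
# forward values unchanged, so the two partial gradients should sum
# to the combined gradient.
m.zero_grad()
F.mse_loss(r(m(x2) - m(x1).detach()), labels).backward()
g_via_o2 = m.weight.grad.clone()

m.zero_grad()
F.mse_loss(r(m(x2).detach() - m(x1)), labels).backward()
g_via_o1 = m.weight.grad.clone()

print(torch.allclose(g_both, g_via_o1 + g_via_o2, atol=1e-6))  # True
```

In other words, autograd records each call to m as a separate node in the graph, and the gradients from both uses are accumulated into the same shared parameters during `backward()`.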