I am trying to understand better how autograd works. Let's say that I have two different batches of images, `x1` and `x2`, each of size `b`, and a general model `m` that outputs a tensor of shape `b x 64`. Let's say that I also have another general model `r` which outputs a tensor of shape `b x 1`. Let's say that I do something like

```
o1 = m(x1)
o2 = m(x2)
output = r(o2 - o1)
loss = loss_fn(output, labels)
loss.backward()
```

Notice how `m` is used twice, sequentially, on two different batches. My question is: when the gradient is backpropagated, is the derivative going to be computed with respect to `o1`, `o2`, or both? In the latter case, how would that work?
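For concreteness, here is a minimal self-contained version of the setup I mean (the `Linear` layers and the input size of 32 are just arbitrary placeholders for the "general models", and `mse_loss` stands in for `loss_fn`):

```python
import torch
import torch.nn as nn

b = 4
m = nn.Linear(32, 64)  # placeholder for the shared model m: (b, 32) -> (b, 64)
r = nn.Linear(64, 1)   # placeholder for the model r: (b, 64) -> (b, 1)

x1 = torch.randn(b, 32)
x2 = torch.randn(b, 32)
labels = torch.randn(b, 1)

o1 = m(x1)
o2 = m(x2)
output = r(o2 - o1)
loss = nn.functional.mse_loss(output, labels)
loss.backward()

# After backward(), m.weight.grad is populated even though m was
# called twice; I'd like to understand what this gradient contains.
print(m.weight.grad.shape)  # torch.Size([64, 32])
```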