I understand that a main motivation for updating our networks in batches (or mini-batches) is that PyTorch can update using the average gradient computed from all losses in the batch. This raises a distinction between “averaging the gradient” and “averaging the loss,” which leaves me a bit confused, since in practice we generally manipulate losses.
Is calling `.mean()` on a loss tensor and then `.backward()` the same as backpropagating the average gradient, or is there more nuance? And is this always true?
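To make the comparison concrete, here is a minimal sketch of the two routes I mean. The linear model, the squared-error loss, and the names `w`, `x`, `y` are all made up for illustration; I am only assuming the standard `torch.autograd.grad` call.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: a linear model and a batch of 4 samples.
w = torch.randn(3, requires_grad=True)
x = torch.randn(4, 3)
y = torch.randn(4)

# Per-sample squared-error losses, shape (4,).
losses = (x @ w - y) ** 2

# Route 1: average the losses, then backpropagate once.
(grad_of_mean,) = torch.autograd.grad(losses.mean(), w, retain_graph=True)

# Route 2: backpropagate each loss separately, then average the gradients.
per_sample = [torch.autograd.grad(l, w, retain_graph=True)[0] for l in losses]
mean_of_grads = torch.stack(per_sample).mean(dim=0)

# Compare the two results up to floating-point tolerance.
same = torch.allclose(grad_of_mean, mean_of_grads, atol=1e-6)
print(same)
```

Since differentiation is linear, I would expect the two routes to agree in this toy case; what I am unsure about is whether that carries over to every setup.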
I ask “is it always” because I’m thinking particularly of an RL application I’m toying with where I don’t have true batches. Rather, I calculate a loss per episode (zero-dimensional, just a number with grad), and I could turn those losses into a pseudo-batch by appending them to a list. I then wonder whether calling `torch.tensor()` on that list, followed by `my_pseudo_batch_loss.backward()`, provides the same “average gradient” functionality as typical batch updates.
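Here is a rough sketch of the pseudo-batch idea, with a made-up per-episode loss standing in for my real one. It also tries `torch.stack` next to `torch.tensor`, since as far as I can tell the two treat the autograd graph differently. In the `torch.tensor` case I pass plain values via `.item()` to make the copy explicit; that is my simplification, not something from my actual code.

```python
import torch

torch.manual_seed(0)

w = torch.randn(3, requires_grad=True)

# Hypothetical stand-in for per-episode losses: zero-dimensional tensors with grad.
episode_losses = []
for _ in range(5):
    x = torch.randn(3)
    episode_losses.append((x @ w) ** 2)

# torch.stack keeps each loss attached to the autograd graph,
# so the mean can be backpropagated to w.
stacked = torch.stack(episode_losses)
stacked.mean().backward()
grad_from_stack = w.grad.clone()

# torch.tensor builds a fresh tensor from the values alone;
# the result carries no graph history back to w.
rebuilt = torch.tensor([l.item() for l in episode_losses])

print(stacked.grad_fn is not None)  # stacked tensor is part of the graph
print(rebuilt.requires_grad)        # rebuilt tensor is a plain leaf
```

In this sketch the stacked version produces a gradient on `w`, while the rebuilt tensor does not require grad, so calling `.backward()` on its mean would not reach the network parameters.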