Understanding when and how PyTorch calculates the average gradient when averaging loss for batches

jkking · September 21, 2022, 12:06pm

I understand that a main motivation for updating networks in batches (or mini-batches) is that PyTorch can update using the average gradient over all losses in the batch. This raises a distinction between “averaging the gradient” and “averaging the loss,” which leaves me a bit confused, since in practice we generally manipulate losses.

Is calling .mean() on a loss tensor and then .backward() the same as backpropagating the average gradient, or is there more nuance? And is this always true?
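To make the question concrete, here is a minimal sketch that compares the two quantities directly. The toy linear model, the random data, and the squared-error loss are all assumptions chosen for illustration, not anything specific to a real setup. Since differentiation is linear, the gradient of the mean loss should match the mean of the per-sample gradients in a case like this, where each sample's loss depends only on that sample.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model and data, used only for illustration.
model = torch.nn.Linear(3, 1)
x = torch.randn(4, 3)
y = torch.randn(4, 1)

# (a) Compute per-sample losses (shape: (4,)), average them, backpropagate once.
per_sample_loss = ((model(x) - y) ** 2).squeeze(1)
model.zero_grad()
per_sample_loss.mean().backward()
grad_of_mean_loss = model.weight.grad.clone()

# (b) Backpropagate each sample's loss separately, then average the gradients.
per_sample_grads = []
for i in range(x.shape[0]):
    model.zero_grad()
    loss_i = ((model(x[i:i + 1]) - y[i:i + 1]) ** 2).squeeze()
    loss_i.backward()
    per_sample_grads.append(model.weight.grad.clone())
mean_of_grads = torch.stack(per_sample_grads).mean(dim=0)

# Compare the two results up to floating-point tolerance.
grads_match = torch.allclose(grad_of_mean_loss, mean_of_grads, atol=1e-6)
print(grads_match)
```

One caveat on the "always" part: this comparison relies on each sample's loss being independent of the other samples. Layers that mix information across the batch, such as batch normalization in training mode, would make the per-sample computation in (b) differ from the batched one in (a).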

I ask “is it always true” because I’m thinking of an RL application I’m toying with where I don’t have true batches. Instead, I calculate a loss per episode (zero-dimensional, just a number with grad), and I could turn those losses into a pseudo-batch by appending them to a list. I wonder whether calling torch.tensor() on that list, then .mean() on the result, and finally my_pseudo_batch_loss.backward() provides the same “average gradient” behavior as typical batch updates.
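The pseudo-batch idea can be sketched as follows. The scalar parameter `w` and the three "episode losses" are hypothetical stand-ins for real per-episode losses. The sketch contrasts `torch.stack`, which keeps each loss connected to the autograd graph, with `torch.tensor`, which builds a new tensor from copied values and so carries no gradient history. The values are extracted with `.item()` here to make that copy explicit.

```python
import torch

# Hypothetical parameter and per-episode losses (zero-dimensional tensors
# that are still attached to the autograd graph).
w = torch.tensor(2.0, requires_grad=True)
episode_losses = [w * 1.0, w * 2.0, w * 3.0]

# torch.stack keeps the graph, so backward() reaches w.
stacked = torch.stack(episode_losses)
pseudo_batch_loss = stacked.mean()
pseudo_batch_loss.backward()
grad_from_stack = w.grad.clone()  # gradient of mean(w, 2w, 3w) with respect to w

# torch.tensor copies the numeric values into a fresh tensor with no history,
# so there is nothing to backpropagate through.
rebuilt = torch.tensor([loss.item() for loss in episode_losses])
rebuilt_requires_grad = rebuilt.requires_grad

print(grad_from_stack, rebuilt_requires_grad)
```

Under these assumptions, stacking the losses (or equivalently summing the list and dividing by its length) looks like the safer way to build the pseudo-batch, since the mean then behaves like the mean over an ordinary batch of losses.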