Does PyTorch average or sum gradients over a minibatch?

For example when you retrieve the gradients like so:

loss = F.nll_loss(output, target)
loss.backward()  # populates .grad on each parameter
for key, value in model.named_parameters():
    mygrad = value.grad

is mygrad the sum of gradients over the minibatch or the average?


Well, it depends on the loss function you use, right? All autograd does is calculate gradients; it has no notion of batching, so it can't behave differently depending on the batching mechanism. If you look at the docs of the loss function, you'll notice that size_average=True by default in your case, which means the loss is averaged across the batch (in current PyTorch this is reduction='mean'). Since the loss is averaged, the gradients you read from .grad are averaged over the minibatch as well; with reduction='sum' they would be summed instead.
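
You can check this yourself. Here is a minimal sketch using a toy linear model (the tensors and shapes are made up for illustration) that compares the gradients under the two reductions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 3)                    # minibatch of 4 examples
target = torch.tensor([0, 1, 0, 1])
w = torch.zeros(3, 2, requires_grad=True)

# reduction='mean' (the default): loss, and hence gradients, averaged over the batch
loss_mean = F.nll_loss(F.log_softmax(x @ w, dim=1), target, reduction='mean')
loss_mean.backward()
grad_mean = w.grad.clone()
w.grad = None

# reduction='sum': gradients summed over the batch
loss_sum = F.nll_loss(F.log_softmax(x @ w, dim=1), target, reduction='sum')
loss_sum.backward()
grad_sum = w.grad.clone()

# The summed gradient is batch_size times the averaged one
print(torch.allclose(grad_sum, 4 * grad_mean))  # prints True
```

So the answer for the default settings is: averaged, and the two conventions differ only by a factor of the batch size.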