Why does autograd accumulate gradients?

May I ask why PyTorch’s autograd accumulates gradients if optimizer.zero_grad() is not called?

What is an example use case for accumulated gradients?


You could simulate a larger batch size by accumulating the gradients of smaller batches and scaling them by the number of accumulation steps. This can be useful if, e.g., the larger batch size would be beneficial for training but doesn’t fit onto your GPU.
Accumulating the gradients gives you the ability to scale them manually afterwards, without enforcing any assumptions about your use case.
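A minimal sketch of this idea (a toy linear model and random data, not code from this thread): because .backward() adds into .grad, two mini-batches whose losses are each scaled by 1/2 produce the same gradient as one full batch with a mean-reduced loss.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()  # default reduction='mean'

data = torch.randn(8, 4)
target = torch.randn(8, 1)

# Reference: gradient from the full batch of 8 samples
model.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulate gradients over two mini-batches of 4 samples each
model.zero_grad()
accum_steps = 2
for x_chunk, y_chunk in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    # Scale each loss by the number of accumulation steps so the
    # summed gradients match the full-batch mean-reduced gradient
    loss = criterion(model(x_chunk), y_chunk) / accum_steps
    loss.backward()  # grads accumulate into .grad; no zero_grad() in between

accum_grad = model.weight.grad.clone()
print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The scaling step matters: with a mean-reduced loss, dividing each mini-batch loss by the number of accumulation steps makes the accumulated gradient equal to the full-batch gradient.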


Thanks for the explanation, @ptrblck.