Why does autograd accumulate gradients?

May I ask why PyTorch’s autograd accumulates gradients if optimizer.zero_grad() is not called?

What is an example use case for accumulated gradients?


You could simulate a larger batch size by accumulating the gradients of smaller batches and scaling them by the number of accumulation steps. This can be useful if, e.g., the larger batch size would be beneficial for training but doesn’t fit onto your GPU.
Accumulating the gradients gives you the ability to scale them manually afterwards, without enforcing any assumptions about your use case.
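A minimal sketch of this idea (a toy linear model and random data, not code from this thread): because .backward() adds into .grad, two mini-batches whose losses are each scaled by 1/2 produce the same gradient as one full batch with a mean-reduced loss.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()  # default reduction='mean'

data = torch.randn(8, 4)
target = torch.randn(8, 1)

# Reference: gradient from the full batch of 8 samples
model.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulate gradients over two mini-batches of 4 samples each
model.zero_grad()
accum_steps = 2
for x_chunk, y_chunk in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    # Scale each loss by the number of accumulation steps so the
    # summed gradients match the full-batch mean-reduced gradient
    loss = criterion(model(x_chunk), y_chunk) / accum_steps
    loss.backward()  # grads accumulate into .grad; no zero_grad() in between

accum_grad = model.weight.grad.clone()
print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The scaling step matters: with a mean-reduced loss, dividing each mini-batch loss by the number of accumulation steps makes the accumulated gradient equal to the full-batch gradient.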


Thanks for the explanation, @ptrblck.