How to implement gradient accumulation?

@ptrblck
So the uncertainty about how to handle batch norm with gradient accumulation still remains?
I didn't find any blog where I could get a solution, or a confirmation that gradient accumulation is beneficial without adjusting the batchnorm stats.


I’m not aware of any blog and would recommend looking at other implementations that successfully use gradient accumulation, such as NVIDIA’s DeepLearningExamples.
Based on a quick search it seems that Bert, Jasper, FastPitch, MaskRCNN, Transformer, TransformerXL, and NCF have a flag to set the number of gradient accumulation steps. You could take a look at some of these models and check whether the batchnorm layers (especially the momentum) are changed or whether batchnorm is just not used.
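For reference, here is a minimal sketch of what such a gradient accumulation loop could look like. All names (model, criterion, optimizer, dataloader, accum_steps) are placeholders for illustration, not taken from any of the linked repos, and the batchnorm momentum adjustment is only a heuristic assumption to make the running stats update more slowly; the linked examples do not necessarily do this.

```python
import torch
import torch.nn as nn

# Placeholder setup, assumed for illustration.
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10), nn.Linear(10, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # number of micro-batches accumulated per optimizer step

# Optional heuristic: slow down the running-stats update so the batchnorm
# statistics behave a bit more like those of one large batch (assumption,
# not taken from the linked examples).
for m in model.modules():
    if isinstance(m, nn.BatchNorm1d):
        m.momentum = m.momentum / accum_steps  # default momentum is 0.1

# Dummy data standing in for a real DataLoader.
dataloader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

optimizer.zero_grad()
for step, (data, target) in enumerate(dataloader):
    loss = criterion(model(data), target)
    # Scale the loss so the accumulated gradient approximates the average
    # over the effective (large) batch.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```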


Hi Gopal, thanks for your explanation, but I still don’t understand the purpose of “loss_sum” in your implementation, since you call backward on the “loss” term.

@4bach Good point. loss_sum doesn’t really do anything for training. It is just there to keep track of the average loss for logging purposes. The main step is loss.backward().
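To illustrate, here is a rough sketch of that kind of loop, with placeholder names (model, criterion, optimizer, dataloader, accum_steps) assumed for the example: loss_sum only accumulates numbers for logging, while loss.backward() is what actually accumulates the gradients used for the update.

```python
import torch
import torch.nn as nn

# Placeholder setup, assumed for illustration.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]
accum_steps = 4

loss_sum = 0.0
optimizer.zero_grad()
for step, (data, target) in enumerate(dataloader):
    loss = criterion(model(data), target) / accum_steps
    loss.backward()              # this accumulates the gradients in .grad
    loss_sum += loss.item()      # bookkeeping only, no effect on training
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        print(f"accumulated loss for logging: {loss_sum:.4f}")
        loss_sum = 0.0
```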
