@ptrblck
so the uncertainty about how to handle batch norm with accumulated gradients still remains?
I didn't find any blog where I could get a solution, or a confirmation that we can benefit from gradient accumulation without adjusting the batchnorm stats.
I'm not aware of any blog and would recommend looking at other implementations which successfully use gradient accumulation, such as NVIDIA's DeepLearningExamples.
Based on a quick search it seems that Bert, Jasper, FastPitch, MaskRCNN, Transformer, TransformerXL, and NCF have a flag to set the gradient accumulation steps. You could take a look at some of these models and check whether the batchnorm layers (especially the momentum) are changed or whether batchnorm is just not used.
Hi Gopal, thanks for your explanation, but I still don't understand the purpose of "loss_sum" in your implementation, since your backward call is on the "loss" term.
@4bach Good point. loss_sum doesn't really do anything for training; it is just there to keep track of the average loss for logging purposes. The main step is loss.backward().
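A minimal sketch of what that pattern typically looks like (the model, data, and variable names below are assumptions for illustration, not the original implementation): loss_sum only collects detached scalars for logging, while loss.backward() is what actually accumulates the gradients.

```python
import torch
import torch.nn as nn

# Hypothetical setup, for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_steps = 4

loss_sum = 0.0  # only used to log the running average, no effect on gradients
optimizer.zero_grad()
for step in range(100):
    data, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
    loss = criterion(model(data), target) / accum_steps
    loss.backward()           # this is the step that accumulates gradients
    loss_sum += loss.item()   # detached scalar, outside the autograd graph
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        print(f"avg loss over last {accum_steps} micro-batches: {loss_sum:.4f}")
        loss_sum = 0.0
```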