@ptrblck
so the uncertainty about how to handle batch norm with accumulated gradients still remains?
I didn't find any blog where I could get a solution, or a confirmation that we can benefit from gradient accumulation without adjusting the batchnorm stats.
I'm not aware of any blog and would recommend looking at other implementations which successfully use gradient accumulation, such as NVIDIA's DeepLearningExamples.
Based on a quick search it seems that Bert, Jasper, FastPitch, MaskRCNN, Transformer, TransformerXL, and NCF have a flag to set the gradient accumulation steps. You could take a look at some of these models and check whether the batchnorm layers (especially the momentum) are changed or whether batchnorm is just not used.
Hi Gopal, thanks for your explanation, but I still don't understand the purpose of "loss_sum" in your implementation, since your backward call is on the "loss" term.
@4bach Good point. loss_sum doesn't really do anything for training; it is just there to keep track of the average loss for logging purposes. The main step is loss.backward().
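A minimal sketch of what that pattern typically looks like (the model, data, and variable names below are assumptions for illustration, not the original implementation): loss_sum only collects detached scalars for logging, while loss.backward() is what actually accumulates the gradients.

```python
import torch
import torch.nn as nn

# Hypothetical setup, for illustration only.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
accum_steps = 4

loss_sum = 0.0  # only used to log the running average, no effect on gradients
optimizer.zero_grad()
for step in range(100):
    data, target = torch.randn(8, 10), torch.randint(0, 2, (8,))
    loss = criterion(model(data), target) / accum_steps
    loss.backward()           # this is the step that accumulates gradients
    loss_sum += loss.item()   # detached scalar, outside the autograd graph
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        print(f"avg loss over last {accum_steps} micro-batches: {loss_sum:.4f}")
        loss_sum = 0.0
```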