Gradient Accumulation in Detectron2

I was wondering whether calling optimizer.zero_grad() after after optimizer.step() has the same effect as the usual order within a single iteration? The reason for this is because I am trying to use gradient accumulation in Detectron2 for my model as memory size is limited. However, in Detectron2 every iteration step is defined as a function, including zeroing out gradients, backpropogation and weight update. Therefore, if I put optimizer.zero_grad at first, as step() is called for a new iteration, it will just zero out all gradients instead of accumulating it. If I were to accumulate gradients for a specified number of iterations, I was thinking to put optimizer.zero_grad() after the optimizer step() like the following:


In this case, I will only zero out graidents at a certain number of iterations. I am wondering if my thought process is correct? Thanks (model is trained with DDP).

Have you managed to solve this issue @Chung-Hao_Ku ?
If not, @ptrblck , any insight ?

Thanks in advance

Yes, you should not zero out the gradients without executing the optimizer.step() method as you would lose this backward pass.
This post explains different approaches for gradient accumulation.