Gradient Accumulation in Detectron2

Chung-Hao_Ku · May 19, 2022, 6:09pm

I was wondering whether calling optimizer.zero_grad() after optimizer.step() has the same effect as the usual order within a single iteration. I am asking because I am trying to use gradient accumulation in Detectron2 for my model, as memory is limited. However, in Detectron2 every iteration step is defined as a single function that includes zeroing out the gradients, backpropagation, and the weight update. Therefore, if I put optimizer.zero_grad() first, each time the iteration step is called it will zero out all gradients instead of accumulating them. To accumulate gradients over a specified number of iterations, I was thinking of putting optimizer.zero_grad() after optimizer.step(), like the following:

In this case, I will only zero out the gradients once every certain number of iterations. Is my thought process correct? Thanks (the model is trained with DDP).

Ahmad_Zobair_Surosh · August 17, 2022, 7:01am

Hi
Have you managed to solve this issue, @Chung-Hao_Ku?
If not, @ptrblck, any insight?

Thanks in advance

ptrblck · August 17, 2022, 7:26am

Yes, you should not zero out the gradients without executing the optimizer.step() method as you would lose this backward pass.
This post explains different approaches for gradient accumulation.
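
For readers following along, here is a minimal plain-PyTorch sketch of the pattern discussed in this thread: gradients are accumulated over several micro-batches, and optimizer.zero_grad() is called only right after optimizer.step(). This is not the original poster's code and not Detectron2's trainer. The function name train_with_accumulation, the toy nn.Linear model, the MSE loss, and accum_steps=4 are all assumptions chosen for illustration.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, batches, accum_steps):
    """Hypothetical helper: step the optimizer once every `accum_steps` micro-batches."""
    loss_fn = nn.MSELoss()
    num_updates = 0
    optimizer.zero_grad()  # start from clean gradients
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Divide by accum_steps so the summed gradients equal the gradient
        # of the mean loss over the accumulated micro-batches.
        (loss / accum_steps).backward()  # adds into .grad, does not overwrite it
        if i % accum_steps == 0:
            optimizer.step()       # update weights with the accumulated gradients
            optimizer.zero_grad()  # zero only after the step, never before it
            num_updates += 1
    return num_updates


# Toy usage: 8 micro-batches with accum_steps=4 gives 2 weight updates.
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]
num_updates = train_with_accumulation(model, optimizer, batches, accum_steps=4)
```

Since the model in question is trained with DDP, note that each backward() call normally triggers a gradient all-reduce. DistributedDataParallel provides a no_sync() context manager that can be wrapped around the micro-batches that do not end in a step to skip that communication; whether it is worth the extra code depends on the setup.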