I am using AMP's delay_unscale option to accumulate gradients. I am also using apex DDP to allreduce gradients across processes. During gradient accumulation I disable apex DDP's allreduce by calling its disable_allreduce() method on the wrapped model.
Just before the forward pass of the iteration where I want to reduce my gradients, I call enable_allreduce() on the apex DDP model, set delay_unscale to False, call backward() on the scaled loss, clip the gradients, and then step the optimizer.
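To make the question concrete, here is a minimal sketch of the loop I am describing. Names like `accum_steps`, `max_norm`, `criterion`, and `loader` are placeholders for my own setup; `enable_allreduce()`/`disable_allreduce()` and `amp.scale_loss(..., delay_unscale=...)` are the apex APIs mentioned above, and I am not certain this is the intended usage pattern:

```python
def is_update_step(step, accum_steps):
    """True on the iteration where gradients should be reduced and applied."""
    return (step + 1) % accum_steps == 0

def train(model, optimizer, criterion, loader, accum_steps, max_norm):
    # `model` is assumed to be wrapped in apex.parallel.DistributedDataParallel
    # and initialized together with `optimizer` via amp.initialize().
    import torch
    from apex import amp

    for step, (inputs, targets) in enumerate(loader):
        update = is_update_step(step, accum_steps)
        if update:
            model.enable_allreduce()   # allreduce fires during this backward
        else:
            model.disable_allreduce()  # skip allreduce while accumulating

        loss = criterion(model(inputs), targets) / accum_steps

        # delay_unscale=True keeps gradients in scaled form across the
        # accumulation steps; on the update step, delay_unscale=False so the
        # accumulated gradients are unscaled before clipping.
        with amp.scale_loss(loss, optimizer,
                            delay_unscale=not update) as scaled_loss:
            scaled_loss.backward()

        if update:
            torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer),
                                           max_norm)
            optimizer.step()
            optimizer.zero_grad()
```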
Is this a correct approach?