How to sync Optimizer parameters during DDP training


I am writing a custom optimizer that needs to access the global gradient from the previous iteration, as well as sync parameters across optimizer instances.

What would be the best way to do this?

In a non-distributed context, I simply stored the grads in the state variable, as Adam does with state['exp_avg_sq'].
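For context, here is a minimal sketch of what I mean: a plain SGD step that also caches the previous iteration's gradient in the per-parameter state dict, mirroring how Adam keeps state['exp_avg_sq'] (the class name `PrevGradSGD` is just illustrative):

```python
import torch
from torch.optim import Optimizer

class PrevGradSGD(Optimizer):
    """Plain SGD that also keeps a copy of last step's gradient in state."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                # Gradient from the previous iteration (None on first step).
                prev_grad = state.get("prev_grad")
                # Ordinary SGD update.
                p.add_(p.grad, alpha=-group["lr"])
                # Stash a copy for the next iteration, like Adam's exp_avg_sq.
                state["prev_grad"] = p.grad.detach().clone()
```

This works fine on a single process, but it's unclear to me what the right pattern is once each rank has its own optimizer instance.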

Should I be calling my own all_reduce during the training loop?


Hey @swenson-nick, DDP guarantees that all ranks see the same param.grad values (by automatically running allreduce during backward) before you get into optimizer.step(). Will that be sufficient? Or do you need access to the per-rank local gradients?
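Note that DDP only reduces gradients, not optimizer state. If you do end up needing to sync something stored in the optimizer (e.g. a cached tensor) yourself, a hedged sketch using an explicit all_reduce might look like this — it assumes a process group has already been set up via init_process_group, and falls back to a no-op when running non-distributed:

```python
import torch
import torch.distributed as dist

def sync_state_tensor(t: torch.Tensor) -> torch.Tensor:
    """Average a tensor across all ranks in place.

    No-op when torch.distributed is unavailable or uninitialized,
    so the same code path works in single-process runs.
    """
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        t.div_(dist.get_world_size())
    return t
```

You would call this after optimizer.step(), e.g. `sync_state_tensor(optimizer.state[p]["prev_grad"])` for each parameter, though for the gradients themselves DDP's built-in allreduce should already make this unnecessary.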