Does Adam’s state dict work correctly when we use DDP?
I think there is no synchronization for the optimizer. Do I have to write code for this manually?
Thanks,
DDP synchronizes the gradients during the backward pass using AllReduce operations, so each rank’s optimizer sees the same model parameters and the same gradients, and therefore builds the same internal state.
I don’t think there is a need to synchronize the optimizer’s state, but @mrshenli might correct me if I’m wrong.
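To make the reasoning concrete, here is a minimal pure-Python sketch (no torch, just a hand-rolled scalar Adam update, which is an illustration rather than PyTorch’s actual implementation): if two replicas start from the same parameters and receive identical, already all-reduced gradients each step, their Adam states stay identical, so no explicit optimizer sync is needed.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two "ranks" with the same initial parameter and optimizer state.
state = {rank: dict(p=0.5, m=0.0, v=0.0) for rank in (0, 1)}

# Identical gradients on every step -- this is what DDP's AllReduce guarantees.
for t, grad in enumerate([0.3, -0.1, 0.2], start=1):
    for rank in (0, 1):
        s = state[rank]
        s["p"], s["m"], s["v"] = adam_step(s["p"], grad, s["m"], s["v"], t)

assert state[0] == state[1]  # replicas stay bit-identical
```

The same argument applies per-tensor in the real setting: identical inputs to a deterministic update rule produce identical optimizer states on every rank.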
I just found out that `LayerNorm` with learnable parameters destroys the synchronization between the weights. `SyncBatchNorm` is a solution for `BatchNorm`, but there is no `SyncLayerNorm` for `LayerNorm`. I’m giving up on `DistributedDataParallel`, because adding more synchronization points for `LayerNorm` may not improve performance; in my case, almost every submodule has a `LayerNorm`.
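If the `LayerNorm` affine parameters really do drift apart across ranks, one possible workaround (a hypothetical sketch, not something from this thread) is to periodically average those parameters across replicas. In real PyTorch code this would be a `torch.distributed.all_reduce` over each parameter followed by division by the world size; the pure-Python illustration below just models the averaging step itself.

```python
def average_params(replicas):
    """Replace each replica's parameters with the cross-replica mean.

    Stands in for an all_reduce(SUM) / world_size over each parameter tensor.
    """
    n = len(replicas)
    mean = {k: sum(r[k] for r in replicas) / n for k in replicas[0]}
    for r in replicas:
        r.update(mean)

# Two replicas whose LayerNorm affine parameters have drifted apart.
replicas = [
    {"ln.weight": 1.5, "ln.bias": 0.25},
    {"ln.weight": 0.5, "ln.bias": -0.25},
]
average_params(replicas)
assert replicas[0] == replicas[1] == {"ln.weight": 1.0, "ln.bias": 0.0}
```

Whether this extra synchronization point is worth its communication cost depends on the model; with a `LayerNorm` in almost every submodule, as described above, the overhead could easily outweigh the benefit.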
By the way, I really appreciate your help. Thank you.
Did you find a way to sync `LayerNorm`?