Explain Adam state when using DDP

Does Adam's optimizer state (its state dict) stay correct when we use DDP?

I don't think the optimizer is synchronized across processes. Do I have to write that synchronization code myself?

Thanks,

DDP should synchronize the gradients during the backward pass using AllReduce operations, so each optimizer sees the same model parameters and the same gradients, and therefore builds up the same internal states.
I don't think there is a need to synchronize the optimizer's state, but @mrshenli might correct me if I'm wrong.
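To illustrate the reasoning above, here is a minimal sketch in pure Python (no PyTorch, no real process group). It simulates two ranks that start from the same parameter, compute different local gradients, average them, and then each run their own Adam update. The helpers `adam_step`, `all_reduce_mean`, `local_grad`, and `simulate` are hypothetical names made up for this example. The state keys `step`, `exp_avg`, and `exp_avg_sq` mirror the ones `torch.optim.Adam` stores, but this is a scalar toy model, not the library's implementation.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update. Mutates `state` and returns the new parameter."""
    state["step"] += 1
    state["exp_avg"] = beta1 * state["exp_avg"] + (1 - beta1) * grad
    state["exp_avg_sq"] = beta2 * state["exp_avg_sq"] + (1 - beta2) * grad * grad
    # Bias-corrected first and second moment estimates.
    m_hat = state["exp_avg"] / (1 - beta1 ** state["step"])
    v_hat = state["exp_avg_sq"] / (1 - beta2 ** state["step"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every rank receives the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def local_grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def simulate(world_size=2, steps=5):
    # Assumption: all ranks start from identical parameters.
    params = [0.5] * world_size
    states = [
        {"step": 0, "exp_avg": 0.0, "exp_avg_sq": 0.0} for _ in range(world_size)
    ]
    for step in range(steps):
        # Each rank sees different data, so the local gradients differ.
        local = [
            local_grad(params[r], x=float(r + 1), y=float(step + r))
            for r in range(world_size)
        ]
        # After averaging, every rank holds the same gradient.
        synced = all_reduce_mean(local)
        # Each rank runs its own, unsynchronized optimizer step.
        for r in range(world_size):
            params[r] = adam_step(params[r], synced[r], states[r])
    return params, states


params, states = simulate()
print(params[0] == params[1], states[0] == states[1])
```

Because Adam's state is a deterministic function of the gradient history, identical starting parameters plus identical averaged gradients give identical state on every rank in this sketch, without any extra communication for the optimizer.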


I just found out that LayerNorm with learnable parameters breaks the synchronization between the weights.
SyncBatchNorm is a solution for BatchNorm, but there is no SyncLayerNorm for LayerNorm. I have given up on DistributedDataParallel because it may not improve performance if I add more synchronization points for LayerNorm. In my case, almost all submodules have LayerNorm.
By the way, I really appreciate your help. Thank you. :slight_smile:

Yep, I agree with @ptrblck’s comment above.
