Does Adam’s state dict work correctly when we use DDP?
I think there is no synchronization for the optimizer. Do I have to write code for this manually?
Thanks,
DDP synchronizes the gradients during the backward pass using AllReduce operations, so each rank’s optimizer sees the same model parameters and the same gradients, and therefore builds the same internal state.
I don’t think there is a need to synchronize the optimizer’s state, but @mrshenli might correct me if I’m wrong.
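To make the reasoning concrete, here is a minimal pure-Python sketch (no torch, just a hand-rolled scalar Adam update, which is an illustration rather than PyTorch’s actual implementation): if two replicas start from the same parameters and receive identical, already all-reduced gradients each step, their Adam states stay identical, so no explicit optimizer sync is needed.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two "ranks" with the same initial parameter and optimizer state.
state = {rank: dict(p=0.5, m=0.0, v=0.0) for rank in (0, 1)}

# Identical gradients on every step -- this is what DDP's AllReduce guarantees.
for t, grad in enumerate([0.3, -0.1, 0.2], start=1):
    for rank in (0, 1):
        s = state[rank]
        s["p"], s["m"], s["v"] = adam_step(s["p"], grad, s["m"], s["v"], t)

assert state[0] == state[1]  # replicas stay bit-identical
```

The same argument applies per-tensor in the real setting: identical inputs to a deterministic update rule produce identical optimizer states on every rank.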
I just found out that `LayerNorm` with learnable parameters destroys the synchronization between the weights. `SyncBatchNorm` is a solution for `BatchNorm`, but there is no `SyncLayerNorm` for `LayerNorm`. I’m giving up on `DistributedDataParallel`, because adding more synchronization points for `LayerNorm` may not improve performance; in my case, almost every submodule has a `LayerNorm`.
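If the `LayerNorm` affine parameters really do drift apart across ranks, one possible workaround (a hypothetical sketch, not something from this thread) is to periodically average those parameters across replicas. In real PyTorch code this would be a `torch.distributed.all_reduce` over each parameter followed by division by the world size; the pure-Python illustration below just models the averaging step itself.

```python
def average_params(replicas):
    """Replace each replica's parameters with the cross-replica mean.

    Stands in for an all_reduce(SUM) / world_size over each parameter tensor.
    """
    n = len(replicas)
    mean = {k: sum(r[k] for r in replicas) / n for k in replicas[0]}
    for r in replicas:
        r.update(mean)

# Two replicas whose LayerNorm affine parameters have drifted apart.
replicas = [
    {"ln.weight": 1.5, "ln.bias": 0.25},
    {"ln.weight": 0.5, "ln.bias": -0.25},
]
average_params(replicas)
assert replicas[0] == replicas[1] == {"ln.weight": 1.0, "ln.bias": 0.0}
```

Whether this extra synchronization point is worth its communication cost depends on the model; with a `LayerNorm` in almost every submodule, as described above, the overhead could easily outweigh the benefit.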
By the way, I really appreciate your help. Thank you.
Did you find a way to sync `LayerNorm`?