QUESTION: Suppose each process has a different random generator state. When DistributedDataParallel is initialized, does each process need to have the same parameter values?
No. Rank 0 will broadcast model states to all other ranks when you construct DDP. Code for that is here.
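Below is a minimal sketch of what this looks like in practice. It assumes the process group has already been initialized (e.g. by torchrun / init_process_group) and that there is one GPU per process, so the local device index equals the global rank; the seed offset and the Linear model are just illustrative.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

rank = dist.get_rank()
device = torch.device("cuda", rank)

# Deliberately give every rank a different random state, so the local
# parameter initialization differs across processes.
torch.manual_seed(1234 + rank)
model = torch.nn.Linear(10, 10).to(device)

# The DDP constructor broadcasts rank 0's parameters (and buffers) to all
# other ranks, so no manual synchronization of the initial weights is needed.
ddp_model = DDP(model, device_ids=[device.index])
```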
In order to evaluate on one GPU, can we use ddp_model.module?
Yes, this should work.
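For example, a single-GPU evaluation on rank 0 could look like the sketch below, continuing from the setup above; val_loader and the metric computation are placeholders for your own evaluation loop.

```python
if dist.get_rank() == 0:
    model = ddp_model.module      # plain nn.Module, no DDP hooks involved
    model.eval()
    with torch.no_grad():
        for x, y in val_loader:   # hypothetical validation DataLoader
            out = model(x.to(device))
            # ... compute your metrics here ...
    model.train()                 # switch back before resuming training
```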
Can we use something like EMA to copy new parameters to ddp_model.module and then restore them after evaluation?
Yes, as long as you make sure you restore those parameter values correctly afterward. Otherwise, if this introduces inconsistency in parameter values across different processes, DDP will not fix that for you, since DDP only syncs gradients, not parameters. This might be helpful as further explanation.
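A possible swap-in/restore pattern is sketched below. Here ema_state is assumed to be a state-dict-like mapping of EMA weights you maintain yourself, and evaluate() is a placeholder; the important part is that the restore must be exact, or parameter values will drift apart across ranks.

```python
import copy

module = ddp_model.module
backup = copy.deepcopy(module.state_dict())   # snapshot the current trained weights
module.load_state_dict(ema_state)             # evaluate with the EMA weights
evaluate(module)                              # hypothetical evaluation routine
module.load_state_dict(backup)                # restore the exact original values
```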
In order to save the model, can we use ddp_model.module?
Yes. And when you restore from the checkpoint, it’s better to reconstruct the DDP instance using the restored module to make sure that DDP starts from a clean state.
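A save/resume sketch under those assumptions (the checkpoint path and the Linear model are illustrative):

```python
# Save only the unwrapped module, and only on rank 0 to avoid duplicate writes.
if dist.get_rank() == 0:
    torch.save(ddp_model.module.state_dict(), "checkpoint.pt")

# ... later, when resuming ...
model = torch.nn.Linear(10, 10).to(device)
model.load_state_dict(torch.load("checkpoint.pt", map_location=device))
ddp_model = DDP(model, device_ids=[device.index])   # rebuild DDP from a clean state
```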
Do we need to use torch.distributed.barrier so that the other processes don't continue training while the master evaluates?
It’s recommended to do it this way. But if you are not consuming the checkpoint right away and are not worried about a timeout caused by rank 0 doing more work, it is not strictly necessary, because the next DDP backward pass will launch allreduce comm ops, which will sync the ranks anyway. Some of this is also explained here.
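A short sketch of the barrier variant, continuing from the same setup:

```python
if dist.get_rank() == 0:
    # Rank 0 does the extra work (saving and/or evaluating).
    torch.save(ddp_model.module.state_dict(), "checkpoint.pt")

# Optional: all ranks wait here until rank 0 is done, so nobody reads the
# checkpoint early or sits in a collective long enough to hit a timeout.
dist.barrier()
```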