Proper DistributedDataParallel Usage

QUESTION: Suppose each process has a different random generator state; when DistributedDataParallel is initialized, does each process need to have the same parameter values?

No. Rank 0 will broadcast its model states (parameters and buffers) to all other ranks when you construct DDP. Code for that is here.
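A minimal sketch of that behavior, assuming the script is launched with `torchrun` so the process group can be initialized from environment variables: each rank deliberately uses a different seed, yet after wrapping in DDP every rank holds rank 0's parameters.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Assumes torchrun has set MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    torch.manual_seed(rank)                 # deliberately different init per rank
    model = nn.Linear(10, 10).cuda(rank)

    ddp_model = DDP(model, device_ids=[rank])
    # At this point DDP has broadcast rank 0's parameters and buffers,
    # so the printed sums match on every rank despite the different seeds.
    print(rank, next(ddp_model.parameters()).sum().item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```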

In order to evaluate on one GPU, can we use ddp_model.module?

Yes, this should work.
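For example, something along these lines (the `eval_loader` and the accuracy metric are placeholders, not part of any DDP API): only rank 0 runs evaluation, directly on the underlying module, outside of DDP's forward.

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def evaluate_on_rank0(ddp_model, eval_loader, device):
    if dist.get_rank() != 0:
        return None
    model = ddp_model.module          # plain nn.Module, no gradient sync involved
    model.eval()
    correct, total = 0, 0
    for x, y in eval_loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    model.train()
    return correct / total
```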

Can we use something like EMA to copy new parameters to ddp_model.module and then restore them after evaluation?

Yes, as long as you restore those model parameter values correctly afterwards. Otherwise, if this introduces inconsistent parameter values across different processes, DDP will not fix that for you, as DDP only syncs gradients, not parameters. This might help explain.
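A sketch of that copy/restore pattern, assuming a hypothetical `ema_params` dict mapping parameter names to EMA tensors on the same device; the backup and restore must be exact, or the ranks' parameters silently diverge and DDP will never correct them.

```python
import torch


@torch.no_grad()
def eval_with_ema(ddp_model, ema_params, run_eval):
    module = ddp_model.module
    # Back up the current training weights exactly.
    backup = {name: p.detach().clone() for name, p in module.named_parameters()}

    # Swap the EMA weights in for evaluation.
    for name, p in module.named_parameters():
        p.copy_(ema_params[name])

    result = run_eval(module)

    # Restore the exact training weights; DDP only syncs gradients,
    # so any mismatch introduced here would persist across ranks.
    for name, p in module.named_parameters():
        p.copy_(backup[name])
    return result
```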

In order to save the model, can we use ddp_model.module?

Yes. And when you restore from the checkpoint, it’s better to reconstruct the DDP instance using the restored module to make sure that DDP starts from a clean state.
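A sketch of both halves (the `nn.Linear(10, 10)` model is just a stand-in for your architecture): save only the underlying module's state_dict, and on resume load it into a fresh module before constructing a new DDP instance around it.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def save_checkpoint(ddp_model, path):
    # ddp_model.module.state_dict() carries no "module." prefix, so the
    # checkpoint can later be loaded with or without DDP.
    torch.save(ddp_model.module.state_dict(), path)


def load_into_ddp(path, rank):
    model = nn.Linear(10, 10).cuda(rank)              # same architecture as before
    # map_location keeps every rank from loading onto GPU 0.
    state = torch.load(path, map_location=f"cuda:{rank}")
    model.load_state_dict(state)
    return DDP(model, device_ids=[rank])              # DDP starts from a clean state
```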

Do we need to use torch.distributed.barrier so that the other processes don’t continue training while the master evaluates?

It’s recommended. But if you are not consuming the checkpoint right away and are not worried about a timeout because rank 0 is doing more work, it is not strictly necessary, since the next DDP backward pass will launch allreduce comm ops, which will sync the processes anyway. Some of this is also explained here.
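If you do need the barrier (for example, when the other ranks will read the checkpoint right away), a minimal sketch looks like this:

```python
import torch
import torch.distributed as dist


def checkpoint_and_sync(ddp_model, path, rank):
    if rank == 0:
        torch.save(ddp_model.module.state_dict(), path)
    # All ranks wait here until rank 0 has finished evaluating/writing.
    dist.barrier()
    # Now every rank can safely load `path` if it needs to.
```

Without the barrier, the other ranks simply run ahead until their next backward pass, where the allreduce forces them to wait for rank 0 anyway.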
