Distributed Data Parallel Param Sync in Forward

Hi,

I have a question about the DistributedDataParallel Module.

In the forward function of DDP, there is a _sync_params call, which broadcasts the model parameters from rank 0 to all other ranks to keep the model state identical across all processes.

(See the forward function in DDP and the _sync_params call in the source for reference.)

I want to clarify the actual use of this function call.

Consider a case where all processes start by initializing the model from the same set of known weights, so the weights are already uniform across processes. Is this sync useful in that case? Since it runs on every forward call, it could slow down training. Please correct me if I am wrong.

Another case is where a user wants to mutate the model weights in each process after a sync (all-reduce) step. Think of an instance where such a mutation helps discover diversity during training, and the all-reduce step then ensembles that diversity. In such cases, wouldn't this _sync_params call cancel the effect the user expects?
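To make that second case concrete, here is a minimal sketch of the kind of mutation I mean (the mutate_ helper and the noise scale are hypothetical, not anything DDP provides):

```python
import torch

@torch.no_grad()
def mutate_(module, rank, scale=1e-2):
    # Hypothetical mutation: perturb this rank's copy of the weights with
    # rank-specific Gaussian noise, applied after the optimizer step,
    # i.e., after gradients have already been all-reduced.
    gen = torch.Generator().manual_seed(rank)
    for p in module.parameters():
        noise = torch.randn(p.shape, generator=gen) * scale
        p.add_(noise.to(p.device))

# Usage inside the training loop, after optimizer.step():
#   mutate_(ddp_model.module, torch.distributed.get_rank())
```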

What is the main purpose of this sync call? I referred to the docs, but I couldn’t get a clear picture.

Thank You,
Vibhatha.

Hey @Vibhatha_Abeykoon

That _sync_params call is there for two purposes:

  1. Intra-rank/process parameter sync: this only applies to the legacy single-process multi-device use case, where each process operates on multiple model replicas. This is not a recommended way to use DDP.
  2. Inter-rank/process buffer sync: this does not sync parameters, and it is skipped entirely if your model has no buffers (e.g., running_mean in BatchNorm layers); see the buffer check below.
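
A quick way to check whether case 2 applies to your model is to see whether it registers any buffers. A minimal illustration (the toy models are just for demonstration):

```python
import torch.nn as nn

with_bn = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
no_bn = nn.Sequential(nn.Linear(8, 8), nn.ReLU())

# BatchNorm registers running_mean / running_var (and num_batches_tracked)
# as buffers, so DDP will broadcast them from rank 0 in each forward pass.
print([name for name, _ in with_bn.named_buffers()])
# ['1.running_mean', '1.running_var', '1.num_batches_tracked']

# No buffers here, so the inter-rank buffer sync is skipped entirely.
print(list(no_bn.named_buffers()))  # []
```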

What is the main purpose of this sync call?

For many use cases, that _sync_params will be a no-op.
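
If your model does have buffers but you don't want them broadcast on every forward pass, you can disable that with the broadcast_buffers argument when constructing DDP. A minimal sketch, assuming the process group has already been initialized:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already run.
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))

# broadcast_buffers=False skips the per-forward buffer broadcast, so each
# rank keeps its own running statistics.
ddp_model = DDP(model, broadcast_buffers=False)
```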


@mrshenli Thank you for the response.

I see. The first point is clear to me.

About the second point: so this call exists to support the functionality of specific layers that register buffers, such as running_mean in BatchNorm layers. For general cases without buffers, it is skipped, so there won't be any such syncs.

Is this an assumption that we can make when we use DDP?

Yep, this is correct.
