I have a question about the DistributedDataParallel Module.
In the forward call of DDP, there is a
sync_param call which broadcasts the model parameters from rank 0 to all the other ranks to
keep the model state identical across all processes.
I want to clarify the actual use of this function call.
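For reference, this is roughly the kind of setup I have in mind (a minimal sketch on the gloo backend with a toy model; the actual model, loop, and hyperparameters are just placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)
    # The sync I am asking about happens inside each ddp_model(...) forward call.
    ddp_model = DDP(model)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(5):
        opt.zero_grad()
        out = ddp_model(torch.randn(8, 10))   # forward: the per-iteration sync runs here
        out.sum().backward()                  # backward: gradients are all-reduced here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```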
Think of a case where all processes start and initialize the model with the same set of known weights, so the weights are already uniform across processes. Is this sync useful in that case? Since it runs on every forward call, it could slow down training. Please correct me if I am wrong.
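For this first case, I mean something like the following (just a sketch; the fixed seed stands in for any scheme that gives every rank the same initial weights):

```python
# Every rank seeds identically before building the model, so the parameters
# already match across processes before DDP broadcasts anything from rank 0.
torch.manual_seed(42)
model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)
```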
Another case is where a user wants to mutate the model weights in each process after the all-reduce step. Think of an instance where such a mutation helps discover diversity during training, and the all-reduce step later ensembles that diversity. In such a case, would this sync_param call cancel the effect the user expects?
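The second case would be something like this (a rough sketch; the noise scale and the exact place the mutation is applied are just placeholders):

```python
# After opt.step() (i.e. after this iteration's gradient all-reduce), each rank
# perturbs its own copy of the weights slightly, so the replicas intentionally
# diverge until the next reduction.
with torch.no_grad():
    for p in ddp_model.parameters():
        p.add_(torch.randn_like(p) * 1e-3)
```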
What is the intended purpose of this sync call? I referred to the docs, but I couldn't get a clear picture.