Model and optimizer parameter synchronization at start

I am reading the page Distributed Data Parallel — PyTorch 2.5 documentation. Although it was written a while ago, I think the basics still apply. Looking at the example code on that page, I am wondering how the model and optimizer parameters of each rank are made exactly the same. Otherwise the training would likely not converge, right?

From the DDP Internal Design Doc:

Construction: The DDP constructor takes a reference to the local module, and broadcasts state_dict() from the process with rank 0 to all other processes in the group to make sure that all model replicas start from the exact same state.
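To make that concrete, here is a minimal sketch (not code from the docs) assuming a `torchrun` launch with the `gloo` backend on CPU. Each rank deliberately seeds its model differently, yet after wrapping in DDP the parameters match on every rank because the constructor broadcasts rank 0's state; the checksum check at the end is just an illustrative way to verify this.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK / WORLD_SIZE / MASTER_ADDR etc.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    # Deliberately seed each rank differently so the local models start out different.
    torch.manual_seed(rank)
    model = torch.nn.Linear(10, 10)

    # The DDP constructor broadcasts rank 0's parameters and buffers to all ranks,
    # so after this line every replica holds identical state.
    ddp_model = DDP(model)

    # Verify: gather a parameter checksum from every rank and print them on rank 0.
    checksum = sum(p.sum() for p in ddp_model.parameters()).reshape(1)
    gathered = [torch.zeros_like(checksum) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, checksum)
    if rank == 0:
        print("parameter checksums per rank:", [t.item() for t in gathered])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```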

Thanks. A follow-on question: why is this broadcast needed? If the random seed(s) are set to the same value, shouldn't the state of each rank be identical at the start?

The user would need to make sure the same seeds are used and that exactly the same sequence of calls into the pseudorandom number generator (PRNG) is made, which can be challenging. Broadcasting the state_dict from rank 0 is a simple solution.
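As a rough sketch of what "broadcast from rank 0" amounts to if done by hand outside of DDP (the function name `sync_initial_state` is illustrative, not a PyTorch API, and this is not how DDP is implemented internally):

```python
from typing import Optional

import torch
import torch.distributed as dist

def sync_initial_state(model: torch.nn.Module,
                       optimizer: Optional[torch.optim.Optimizer] = None) -> None:
    """Copy rank 0's model (and optionally optimizer) state onto every rank."""
    # Broadcast every parameter and buffer tensor in place from rank 0.
    for tensor in list(model.parameters()) + list(model.buffers()):
        dist.broadcast(tensor.data, src=0)

    # Optimizer state is typically empty before the first step; if it has been
    # populated (e.g. resumed from a checkpoint), ship it as a picklable object.
    if optimizer is not None:
        state = [optimizer.state_dict()]
        dist.broadcast_object_list(state, src=0)
        if dist.get_rank() != 0:
            optimizer.load_state_dict(state[0])
```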

What could cause differences in the calls to the PRNG across ranks? It would be good to have an idea of the sources of non-determinism. Is there any effort in PyTorch to remove non-determinism?

Conditional user code, e.g. shuffling dataset indices depending on the rank, but of course the possibilities for making calls into the PRNG are unlimited (see the toy sketch below).
Expecting users to guarantee that exactly the same calls into the PRNG are made is unreasonable, and a single broadcast of the state_dict at the beginning of training solves this with negligible overhead.
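A toy illustration of the point (plain PyTorch, no DDP): even with the same seed on every rank, rank-conditional code consumes a different number of random values, so whatever is initialized afterwards diverges across ranks.

```python
import torch

def init_weights_on(rank: int) -> torch.Tensor:
    torch.manual_seed(42)            # same seed on every "rank"
    if rank != 0:
        _ = torch.randperm(100)      # e.g. rank-dependent index shuffling
    return torch.randn(3)            # "weight init" now sees a different PRNG state

print(init_weights_on(0))            # differs from the line below despite the shared seed
print(init_weights_on(1))
```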

Different calls into the PRNG do not cause non-determinism per se, since the random numbers are still deterministic, just different across ranks.
If you want to produce deterministic results, please refer to the Reproducibility docs.
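The usual knobs from the Reproducibility docs, collected into a sketch; in a distributed run each process would call this with the same seed (or a rank-offset seed, depending on what you want to reproduce). The helper name `seed_everything` is illustrative, not a PyTorch API.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)   # raise an error on nondeterministic ops
    torch.backends.cudnn.benchmark = False     # disable nondeterministic cuDNN autotuning
    # Note: with CUDA, deterministic algorithms may also require setting the
    # CUBLAS_WORKSPACE_CONFIG environment variable (see the Reproducibility docs).
```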

Thanks. The Reproducibility doc appears to mainly address simple single-process apps. It would be good to have something similar for a real application in a distributed environment.