Different seeds, but weights are still initialized the same across processes in DDP

sio277 already gave the right answer. This behavior is expected.

DDP must keep the weights on all processes synchronized from the start: when it wraps your model, it broadcasts the parameters (and buffers) from rank 0 to every other process, overwriting whatever each process initialized locally. Starting from identical weights and applying gradients that are synchronized (all-reduced) every step means every replica computes the same update and stays identical, which is what makes distributed training mathematically equivalent to the sequential (single-process) version. Without that initial sync, the replicas would diverge and the distributed training would be mathematically wrong.
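As a concrete illustration of that broadcast, here is a minimal sketch (the file name, seeds, and model are just for illustration, not from your setup). Each rank seeds its RNG differently, builds the same model, and then wraps it in `DistributedDataParallel`; the "before" values differ across ranks, but the "after" values are identical, because the DDP constructor broadcasts rank 0's state to everyone.

```python
# check_sync.py -- launch with e.g. `torchrun --nproc_per_node=2 check_sync.py`
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" if each rank has its own GPU
    rank = dist.get_rank()

    # Different seed per process -> different local initialization before DDP.
    torch.manual_seed(1234 + rank)
    model = nn.Linear(4, 2)
    before = model.weight.detach().clone()

    # Wrapping in DDP broadcasts rank 0's parameters/buffers to all other ranks.
    ddp_model = DDP(model)
    after = ddp_model.module.weight.detach().clone()

    print(f"rank {rank} weight[0,0] before DDP: {before[0, 0].item():.6f}, "
          f"after DDP: {after[0, 0].item():.6f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You should see the "before" value differ per rank while the "after" value matches rank 0 on every process, which is exactly the synchronization described above.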
