DistributedDataParallel - Master starts without workers

Hi!

I’m implementing DistributedDataParallel in my code. However, when I start it with PyTorch’s launch module, one process begins training before the others have started. This is different from running without the launch module, where I see the processes wait for each other before starting the next epoch, and so on.

I’m using an implementation that mirrors this Medium article. I’ve been struggling with this issue for two days now, so any help would be extremely appreciated!

Thanks!

When torch.nn.parallel.DistributedDataParallel is initialized with the right distributed context, every iteration should happen in lockstep across all processes. If a single process runs ahead by itself, something is probably missing in the initialization.

Can you share a code snippet showing how you initialize all of them?
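
For reference, here is a minimal sketch of what initialization typically looks like when launching with torch.distributed.launch. The script name, model, and backend choice here are placeholders for illustration, not your actual setup:

```python
# minimal_ddp.py -- rough sketch, assuming launch via:
#   python -m torch.distributed.launch --nproc_per_node=N minimal_ddp.py
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each spawned process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # The launch utility sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # so the env:// rendezvous picks them up automatically. This call blocks
    # until all processes have joined the group.
    dist.init_process_group(backend="nccl", init_method="env://")

    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)

    model = nn.Linear(10, 10).to(device)  # placeholder model
    # Wrap the model only after init_process_group; device_ids ties this
    # replica to the GPU owned by this process.
    ddp_model = DDP(model, device_ids=[args.local_rank],
                    output_device=args.local_rank)

    # ... training loop goes here: each backward() synchronizes gradients
    # across ranks, which is what keeps the processes in lockstep.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If your script skips init_process_group (or each process ends up in its own group), the DDP wrapper has nothing to synchronize against, and one process can happily train on its own, which matches the symptom you describe.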