DistributedDataParallel - Master starts without workers


I’m implementing DistributedDataParallel in my code. However, when I start training with PyTorch’s launch module (`python -m torch.distributed.launch`), one process starts training before the others have begun. This is different from running without the launch module, where I see the processes wait on each other before starting the next epoch, etc.

I’m using an implementation that mirrors this Medium article. I’ve been struggling with this issue for two days now, so any help would be greatly appreciated!


When torch.nn.parallel.DistributedDataParallel is initialized with the right distributed context, every iteration should happen in lockstep across all processes. If a single process starts running ahead on its own, something is probably missing in the initialization.

Can you share a code snippet of how you initialize all of the processes?
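For reference, here is a minimal sketch of what a correct setup under the launch module typically looks like. This is not your code, just an illustration under assumptions: the model, dataset, and hyperparameters are placeholders, and the backend falls back to `gloo` on CPU-only machines. The launch module exports `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`, so `init_method="env://"` reads them from the environment:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # nccl for GPU training; gloo as a CPU fallback for this sketch.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")

    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Placeholder model; DDP hooks gradient all-reduce into backward().
    model = torch.nn.Linear(10, 1).to(device)
    model = DDP(model, device_ids=[local_rank] if use_cuda else None)

    # DistributedSampler shards the data per rank; without it every rank
    # iterates the full dataset and epochs no longer line up.
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # gradients all-reduced here
            opt.step()
        dist.barrier()  # make epoch boundaries explicit across ranks

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If your setup skips `init_process_group` on some ranks, or each rank builds its own `DataLoader` without a `DistributedSampler`, the processes are not actually coupled and one of them can happily train alone.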