DistributedDataParallel Freezes on Model Wrapping

This question has been asked before (DistributedDataParallel deadlock), but that thread no longer seems active.

I am running DistributedDataParallel on PyTorch 1.1, and with both the nccl and gloo backends my code freezes on the following line:

model = torch.nn.parallel.DistributedDataParallel(model)

I am not using the built-in DataLoader, so the known issue with num_workers not being set to 0 should not apply here.
Is there anywhere else this problem might be coming from?
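One possible cause (an assumption on my part, not confirmed from your post): the DDP constructor broadcasts rank 0's parameters to all ranks, so the wrapping line blocks until every rank has joined the process group and reached that same line. If init_process_group was never called, or one rank never gets there, the wrap hangs forever. A minimal single-process sketch with the gloo backend that completes:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# DDP's constructor performs a broadcast of the model's parameters, so the
# process group must be initialized first; otherwise this call deadlocks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 2)
ddp_model = DistributedDataParallel(model)  # returns immediately: world_size=1

dist.destroy_process_group()
```

With world_size > 1, every rank must execute both init_process_group and the DistributedDataParallel call, or all ranks will hang at the wrap.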


Same problem here, with PyTorch v1.0.1 and the NCCL backend. I am running one of the distributed examples from Ignite: https://github.com/pytorch/ignite/blob/master/examples/mnist/mnist_dist.py

Mine freezes on the line:
model = DistributedDataParallel(model, [args.gpu])

Here model is an instance of LeNet5 and args.gpu is 0.
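In the multi-GPU NCCL case, a frequent culprit (again an assumption, since the surrounding setup code is not shown) is not pinning each process to its own device before wrapping, so collectives from several ranks land on the same GPU and deadlock during the initial broadcast. A hedged sketch, with wrap_for_rank as a hypothetical helper name:

```python
import torch
from torch.nn.parallel import DistributedDataParallel

def wrap_for_rank(model, gpu):
    """Hypothetical per-rank setup: pin this process to its local GPU
    *before* constructing DDP, then wrap on that device only."""
    torch.cuda.set_device(gpu)          # pin NCCL ops to this rank's GPU
    model = model.cuda(gpu)             # parameters must live on that GPU
    return DistributedDataParallel(model, device_ids=[gpu])
```

This assumes the process group is already initialized and that gpu is the rank's local device index, not the global rank.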

I have opened a new issue to discuss this, since I am facing the same problem with DDP.