DistributedDataParallel deadlock

Hello,

I'm trying to use DistributedDataParallel to train a model on a cluster whose nodes each have 2 K80 GPUs, for a total of 4 GPUs. I'm basing my code on this example for distributed training: https://github.com/pytorch/examples/blob/master/imagenet/main.py.

My problem is that training apparently reaches a deadlock: it never terminates and never prints an error message, and the terminal freezes completely. I've seen the following in the PyTorch documentation:

If you plan on using this module with a nccl backend or a gloo
backend (that uses Infiniband), together with a DataLoader that uses
multiple workers, please change the multiprocessing start method to
forkserver (Python 3 only) or spawn. Unfortunately
Gloo (that uses Infiniband) and NCCL2 are not fork safe, and you will
likely experience deadlocks if you don’t change this setting.

And indeed I'm using the gloo backend, since DistributedDataParallel supports only gloo. But I tried with a single worker and I still have the same problem.

I've also seen in the topic "Distributed data parallel freezes without error message" that the PyTorch developers are currently working on the NCCL2 problem.

Does anybody know whether there is currently a solution to this problem? Is there any way to avoid the deadlock and make distributed training work? Thank you!


I am facing the same issue, but with the nccl backend.
It freezes with no error message at the line model = torch.nn.parallel.DistributedDataParallel(model).


I still don't have a solution. Since I'm trying to use DistributedDataParallel together with a DataLoader that uses multiple workers, I tried setting the multiprocessing start method to 'spawn' and to 'forkserver' (as suggested in the PyTorch documentation), but I'm still hitting a deadlock.
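For reference, this is the pattern the docs describe: the start method has to be switched before any CUDA/NCCL/Gloo state exists, i.e. at the very top of the entry point. A minimal sketch, using the standard multiprocessing module (torch.multiprocessing exposes the same set_start_method API, so it is a drop-in replacement for the import below); train() is a hypothetical placeholder:

```python
import multiprocessing as mp


def train():
    # Placeholder for the real training entry point: build the DataLoader
    # with num_workers > 0, wrap the model in DistributedDataParallel, etc.
    pass


if __name__ == '__main__':
    # Switch to 'forkserver' (or 'spawn') *before* any CUDA/NCCL/Gloo
    # state is created; forking after those libraries initialise is what
    # the documentation warns can deadlock.
    mp.set_start_method('forkserver', force=True)
    train()
```

Note that set_start_method raises RuntimeError if the start method was already fixed elsewhere, which is why it must run first (force=True is used here only to make the sketch re-runnable).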

Sorry for bumping this, but I have the same problem that @nadia described. I've been able to run the first example in https://pytorch.org/tutorials/intermediate/dist_tuto.html with both the gloo and nccl backends, which makes me believe this might not be an NCCL deadlock, but things still freeze as soon as I call model = torch.nn.parallel.DistributedDataParallel(model).


You mentioned "a single worker": note that num_workers=1 still forks one worker process. Setting num_workers=0 loads the data in-process, which should not produce the deadlock.

I'm getting the same hang with the PyTorch ImageNet distributed example.
It freezes at the line torch.nn.parallel.DistributedDataParallel(model).

any solutions?


I have opened a new issue to discuss this, since I am facing the same problem.

The more people who contribute to that discussion, the better. This is a significant problem, and the raised issue should make it visible to the PyTorch maintainers.


I am facing the same problem. DDP with mp.spawn is freezing randomly.

I actually found the cause of my problem of workers freezing at the end of training. I realised that the data partition from which mini-batches were created was imbalanced across workers. Because of this, one worker ended up with more mini-batches than the others and deadlocked during gradient synchronisation: it blocked on an all-reduce that the other workers never joined.

So one needs to ensure every worker sees the same number of mini-batches, for example by adjusting the batch size or the partition sizes to fit the number of workers.

Hopefully, we can have a solution for this in the next release based on this design: https://github.com/pytorch/pytorch/issues/38174
