DistributedDataParallel deadlock

Hello,

I'm trying to use DistributedDataParallel to train a model on a cluster whose nodes each have 2 K80 GPUs, for a total of 4 GPUs. I'm basing my code on this example for distributed training: https://github.com/pytorch/examples/blob/master/imagenet/main.py.

My problem is that training apparently reaches a deadlock: it never terminates and never prints an error message, and the terminal freezes completely. I've seen the following in the PyTorch documentation:

If you plan on using this module with a nccl backend or a gloo
backend (that uses Infiniband), together with a DataLoader that uses
multiple workers, please change the multiprocessing start method to
forkserver (Python 3 only) or spawn. Unfortunately
Gloo (that uses Infiniband) and NCCL2 are not fork safe, and you will
likely experience deadlocks if you don’t change this setting.

And indeed I'm using the gloo backend, since DistributedDataParallel supports only gloo. But I tried with a single worker and I still have the same problem.

I've also seen in the topic "Distributed data parallel freezes without error message" that the PyTorch developers are currently working on the NCCL2 problem.

Does anybody know whether there is currently a solution to this problem? Is there any way to avoid the deadlock and make distributed training work? Thank you!


I am facing the same issue, but with the nccl backend.
It freezes with no error message at the line model = torch.nn.parallel.DistributedDataParallel(model).


I still don't have a solution. Since I'm trying to use DistributedDataParallel together with a DataLoader that uses multiple workers, I tried setting the multiprocessing start method to 'spawn' and to 'forkserver' (as suggested in the PyTorch documentation), but I'm still hitting a deadlock.
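For reference, this is the pattern the docs describe: the start method has to be switched before any CUDA/NCCL/Gloo state exists, i.e. at the very top of the entry point. A minimal sketch, using the standard multiprocessing module (torch.multiprocessing exposes the same set_start_method API, so it is a drop-in replacement for the import below); train() is a hypothetical placeholder:

```python
import multiprocessing as mp


def train():
    # Placeholder for the real training entry point: build the DataLoader
    # with num_workers > 0, wrap the model in DistributedDataParallel, etc.
    pass


if __name__ == '__main__':
    # Switch to 'forkserver' (or 'spawn') *before* any CUDA/NCCL/Gloo
    # state is created; forking after those libraries initialise is what
    # the documentation warns can deadlock.
    mp.set_start_method('forkserver', force=True)
    train()
```

Note that set_start_method raises RuntimeError if the start method was already fixed elsewhere, which is why it must run first (force=True is used here only to make the sketch re-runnable).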

Sorry for bumping this, but I have the same problem that @nadia described. I've been able to run the first example in https://pytorch.org/tutorials/intermediate/dist_tuto.html with both the gloo and nccl backends, which makes me believe this might not be an NCCL deadlock, but things still freeze as soon as I call model = torch.nn.parallel.DistributedDataParallel(model).


You mentioned "a single worker": note that num_workers=1 still forks one worker process. Setting num_workers=0 loads the data in-process, which should not produce the deadlock.

I'm getting the same hang with the PyTorch ImageNet distributed example.
It freezes at the line torch.nn.parallel.DistributedDataParallel(model).

any solutions?


I have opened a new issue to discuss this, since I am facing the same problem.

The more people who contribute to that discussion, the better. This is a significant problem, and the raised issue should make it visible to the PyTorch maintainers.


I am facing the same problem. DDP with mp.spawn is freezing randomly.

I actually found the cause of my problem of workers freezing at the end of training. I realised that the data partition from which mini-batches were created was imbalanced across workers. Because of this, one worker ended up with more mini-batches than the others and deadlocked during gradient synchronisation: it blocked on an all-reduce that the other workers never joined.

So one needs to ensure every worker sees the same number of mini-batches, for example by adjusting the batch size or the partition sizes to fit the number of workers.

Hopefully, we can have a solution for this in the next release based on this design: https://github.com/pytorch/pytorch/issues/38174
