Distributed data parallel freezes without error message

Hi, here are my two cents. The deadlock seldom happened to me when I used Ethernet (only once, actually), and for me the hang happened in loss.backward(), I think. Could you add flush=True to your print statements and see whether it still hangs at the same place the next time it happens?
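
A minimal sketch of what flushed, per-process logging could look like. The `log` helper, its `rank` argument, and the timestamp format are hypothetical choices for illustration; only `print(..., flush=True)` comes from the suggestion above. Without the flush, buffered output can make a hang look as if it happened earlier than it really did.

```python
import sys
import time


def log(msg, rank=0, stream=None):
    """Print a timestamped, rank-tagged message and flush it immediately.

    Hypothetical helper: the name and arguments are assumptions, not an API
    from the thread.
    """
    stream = sys.stdout if stream is None else stream
    stamp = time.strftime("%Y-%m-%d_%H:%M:%S")
    line = "%s [rank %d] %s" % (stamp, rank, msg)
    # flush=True pushes the line out now, so it is not stuck in a buffer
    # if the process hangs right after this call.
    print(line, file=stream, flush=True)
    return line
```

Calling `log('ITER %d FORWARDED' % i, rank=rank)` in every process shows which rank stops printing first.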

Hi, I also have the same problem! Is there any solution to this? Thanks!

Is there any update for this?

I have a similar problem. I added flush=True to the print statements and found that the hang actually happens in loss.backward(). The strange thing is that it didn't happen in the first iteration.

My training code:

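    for epoch in range(args.epoch):
        log('EPOCH %d' % epoch)
        if args.distributed:
            # Presumably the usual per-epoch sampler reshuffle;
            # `train_sampler` is an assumed name.
            train_sampler.set_epoch(epoch)
        for i, data in enumerate(trainloader):
            log('ITER %d' % i)
            inputs, labels = Variable(data[0]), Variable(data[1])
            if args.use_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()
            log('ITER %d DATA LOADED' % i)

            outputs = net.forward(inputs)
            loss = criterion(outputs, labels)
            log('ITER %d FORWARDED' % i)

            # The zero_grad/backward/step calls are inferred from the log
            # messages; `optimizer` is an assumed name.
            optimizer.zero_grad()
            log('ITER %d ZERO_GRAD' % i)

            loss.backward()
            log('ITER %d BACKWARDED' % i)

            optimizer.step()
            log('ITER %d STEP' % i)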

The output is:

2018-04-17_10:19:18 EPOCH 0
2018-04-17_10:19:21 ITER 0
2018-04-17_10:19:21 ITER 0 DATA LOADED
2018-04-17_10:19:30 ITER 0 FORWARDED
2018-04-17_10:19:30 ITER 0 ZERO_GRAD
2018-04-17_10:19:32 ITER 0 BACKWARDED
2018-04-17_10:19:32 ITER 0 STEP
2018-04-17_10:19:32 ITER 1
2018-04-17_10:19:32 ITER 1 DATA LOADED
2018-04-17_10:19:33 ITER 1 FORWARDED
2018-04-17_10:19:33 ITER 1 ZERO_GRAD
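
Since the last line printed is `ITER 1 ZERO_GRAD`, the process is presumably stuck in the call that follows it. One hedged way to confirm this from inside the process is Python's standard `faulthandler` module, which can dump the stack of every thread when a step takes too long. The `watchdog` helper and its default timeout are assumptions for illustration, not part of the original training code.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def watchdog(timeout_s=300, file=None):
    """Dump all thread tracebacks if the wrapped block exceeds timeout_s.

    Hypothetical helper; the 300 second default is an arbitrary assumption.
    """
    file = sys.stderr if file is None else file
    # Arms a timer; exit=False only reports and leaves the process running.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        yield
    finally:
        # Disarm the timer once the block has finished.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping a single step, for example `with watchdog(): loss.backward()`, would print the Python stack of the stuck process instead of leaving it silent.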

You can try changing your batch size: my distributed training works with batch size 48 and freezes with batch size 32.

I am facing a similar issue and have therefore opened an issue on the PyTorch repo about it. If you are still facing the problem, it would be nice if you contributed to the discussion there so that the PyTorch maintainers are aware of this problem with DDP.


  • Ubuntu 18.04
  • PyTorch 1.6.0
  • CUDA 10.1

Actually, I am using the Docker image gemfield/pytorch:1.6.0-devel mentioned in https://github.com/DeepVAC/deepvac (same as the environment above), and I use PyTorch DDP (through the class DeepvacDDP in https://github.com/DeepVAC/deepvac/blob/master/deepvac/syszux_deepvac.py) to train my model. The code worked perfectly yesterday, but today, when I launched the training program again, DDP got stuck in loss.backward(), with CPU at 100% and GPU at 100%.
There has been no change to the code or to the Docker container since yesterday, except that the Ubuntu host received a system update today:

gemfield@ai03:~$ cat /var/log/apt/history.log | grep -C 3 nvidia

Start-Date: 2020-09-03  06:44:01
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-nvidia-440-5.4.0-45-generic:amd64 (5.4.0-45.49, automatic)
Upgrade: linux-modules-nvidia-440-generic-hwe-20.04:amd64 (5.4.0-42.46, 5.4.0-45.49)
End-Date: 2020-09-03  06:44:33

So the NVIDIA driver was updated from 440.64 to 440.100, and I think this information may be useful to somebody.

@smth @ptrblck was this issue ever solved? I am facing NCCL deadlock issues in DistributedDataParallel.

Also, I face this issue only with some particular architectures, and I don't understand what the architecture has to do with the NCCL deadlock.

Could you create a new topic including an executable code snippet to reproduce the issue as well as information about your setup (PyTorch, CUDA, cudnn, NCCL version, used GPU, OS etc.)?
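
A rough sketch of how the requested setup details could be gathered in one go. The `collect_env` name and the dict layout are assumptions for illustration; the PyTorch attributes it reads are guarded so the script still runs where PyTorch or NCCL is missing.

```python
import platform
import sys


def collect_env():
    """Gather OS, Python, PyTorch, CUDA, cuDNN, NCCL and GPU details.

    Hypothetical helper; the key names are illustrative assumptions.
    """
    info = {"os": platform.platform(), "python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()
    try:
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        info["nccl"] = None  # for example, a build without NCCL support
    info["gpus"] = [torch.cuda.get_device_name(i)
                    for i in range(torch.cuda.device_count())]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print("%s: %s" % (key, value), flush=True)
```

PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report and is usually the better thing to paste into a new topic.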

@ptrblck, the problem is that this issue is not reproducible. It is totally random whether the training will face a deadlock or not. For example, I had three network architectures, and the day before yesterday, two of the three were suffering from NCCL deadlock while the other one was not. Yesterday when I tried it again, I had this issue with only one of the architectures and not with the other. I didn’t change a single line in the code.

I am using PyTorch 1.5.0 on Ubuntu 18.04. I checked the NCCL and cuDNN versions with:

>>> torch.cuda.nccl.version()
>>> torch.backends.cudnn.version()

I am training my models on RTX 2080 Ti GPUs. In the multi-GPU setup I have tried 2, 4 and 8 GPUs, but the deadlock issue persists.

I am training image classification models and the training freezes on the line

images = images.cuda(non_blocking=True)

I have already experimented with non_blocking=False and I am sure that it is not the problem.
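
CUDA work is generally queued asynchronously, so the Python line where a process appears frozen may only be the first call that has to wait for the GPU, not the call that actually stalled. A minimal sketch, assuming the two environment variables `NCCL_DEBUG` and `CUDA_LAUNCH_BLOCKING` are honored by the installed NCCL and CUDA runtime; the helper name is hypothetical.

```python
import os


def enable_hang_diagnostics(env=None):
    """Set debugging variables; call before CUDA and NCCL are initialized.

    Hypothetical helper. Existing values are left untouched.
    """
    env = os.environ if env is None else env
    # Asks NCCL to print informational messages about its setup.
    env.setdefault("NCCL_DEBUG", "INFO")
    # Makes kernel launches synchronous, so a stall shows up at the call
    # that issued it. This slows training and is meant for debugging only.
    env.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    return env
```

The call belongs at the very top of the training script, before anything touches the GPU. The same variables can also be exported in the shell that launches the job.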

You could try to use the nightly binary with NCCL 2.7.6 and see if you are still facing this issue.

I installed pytorch-nightly using the following command:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch-nightly

The NCCL version is 2.7.6 now.

The training is working fine so far. Thanks for the help @ptrblck.

However, I want to know what was the issue that was triggering the NCCL deadlock.

It’s unclear to me that NCCL caused the deadlock. Without a code snippet to reproduce it, I cannot be very helpful in isolating the issue.

Hi, I have the same problem. So the fix is to update NCCL to 2.7.6 by installing pytorch-nightly.
But I used conda to install pytorch-nightly, and the NCCL version is still 2.4.8.
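
When the reported NCCL version does not match what was installed, it can help to print it from the exact environment that runs the training. Depending on the PyTorch release, `torch.cuda.nccl.version()` appears to return either a single integer code (such as 2408 for 2.4.8) or a tuple. The sketch below decodes both; the helper name is hypothetical, and the integer layout is an assumption based on NCCL's version-code convention.

```python
def format_nccl_version(version):
    """Turn an NCCL version code or tuple into a dotted string.

    Hypothetical helper; the integer decoding is an assumption about
    NCCL's version-code layout.
    """
    if isinstance(version, (tuple, list)):
        return ".".join(str(part) for part in version[:3])
    code = int(version)
    if code < 10000:
        # Older layout: major * 1000 + minor * 100 + patch
        major, rest = divmod(code, 1000)
    else:
        # Newer layout: major * 10000 + minor * 100 + patch
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return "%d.%d.%d" % (major, minor, patch)
```

Running `print(torch.__version__, format_nccl_version(torch.cuda.nccl.version()))` inside the activated conda environment shows whether the nightly build is really the one being imported.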

Hi @Feywell, I did not quite get what issue you are facing.

If you have the same problem (i.e., training freezes because of some deadlock), then try upgrading PyTorch to the latest pytorch-nightly. It uses the NCCL submodule version 2.7.6.

To install pytorch-nightly from conda, refer to the official PyTorch website.

If the issue is that you installed pytorch-nightly using conda but the NCCL version is still 2.4.8, then can you please share the command that you are using to install pytorch-nightly from conda?

Thanks for your reply.
I updated PyTorch to 1.7 successfully, and the NCCL version is 2.7.6 now. But my code still hangs without any errors.
It is weird.

@iamshnik @ptrblck
This is my problem in detail
Is it the same problem?

@Feywell it seems like you are facing the same issue. In my case NCCL 2.7.6 resolved the issue and I was able to train my models. In fact, I had almost the same system settings:

Ubuntu 18.04
CUDA 10.2
pytorch-nightly 1.7
python 3.7

Are you using pytorch-nightly or just pytorch? If you are not using pytorch-nightly, please try using that.

I use pytorch-nightly 1.7 and NCCL 2.7.6, but the problem still exists. I cannot run distributed training.

Can you help reproduce this issue? Maybe try running part of your code on Google Colab and share the link if you face the same problem again.

OK, I will try to reproduce this issue on Colab.