Distributed data parallel freezes without error message

Hello,

I’m trying to use distributed data parallel to train a ResNet model on multiple GPUs across multiple nodes. The script is adapted from the ImageNet example code. After the script starts, it builds the module on all the GPUs, but it freezes when it tries to copy the data onto the GPUs.

While it is frozen, memory for the model has been allocated on all the GPUs, but GPU utilization is 0% and stays at 0% for a long time. The output file contains this complaint:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Here are the scripts and the corresponding output:

1. Testing the distributed package:

> import torch
> import torch.distributed as dist
> dist.init_process_group(backend='gloo',
>                               init_method = 'file:///home/xqding/tmp/pytorch_dist/shared_file',
>                               world_size = 4)
> print('Hello from process {} (out of {})!'.format(
> 	torch.distributed.get_rank(), torch.distributed.get_world_size()))
> x = torch.Tensor([torch.distributed.get_rank()])
> torch.distributed.all_reduce(x)
> print("value of x: {}".format(x))

This script works fine and this is the output:

Hello from process 3 (out of 4)!
Hello from process 2 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]

value of x:
6
[torch.FloatTensor of size 1]

Hello from process 1 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]

Hello from process 0 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]
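
As a sanity check on the output above: all_reduce sums by default, so with four ranks each contributing its own rank id, every process should end up with 0 + 1 + 2 + 3 = 6. The sketch below only reproduces that arithmetic in plain Python. It does not start a process group, and `simulated_all_reduce_sum` is a hypothetical helper, not part of torch.distributed.

```python
def simulated_all_reduce_sum(per_rank_values):
    """Return the value each rank would hold after a sum all-reduce."""
    total = sum(per_rank_values)
    return [total for _ in per_rank_values]


world_size = 4
# Each rank contributes its own rank id, as in the test script.
contributions = [float(rank) for rank in range(world_size)]
results = simulated_all_reduce_sum(contributions)

for rank, value in enumerate(results):
    print("rank {}: value of x = {}".format(rank, value))
```

If the real script printed anything other than 6 on some rank, that would point at the process group itself rather than at the training loop.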

2. Resnet model using distributed data parallel

> import torch
> import torch.nn as nn
> import torch.optim as optim
> from torch.autograd import Variable
> from read_data import *
> from torch.utils.data import Dataset, DataLoader
> from torch.utils.data.distributed import DistributedSampler
> from torch.nn.parallel import DistributedDataParallel
> from torchvision import models, transforms, utils
> 
> dist = torch.distributed.init_process_group(backend = 'gloo',
>                                             init_method = 'file:///home/xin/shared_file',
>                                             world_size = 3)
> print("Rank:", torch.distributed.get_rank())
> 
> net = models.ResNet(models.resnet.BasicBlock, [3, 4, 6, 3],
>                       num_classes = 100)
> net = torch.nn.parallel.DistributedDataParallel(net.cuda())
> criterion = nn.CrossEntropyLoss().cuda()
> optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
> 
> train_data = Dataset(.....)
> train_sampler = DistributedSampler(train_data, num_replicas= 3, rank = torch.distributed.get_rank())
> train_loader = DataLoader(train_data, batch_size = 300,
>                           sampler = train_sampler,
>                           num_workers=5, pin_memory = True)
> num_epoches = 5
> for epoch in range(num_epoches):
>     train_sampler.set_epoch(epoch)
>     running_loss = 0.0
>     for i, data in enumerate(train_loader, 0):
>         print("Step:", i)
>         # get the inputs
>         inputs = data['image']
>         labels = data['labels']
>         print("Here 1")
> 
>         # wrap them in Variable
>         labels = labels.cuda(async = True)
>         print("Here 2")
>         inputs_var = torch.autograd.Variable(inputs)
>         print("Here 3")
>         labels_var = torch.autograd.Variable(labels)
> 
>         # zero the parameter gradients
>         optimizer.zero_grad()
>         print("Here 4")
>         # forward + backward + optimize
>         outputs = net(inputs_var)
>         print("Here 5")
>         loss=criterion(outputs,labels_var)
>         loss.backward()
>         optimizer.step()

The output is like this:

Rank: 0
Rank: 1
Rank: 2
Step: 0
Step: 0
Step: 0
Here 1
Here 1
Here 1

It never reaches print("Here 2"), which means it freezes at the line labels = labels.cuda(async = True).

While it is frozen, the GPU status on the corresponding nodes is:

Thu Sep 28 12:44:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     On   | 00000000:02:00.0 Off |                  N/A |
| 26%   37C    P2    41W / 180W |    420MiB /  4038MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     On   | 00000000:82:00.0 Off |                  N/A |
| 26%   30C    P2    41W / 180W |    354MiB /  4038MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19947    C   /home/xin/apps/anaconda3/bin/python         409MiB |
|    1     19947    C   /home/xin/apps/anaconda3/bin/python         343MiB |
+-----------------------------------------------------------------------------+

Does anyone have some idea what is going on here?
Thanks.

I found where the problem is.
Before running labels = labels.cuda(async = True), labels has to be converted into a torch Variable: labels = torch.autograd.Variable(labels).

There might be an NCCL deadlock happening in the distributed setting (which is why you saw a freeze). We identified this last week, and I am issuing fixes for it.

Hi, here are my 2 cents. The deadlock has actually seldom happened to me when using Ethernet (only once, in fact). And for me, I think the hang happened in loss.backward(). Could you add flush=True to your print statements and see whether it still hangs at the same place the next time it happens?
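
Some background on why flush=True matters here: when stdout is redirected to a file (as is typical under srun), Python buffers output in blocks, so the last few prints before a hang may never reach the file and the hang appears to be earlier than it really is. A minimal sketch of a flushing log helper follows. The name `log` and its `rank` and `stream` parameters are assumptions for illustration; in a real run the rank would come from torch.distributed.get_rank().

```python
import sys
import time


def log(message, rank=0, stream=None):
    """Print a timestamped, rank-tagged line and flush it immediately."""
    stream = sys.stdout if stream is None else stream
    stamp = time.strftime("%Y-%m-%d_%H:%M:%S")
    # flush=True pushes the line out now instead of leaving it in a buffer.
    print("{} [rank {}] {}".format(stamp, rank, message), file=stream, flush=True)


log("before backward", rank=0)
```

Tagging each line with the rank also makes it easier to see whether all processes stop at the same statement or only some of them.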

Hi, I also have the same problem! Is there any solution to this? Thanks!

Is there any update on this?

I have a similar problem. I added flush=True to the print statements and found that the hang actually happens in loss.backward(). The strange thing is that it doesn't happen on the first iteration.

My training code:

    for epoch in range(args.epoch):
        log('EPOCH %d' % epoch)
        if args.distributed:
            train_sampler.set_epoch(epoch)
        for i, data in enumerate(trainloader):
            log('ITER %d' % i)
            net.train()
            inputs, labels = Variable(data[0]), Variable(data[1])
            if args.use_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()
            log('ITER %d DATA LOADED' % i)

            outputs = net.forward(inputs)
            loss = criterion(outputs, labels)
            log('ITER %d FORWARDED' % i)

            optimizer.zero_grad()
            log('ITER %d ZERO_GRAD' % i)

            loss.backward()
            log('ITER %d BACKWARDED' % i)

            optimizer.step()
            log('ITER %d STEP' % i)

The output is:

2018-04-17_10:19:18 EPOCH 0
2018-04-17_10:19:21 ITER 0
2018-04-17_10:19:21 ITER 0 DATA LOADED
2018-04-17_10:19:30 ITER 0 FORWARDED
2018-04-17_10:19:30 ITER 0 ZERO_GRAD
2018-04-17_10:19:32 ITER 0 BACKWARDED
2018-04-17_10:19:32 ITER 0 STEP
2018-04-17_10:19:32 ITER 1
2018-04-17_10:19:32 ITER 1 DATA LOADED
2018-04-17_10:19:33 ITER 1 FORWARDED
2018-04-17_10:19:33 ITER 1 ZERO_GRAD
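
When log lines stop like this, one way to confirm which statement is blocked is to have Python dump its stack after a timeout. The sketch below uses the standard-library faulthandler module; `hang_watchdog` and the 120-second timeout are assumptions for illustration, not something used in this thread. It only shows Python frames, so a hang inside native code would show up as the Python line that called into it (for example loss.backward()).

```python
import contextlib
import faulthandler
import tempfile


@contextlib.contextmanager
def hang_watchdog(timeout_s, dump_file):
    """Dump every thread's Python stack to dump_file if the block outlasts timeout_s."""
    # dump_file must be a real file with a file descriptor, e.g. a per-rank log.
    faulthandler.dump_traceback_later(timeout_s, file=dump_file, exit=False)
    try:
        yield
    finally:
        # The block finished or raised: disarm the pending dump.
        faulthandler.cancel_dump_traceback_later()


# Usage sketch; a temporary file stands in for a per-rank log file.
with tempfile.TemporaryFile() as dump_file:
    with hang_watchdog(timeout_s=120, dump_file=dump_file):
        pass  # the suspect step goes here, e.g. loss.backward()
```

Writing to one file per rank keeps the dumps from different processes from interleaving.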

You can try changing your batch size: my distributed training works with batch size 48 and freezes with batch size 32.
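
The thread does not establish why the batch size would matter. One thing that is cheap to rule out is ranks running different numbers of iterations: the gradient all-reduce in DDP waits for every rank during backward, so a rank that runs out of batches early leaves the others blocked. The sketch below computes how many batches each rank should see when DistributedSampler pads the dataset evenly (its default behaviour, to my understanding). `batches_per_rank` is a hypothetical helper, and the dataset size and world size are made-up values.

```python
import math


def batches_per_rank(dataset_size, world_size, batch_size, drop_last=False):
    """Batches each rank iterates over when the sampler pads the dataset evenly."""
    # Every rank gets ceil(dataset_size / world_size) samples; a few indices
    # are repeated so that the split comes out even.
    samples_per_rank = math.ceil(dataset_size / world_size)
    if drop_last:
        # DataLoader(drop_last=True) discards the final short batch.
        return samples_per_rank // batch_size
    return math.ceil(samples_per_rank / batch_size)


for batch_size in (32, 48):
    count = batches_per_rank(dataset_size=50000, world_size=4, batch_size=batch_size)
    print("batch size {}: {} batches per rank".format(batch_size, count))
```

Comparing this number with len(train_loader) printed on each rank shows quickly whether the ranks agree. If they do, the cause lies elsewhere.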

I am facing a similar issue, so I opened an issue on the PyTorch repo about it. If you are still facing the problem, it would be good to contribute to the discussion there so that the PyTorch maintainers are aware of this problem with DDP.

Env:

  • Ubuntu 18.04
  • PyTorch 1.6.0
  • CUDA 10.1

Actually, I am using the Docker image gemfield/pytorch:1.6.0-devel, as described in https://github.com/DeepVAC/deepvac (same as the env above), and I use PyTorch DDP (through the DeepvacDDP class in https://github.com/DeepVAC/deepvac/blob/master/deepvac/syszux_deepvac.py) to train my model. The code worked perfectly yesterday. But today, when I launched the training program again, DDP got stuck in loss.backward(), with CPU at 100% and GPU at 100%.
There has been no change to the code or the Docker container since yesterday, except that the Ubuntu host got a system update today:

gemfield@ai03:~$ cat /var/log/apt/history.log | grep -C 3 nvidia

Start-Date: 2020-09-03  06:44:01
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-nvidia-440-5.4.0-45-generic:amd64 (5.4.0-45.49, automatic)
Upgrade: linux-modules-nvidia-440-generic-hwe-20.04:amd64 (5.4.0-42.46, 5.4.0-45.49)
End-Date: 2020-09-03  06:44:33

Evidently, the NVIDIA driver got updated from 440.64 to 440.100. I think this info may be useful to somebody.
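
For anyone who wants to run the same check across several hosts, here is a small sketch that pulls NVIDIA-related entries out of apt history text. It assumes the stanza layout shown in the log above (blank-line-separated blocks of "Key: value" lines); `nvidia_apt_changes` is a hypothetical helper, and the sample text is the excerpt from this post.

```python
def nvidia_apt_changes(history_text):
    """Return (start_date, action, packages) for apt stanzas that mention nvidia."""
    changes = []
    for stanza in history_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value.strip()
        for action in ("Install", "Upgrade", "Remove"):
            packages = fields.get(action, "")
            if "nvidia" in packages:
                changes.append((fields.get("Start-Date"), action, packages))
    return changes


sample = """Start-Date: 2020-09-03  06:44:01
Commandline: /usr/bin/unattended-upgrade
Install: linux-modules-nvidia-440-5.4.0-45-generic:amd64 (5.4.0-45.49, automatic)
Upgrade: linux-modules-nvidia-440-generic-hwe-20.04:amd64 (5.4.0-42.46, 5.4.0-45.49)
End-Date: 2020-09-03  06:44:33
"""

for start, action, packages in nvidia_apt_changes(sample):
    print(start, action, packages)
```

On a real host the input would be the contents of /var/log/apt/history.log.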

@smth @ptrblck was this issue ever solved? I am facing NCCL deadlock issues in DistributedDataParallel.

Also, I face this issue only with some particular architectures, and I don’t understand what the architecture has to do with the NCCL deadlock.

Could you create a new topic including an executable code snippet to reproduce the issue as well as information about your setup (PyTorch, CUDA, cudnn, NCCL version, used GPU, OS etc.)?

@ptrblck, the problem is that this issue is not reproducible. It is totally random whether the training will face a deadlock or not. For example, I had three network architectures, and the day before yesterday, two of the three were suffering from NCCL deadlock while the other one was not. Yesterday when I tried it again, I had this issue with only one of the architectures and not with the other. I didn’t change a single line in the code.

I am using PyTorch 1.5.0 on Ubuntu 18.04. Also,

>>> torch.cuda.nccl.version()
2408
>>> torch.backends.cudnn.version()
7605
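
For readers comparing these numbers with release versions: as far as I know, both values are packed integers, so 2408 is NCCL 2.4.8 and 7605 is cuDNN 7.6.5. The sketch below decodes them. The helper names are hypothetical, and the encodings are my understanding of the NCCL and cuDNN headers (NCCL switched to a wider encoding from 2.9 on). Newer PyTorch releases may return the NCCL version as a tuple already, in which case no decoding is needed.

```python
def decode_nccl_version(code):
    """Turn an integer NCCL version code into (major, minor, patch)."""
    if code < 10000:
        # Older encoding (up to 2.8.x): major*1000 + minor*100 + patch.
        return code // 1000, (code % 1000) // 100, code % 100
    # Newer encoding (2.9 and later): major*10000 + minor*100 + patch.
    return code // 10000, (code % 10000) // 100, code % 100


def decode_cudnn_version(code):
    """cuDNN 7.x/8.x encoding: major*1000 + minor*100 + patch."""
    return code // 1000, (code % 1000) // 100, code % 100


# The two values reported above.
print(decode_nccl_version(2408))
print(decode_cudnn_version(7605))
```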

I am training my models on RTX 2080 Ti GPUs, and in the multi-GPU setup I have tried using 2, 4, and 8 GPUs, but the deadlock issue persists.

I am training image classification models, and the training freezes on the line:

images = images.cuda(non_blocking=True)

I have already experimented with non_blocking=False and I am sure that it is not the problem.

You could try to use the nightly binary with NCCL 2.7.6 and see if you are still facing this issue.

I installed pytorch-nightly using the following command:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch-nightly

The NCCL version is 2.7.6 now.

The training is working fine so far. Thanks for the help @ptrblck.

However, I would like to know what was triggering the NCCL deadlock.

It’s unclear to me that NCCL caused the deadlock. Without a code snippet to reproduce it, I cannot be very helpful in isolating the issue.

Hi~ I have the same problem. So I need to update NCCL to 2.7.6 by installing pytorch-nightly.
But I used conda to install pytorch-nightly, and the NCCL version is still 2.4.8.

Hi @Feywell, I did not quite get what issue you are facing.

If you have the same problem (i.e., training freezes because of some deadlock), then try upgrading PyTorch to the latest pytorch-nightly. It uses NCCL submodule version 2.7.6.

To install pytorch-nightly from conda, refer to the official PyTorch website.

If the issue is that you installed pytorch-nightly using conda but the NCCL version is still 2.4.8, could you please share the command you used to install pytorch-nightly from conda?

Thanks for your reply.
I updated PyTorch to 1.7 successfully, and the NCCL version is 2.7.6 now. But my code still hangs without errors.
It is weird.

@iamshnik @ptrblck
This is my problem in detail
Is it the same problem?

@Feywell it seems like you are facing the same issue. In my case, NCCL 2.7.6 resolved the issue and I was able to train my models. In fact, I had almost the same system settings:

Ubuntu 18.04
2080ti
CUDA 10.2
pytorch-nightly 1.7
python 3.7

Are you using pytorch-nightly or just pytorch? If you are not using pytorch-nightly, please try using that.