Distributed data parallel freezes without error message


(Xinqiang Ding) #1

Hello,

I’m trying to use distributed data parallel to train a ResNet model on multiple GPUs across multiple nodes. The script is adapted from the ImageNet example code. After the script starts, it builds the model on all the GPUs, but it freezes when it tries to copy the data onto the GPUs.

During the freeze, memory for the model has been allocated on all the GPUs, but the GPU utilization is 0% and stays at 0% for a long time. In the output file, it complains:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Here are the scripts and the corresponding output:

1. Testing the distributed package:

> import torch
> import torch.distributed as dist
> dist.init_process_group(backend='gloo',
>                               init_method = 'file:///home/xqding/tmp/pytorch_dist/shared_file',
>                               world_size = 4)
> print('Hello from process {} (out of {})!'.format(
> 	torch.distributed.get_rank(), torch.distributed.get_world_size()))
> x = torch.Tensor([torch.distributed.get_rank()])
> torch.distributed.all_reduce(x)
> print("value of x: {}".format(x))

This script works fine and this is the output:

Hello from process 3 (out of 4)!
Hello from process 2 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]

value of x:
6
[torch.FloatTensor of size 1]

Hello from process 1 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]

Hello from process 0 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]

2. ResNet model using distributed data parallel

> import torch
> import torch.nn as nn
> import torch.optim as optim
> from torch.autograd import Variable
> from read_data import *
> from torch.utils.data import Dataset, DataLoader
> from torch.utils.data.distributed import DistributedSampler
> from torch.nn.parallel import DistributedDataParallel
> from torchvision import models, transforms, utils
> 
> dist = torch.distributed.init_process_group(backend = 'gloo',
>                                             init_method = 'file:///home/xin/shared_file',
>                                             world_size = 3)
> print("Rank:", torch.distributed.get_rank())
> 
> net = models.ResNet(models.resnet.BasicBlock, [3, 4, 6, 3],
>                       num_classes = 100)
> net = torch.nn.parallel.DistributedDataParallel(net.cuda())
> criterion = nn.CrossEntropyLoss().cuda()
> optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
> 
> train_data = Dataset(.....)
> train_sampler = DistributedSampler(train_data, num_replicas= 3, rank = torch.distributed.get_rank())
> train_loader = DataLoader(train_data, batch_size = 300,
>                           sampler = train_sampler,
>                           num_workers=5, pin_memory = True)
> num_epoches = 5
> for epoch in range(num_epoches):
>     train_sampler.set_epoch(epoch)
>     running_loss = 0.0
>     for i, data in enumerate(train_loader, 0):
>         print("Step:", i)
>         # get the inputs
>         inputs = data['image']
>         labels = data['labels']
>         print("Here 1")
> 
>         # wrap them in Variable
>         labels = labels.cuda(async = True)
>         print("Here 2")
>         input_var = torch.autograd.Variable(inputs)
>         print("Here 3")
>         labels_var = torch.autograd.Variable(labels)
> 
>         # zero the parameter gradients
>         optimizer.zero_grad()
>         print("Here 4")
>         # forward + backward + optimize
>         outputs = net(input_var)
>         print("Here 5")
>         loss=criterion(outputs,labels_var)
>         loss.backward()
>         optimizer.step()

The output is like this:

Rank: 0
Rank: 1
Rank: 2
Step: 0
Step: 0
Step: 0
Here 1
Here 1
Here 1

It never reaches the print("Here 2") call, which means it freezes at the line labels = labels.cuda(async = True).

While it is frozen, the GPU status on the corresponding nodes is:

Thu Sep 28 12:44:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     On   | 00000000:02:00.0 Off |                  N/A |
| 26%   37C    P2    41W / 180W |    420MiB /  4038MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980     On   | 00000000:82:00.0 Off |                  N/A |
| 26%   30C    P2    41W / 180W |    354MiB /  4038MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     19947    C   /home/xin/apps/anaconda3/bin/python         409MiB |
|    1     19947    C   /home/xin/apps/anaconda3/bin/python         343MiB |
+-----------------------------------------------------------------------------+

Does anyone have any idea what is going on here?
Thanks.


(Xinqiang Ding) #2

I found where the problem is.
Before running labels = labels.cuda(async = True), labels has to be converted into a torch Variable: labels = torch.autograd.Variable(labels).
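
For reference, here is a minimal sketch of the reordered loop body that stopped the hang for me, using the same variable names as the script above (note that newer PyTorch releases rename the async argument to non_blocking and no longer need the Variable wrapper):

    # wrap the CPU tensors in Variables first ...
    input_var = torch.autograd.Variable(inputs)
    labels_var = torch.autograd.Variable(labels)

    # ... and only then move the labels to the GPU; calling .cuda(async = True)
    # directly on the raw tensor is the step that froze for me
    labels_var = labels_var.cuda(async = True)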


#3

There might be an NCCL deadlock happening in the distributed setting (which is why you saw a freeze). We identified this last week. I am issuing fixes for it.


DistributedDataParallel deadlock
(Ailing Zhang) #4

Hi, here are my 2 cents. The deadlock actually seldom happened to me when I used Ethernet (only once, actually). And for me the hang happened in loss.backward(), I think. Could you add flush=True to your print statements and see whether it still hangs at the same place the next time it happens?
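
For example, with the prints in your loop above (buffered stdout can make the log stop a few lines before the real hang point, so the unbuffered version is more trustworthy):

    print("Here 1", flush=True)
    print("Here 2", flush=True)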


(Evangelos Kazakos) #5

Hi, I also have the same problem! Is there any solution to this? Thanks!


(Yuanxun Li) #6

Is there any update on this?

I have a similar problem. I added flush=True to the print statements and found that the hang actually happens in loss.backward(). The strange thing is that it doesn't happen at the first iteration.

My training code:

    for epoch in range(args.epoch):
        log('EPOCH %d' % epoch)
        if args.distributed:
            train_sampler.set_epoch(epoch)
        for i, data in enumerate(trainloader):
            log('ITER %d' % i)
            net.train()
            inputs, labels = Variable(data[0]), Variable(data[1])
            if args.use_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()
            log('ITER %d DATA LOADED' % i)

            outputs = net.forward(inputs)
            loss = criterion(outputs, labels)
            log('ITER %d FORWARDED' % i)

            optimizer.zero_grad()
            log('ITER %d ZERO_GRAD' % i)

            loss.backward()
            log('ITER %d BACKWARDED' % i)

            optimizer.step()
            log('ITER %d STEP' % i)

The output is:

2018-04-17_10:19:18 EPOCH 0
2018-04-17_10:19:21 ITER 0
2018-04-17_10:19:21 ITER 0 DATA LOADED
2018-04-17_10:19:30 ITER 0 FORWARDED
2018-04-17_10:19:30 ITER 0 ZERO_GRAD
2018-04-17_10:19:32 ITER 0 BACKWARDED
2018-04-17_10:19:32 ITER 0 STEP
2018-04-17_10:19:32 ITER 1
2018-04-17_10:19:32 ITER 1 DATA LOADED
2018-04-17_10:19:33 ITER 1 FORWARDED
2018-04-17_10:19:33 ITER 1 ZERO_GRAD