Hello,
I’m trying to use DistributedDataParallel to train a ResNet model on multiple GPUs across multiple nodes. The script is adapted from the ImageNet example code. After the script starts, it builds the model on all the GPUs, but it freezes when it tries to copy the data onto the GPUs.
During the freeze, memory for the model has been allocated on all the GPUs, but the GPU utilization is 0% and stays at 0% for a long time. In the output file, it complains:
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Here are the scripts and the corresponding output:
1. Testing the distributed package:
> import torch
> import torch.distributed as dist
>
> dist.init_process_group(backend = 'gloo',
>                         init_method = 'file:///home/xqding/tmp/pytorch_dist/shared_file',
>                         world_size = 4)
> print('Hello from process {} (out of {})!'.format(
>     torch.distributed.get_rank(), torch.distributed.get_world_size()))
> x = torch.Tensor([torch.distributed.get_rank()])
> torch.distributed.all_reduce(x)
> print("value of x: {}".format(x))
This script works fine; the all-reduced value of 6 is expected, since all_reduce sums the per-rank values 0 + 1 + 2 + 3. This is the output:
Hello from process 3 (out of 4)!
Hello from process 2 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]value of x:
6
[torch.FloatTensor of size 1]Hello from process 1 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]Hello from process 0 (out of 4)!
value of x:
6
[torch.FloatTensor of size 1]
2. ResNet model using DistributedDataParallel:
> import torch
> import torch.nn as nn
> import torch.optim as optim
> from torch.autograd import Variable
> from read_data import *
> from torch.utils.data import Dataset, DataLoader
> from torch.utils.data.distributed import DistributedSampler
> from torch.nn.parallel import DistributedDataParallel
> from torchvision import models, transforms, utils
>
> torch.distributed.init_process_group(backend = 'gloo',
>                                      init_method = 'file:///home/xin/shared_file',
>                                      world_size = 3)
> print("Rank:", torch.distributed.get_rank())
>
> net = models.ResNet(models.resnet.BasicBlock, [3, 4, 6, 3],
>                     num_classes = 100)
> net = DistributedDataParallel(net.cuda())
> criterion = nn.CrossEntropyLoss().cuda()
> optimizer = optim.SGD(net.parameters(), lr = 0.1, momentum = 0.9)
>
> train_data = Dataset(.....)
> train_sampler = DistributedSampler(train_data, num_replicas = 3,
>                                    rank = torch.distributed.get_rank())
> train_loader = DataLoader(train_data, batch_size = 300,
>                           sampler = train_sampler,
>                           num_workers = 5, pin_memory = True)
> num_epochs = 5
> for epoch in range(num_epochs):
>     train_sampler.set_epoch(epoch)
>     running_loss = 0.0
>     for i, data in enumerate(train_loader, 0):
>         print("Step:", i)
>         # get the inputs
>         inputs = data['image']
>         labels = data['labels']
>         print("Here 1")
>
>         # move the labels onto the GPU and wrap everything in Variables
>         labels = labels.cuda(async = True)
>         print("Here 2")
>         inputs_var = torch.autograd.Variable(inputs)
>         print("Here 3")
>         labels_var = torch.autograd.Variable(labels)
>
>         # zero the parameter gradients
>         optimizer.zero_grad()
>         print("Here 4")
>
>         # forward + backward + optimize
>         outputs = net(inputs_var)
>         print("Here 5")
>         loss = criterion(outputs, labels_var)
>         loss.backward()
>         optimizer.step()
The output is like this:
Rank: 0
Rank: 1
Rank: 2
Step: 0
Step: 0
Step: 0
Here 1
Here 1
Here 1
It never reaches print("Here 2"), which means it is freezing at the command labels = labels.cuda(async = True).
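To narrow this down, here is a minimal sketch of an isolation test I plan to run next on one of the nodes, outside of DistributedDataParallel, to check whether a plain host-to-device copy also hangs (the tensor shape is just a placeholder, not my real batch):

> import torch
>
> # Minimal isolation test: if this plain pinned-memory copy also hangs
> # on the same node, the problem is the CUDA copy itself rather than
> # anything DistributedDataParallel is doing.
> x = torch.randn(300, 3, 224, 224).pin_memory()
> y = x.cuda(async = True)    # same call pattern as in the training loop
> torch.cuda.synchronize()    # force the asynchronous copy to complete
> print("copy finished:", y.size())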
While it is freezing, the GPU status on the corresponding nodes is:
Thu Sep 28 12:44:50 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69 Driver Version: 384.69 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 On | 00000000:02:00.0 Off | N/A |
| 26% 37C P2 41W / 180W | 420MiB / 4038MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 980 On | 00000000:82:00.0 Off | N/A |
| 26% 30C P2 41W / 180W | 354MiB / 4038MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19947 C /home/xin/apps/anaconda3/bin/python 409MiB |
| 1 19947 C /home/xin/apps/anaconda3/bin/python 343MiB |
+-----------------------------------------------------------------------------+
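In case it helps, here is a minimal sketch of how I could dump the Python stack of the hung processes, using the standard-library faulthandler module (the choice of SIGUSR1 is arbitrary):

> import faulthandler, signal
>
> # Register a handler so that `kill -USR1 <pid>` makes the process print
> # the Python traceback of every thread to stderr while it is hung.
> faulthandler.register(signal.SIGUSR1)

With this near the top of the training script, sending SIGUSR1 to the frozen PID (19947 in the table above) should show exactly which call each process is blocked in.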
Does anyone have any idea what is going on here?
Thanks.