[Solved] Problem only with DistributedDataParallel

ddeeppnneett · April 11, 2019, 10:53am

With torch.nn.DataParallel on a single GPU, everything works fine:

        os.environ["CUDA_VISIBLE_DEVICES"] = '0'
        model = resnet34()
        model = Tripletnet(model)
        model = torch.nn.DataParallel(model).cuda()

But when I switch to DistributedDataParallel with multiple GPUs, I get an error:

        model = resnet34()
        model = Tripletnet(model)
        os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'
        model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[0,1,2,3])

        File "/home/slrum/code03/main02.py", line 288, in train
          loss.backward()
        File "/opt/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
          torch.autograd.backward(self, gradient, retain_graph, create_graph)
        File "/opt/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
          allow_unreachable=True)  # allow_unreachable flag
        File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
          self._queue_reduction(bucket_idx)
        File "/opt/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
          self.device_ids)
        TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
            1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

        Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x2b9afa20d3b0>, [[None, tensor([[[[-1.2661e-04, 2.7497e-04, 1.3505e-03, ..., 4.6058e-03, 3.8858e-03, 4.0022e-03],
        [ 1.4730e-03, 2.2759e-03, 5.1621e-03, ..., 4.3713e-03, 8.9991e-04, 2.5457e-03],

What does this error mean? How could I fix it?

ddeeppnneett · April 11, 2019, 11:31am

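There was an unused parameter in the model, i.e. one that never contributes to the loss. Its gradient stays None after backward(), which is the None at the start of the "Invoked with" list above, and _queue_reduction() only accepts tensors. Removing the unused parameter fixed the error.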
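
For anyone landing here with the same error: one way to locate the offending parameter is to run a single forward/backward pass on the plain (unwrapped) model and list the parameters whose `.grad` is still `None`. This is only a minimal sketch. `TinyNet` and `params_without_grad` are made-up names for illustration and are not part of the original code.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in model with one layer that forward() never calls."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # registered, but never used in forward()

    def forward(self, x):
        return self.used(x)


def params_without_grad(model):
    """Return names of trainable parameters that received no gradient.

    Call this after loss.backward() on the plain, unwrapped model.
    """
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = TinyNet()
loss = model(torch.randn(3, 4)).sum()
loss.backward()

unused = params_without_grad(model)
print(unused)  # ['unused.weight', 'unused.bias']
```

Any name printed here belongs to a parameter that did not take part in computing the loss, so it is a candidate for removal. As far as I know, newer PyTorch releases (1.1 and later) also accept `find_unused_parameters=True` in the `DistributedDataParallel` constructor, which may be an alternative when the parameter cannot simply be deleted.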